Table of Contents: 2015 MAY - JUNE No. 404

RefSeq Release 70 Now Available with Re-annotated Bacterial Genomes for Uniformity Across Genomes and Species

RefSeq Release 70 Now Available with Re-annotated Bacterial Genomes for Uniformity Across Genomes and Species. NLM Tech Bull. 2015 May-Jun;(404):b7.

2015 May 11 [posted]

[Editor's Note: This is a reprint of an announcement published on NLM/NCBI List ncbi-announce, an e-mail announcement list available from the NLM/NCBI. To subscribe to this list, please see the ncbi-announce -- NCBI announcements and updates page.]

The full Reference Sequence (RefSeq) release 70 is now available online, on the FTP site, and through NCBI's programming utilities, with 74,720,563 records describing 50,351,119 proteins, 11,310,700 RNAs, and sequences from 54,118 different organisms.

This release reflects a large update of complete bacterial RefSeq genomes, proteins, and Genes. To make genome annotation comparable across genomes and species, NCBI has re-annotated all RefSeq prokaryotic genomes using NCBI's genome annotation pipeline. Previously, the same gene in the same species, with an identical sequence across the gene's genomic region, could be annotated with a different protein simply because different annotation methods were used. Now, the same gene in the same species with the same sequence will be annotated with exactly the same protein in RefSeq.

In addition, each annotated CDS used to be tracked with a distinct RefSeq protein accession number. However, due to identical protein sequences being found on multiple re-annotated RefSeq genomes and extensive bacterial genome sequencing, the RefSeq prokaryotic protein dataset rapidly became very redundant. Rather than flood the protein database with thousands of completely identical proteins, NCBI has adopted the use of non-redundant WP proteins for RefSeq prokaryotic genomes annotated with NCBI pipelines, which we first announced in June 2013. Now, if the identical protein sequence appears on more than one RefSeq genome, NCBI simply reuses the existing WP accession number instead of creating a new accession for each new occurrence and genome. As a result, over 7 million proteins were removed, significantly reducing protein redundancy for the prokaryotic dataset. A removed accession report (release70.removed-records.gz) and a supplemental data mapping file (release70.bacterial-reannotation-report.txt.gz) are available in the release-catalog directory on FTP.
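
The accession-reuse idea described above can be illustrated with a minimal sketch. This is not NCBI's implementation: the class name, the method names, the genome labels, and the accession numbering are all hypothetical, and real WP accessions are assigned by NCBI. The sketch only shows the principle that an identical protein sequence found on a second genome receives the existing accession rather than a new one.

```python
class NonRedundantProteinRegistry:
    """Toy registry (hypothetical): one accession per distinct protein sequence."""

    def __init__(self):
        self._accession_by_sequence = {}  # protein sequence -> accession
        self._genomes_by_accession = {}   # accession -> set of genome labels
        self._next_number = 1

    def register(self, genome_id, protein_sequence):
        """Return the accession for a sequence, reusing it if already seen."""
        key = protein_sequence.upper()
        accession = self._accession_by_sequence.get(key)
        if accession is None:
            # Hypothetical numbering, for illustration only.
            accession = f"WP_{self._next_number:09d}.1"
            self._next_number += 1
            self._accession_by_sequence[key] = accession
            self._genomes_by_accession[accession] = set()
        self._genomes_by_accession[accession].add(genome_id)
        return accession

    def genomes_for(self, accession):
        """List the genomes on which the accession's sequence was annotated."""
        return sorted(self._genomes_by_accession[accession])


registry = NonRedundantProteinRegistry()
first = registry.register("genome_A", "MKTAYIAKQR")
second = registry.register("genome_B", "MKTAYIAKQR")  # identical sequence
third = registry.register("genome_B", "MKTAYIAKQL")   # differs by one residue
```

In this sketch `first` and `second` are the same accession, annotated on two genomes, while `third` receives a new one because its sequence differs.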

This is a first step toward managing data in a world where genomes are sequenced for assays, rather than to discover novel proteins. We appreciate that this is a new and major change for RefSeq prokaryotic genomes, but it is also a necessary one as the volume of disease-outbreak and other isolate sequencing continues to increase rapidly. For more information on changes to protein records, nucleotide records, the impact to NCBI Gene, and future plans, please see the latest story on NCBI News: http://www.ncbi.nlm.nih.gov/news/05-07-2015-refseq-release-70-reannotation.

NCBI has created documentation to explain these changes in detail:

RefSeq Re-annotation Project: An explanation of what the re-annotation project is, why and how it was done, and how we will facilitate your transition to the new annotation data can be found here: http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/reannotation/.
RefSeq non-redundant proteins: A description of this new protein record type with examples can be found here: http://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/.
Prokaryotic RefSeq Genomes: The prokaryotic RefSeq genomes policy, as well as definitions for reference genomes and representative genomes can be found here: http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/.
Prokaryotic annotation pipeline: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/.
Prokaryotic RefSeq FAQ: http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/faq/.
Supplemental data mapping file: An FTP file in the release-catalog directory (release70.bacterial-reannotation-report.txt.gz) has been prepared for re-annotated complete genomes that have recently transitioned to using the new non-redundant proteins. This file reports the old protein accession and GI, the annotated CDS coordinates, and the old locus_tag and NCBI GeneID values, and maps them to the current non-redundant protein accession and GI, the new locus_tag and NCBI GeneID (if available), and the current CDS annotation coordinates. It also indicates whether the original protein identically matches or is similar to the replacement non-redundant protein, or was dropped from the annotation.
Supplemental report of suppressed assemblies: An FTP file in the release-catalog directory (release70.addedQA-SuppressedAssemblies.txt) reports details for a subset of bacterial genomes that were suppressed in March 2015 following an expansion of QA metrics and subsequent to curatorial review. This report illustrates some of the reasons for suppression.
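
For readers who want to process the supplemental data mapping file listed above, the following is a rough sketch of reading a gzip-compressed, tab-delimited report and tallying how old proteins map to their replacements. The exact file layout is not described in this announcement, so the presence of a header line, the tab delimiter, the column names (old_accession, new_accession, status), and the status values are assumptions made for illustration; check the actual file before relying on any of them. The sample rows are invented placeholders, not real records.

```python
import csv
import gzip
import io
from collections import Counter


def read_report(stream):
    """Yield one dict per row of a gzipped, tab-delimited report.

    Assumes the first line is a header naming the columns; the real
    file layout may differ.
    """
    with gzip.open(stream, "rt", encoding="utf-8", newline="") as text:
        for row in csv.DictReader(text, delimiter="\t"):
            yield row


def tally(rows, column):
    """Count how often each value appears in the given column."""
    return Counter(row[column] for row in rows)


# Build a tiny in-memory stand-in for the report (placeholder data only).
sample = io.BytesIO()
with gzip.open(sample, "wt", encoding="utf-8", newline="") as handle:
    handle.write("old_accession\tnew_accession\tstatus\n")
    handle.write("old_1\tnew_1\tidentical\n")
    handle.write("old_2\tnew_1\tidentical\n")
    handle.write("old_3\tnew_2\tsimilar\n")
    handle.write("old_4\t\tdropped\n")
sample.seek(0)

counts = tally(read_report(sample), "status")
```

To read a downloaded copy instead, pass a file opened in binary mode, for example `open("release70.bacterial-reannotation-report.txt.gz", "rb")`, and substitute the column name that the real header uses.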

If you have further questions, or specific questions that are not addressed in the documentation, you can write to the Help Desk at info@ncbi.nlm.nih.gov or use the feedback form on the RefSeq page.