Exercise 2: Learn about the pathogen's genome

Task: Find a genome record for the pathogen and access related genomic sequence data

To perform helpful bioinformatic analysis for your research, it is imperative to find and download high-quality sequence data. NCBI accepts nucleotide sequence data from research labs all over the world, but we also have a curation group that uses these submissions and published information to create high-quality sequences and datasets for use in computational work.

You will search the Genome database to find a record for the pathogen, identify a genome sequence with assembly/annotation statistics (details in the Assembly database), and learn how to download information for your pathogen such as genome, transcriptome and proteome sequences.

Background

Where do the nucleotide sequences at NCBI come from?

Primary Nucleotide Sequence Repositories of the International Nucleotide Sequence Database Collaboration (INSDC) (https://www.insdc.org/):

These databases accept, store and share "primary sequences" - nucleotide sequences that have been identified and submitted by researchers (submitters), who still “own” them.

The databases contain records of varying quality for both their metadata (descriptive information) and the sequences.
They also sometimes contain highly redundant sets of data, provided independently by anywhere from a few to hundreds of research labs.

NCBI has developed the Reference Sequences (RefSeq) Project (https://www.ncbi.nlm.nih.gov/refseq/) to create/curate high-quality nucleotide and protein sequences from submitted data supplemented with information from peer-reviewed, published literature. The data is produced and therefore “owned” and updated by NCBI.

The project aims to:

Provide reference standards
Provide records that represent all molecules in the central dogma:
- Eukaryotes: genomic, mRNA & ncRNA, proteins
- Prokaryotes and Viruses: genomic, ncRNA & protein (no mRNA records)

These records are given distinct accessions that begin with a two-letter prefix followed by an underscore (_):

genomic: NC_, AC_, NG_, NZ_
RNA: NM_, NR_, XM_, XR_
protein: NP_ (YP_), XP_, *WP_

*NOTE on the WP_ accession:

A developing issue in GenBank and now in RefSeq: We have over 200,000 RefSeq bacterial assemblies, and many of them have identical protein sequences, which produces highly redundant protein records.
A Solution: Make one copy of a “shared protein sequence” to link all annotations in the Identical Proteins Group database.
For example: The gene carbapenem-hydrolyzing class A beta-lactamase is annotated on more than 4,700 bacterial genomic assemblies. Its encoded protein is included in the Identical Proteins Group (IPG) database and reported with the accession: WP_004199234.1, MULTISPECIES Taxonomic Group carbapenem-hydrolyzing class A beta-lactamase KPC-2 [Bacteria]
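
The prefix list above can be turned into a small lookup, which is handy when sorting a mixed list of accessions. This is only a minimal sketch: the function name and the two placeholder accessions (NC_000000.1 and AB123456.1) are made up for illustration, and the category labels are simply the ones used in this exercise.

```python
# Prefix table copied from the accession list in this exercise.
REFSEQ_PREFIXES = {
    "NC_": "genomic", "AC_": "genomic", "NG_": "genomic", "NZ_": "genomic",
    "NM_": "RNA", "NR_": "RNA", "XM_": "RNA", "XR_": "RNA",
    "NP_": "protein", "YP_": "protein", "XP_": "protein", "WP_": "protein",
}


def classify_refseq_accession(accession):
    """Return 'genomic', 'RNA' or 'protein' for a listed RefSeq prefix.

    Returns None when the first three characters are not in the table,
    for example for a primary (INSDC-style) accession.
    """
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix)


# WP_004199234.1 comes from the note above; the other two are placeholders.
for acc in ["WP_004199234.1", "NC_000000.1", "AB123456.1"]:
    print(acc, "->", classify_refseq_accession(acc))
```

A prefix that is missing from the table returns None rather than a guess, so accessions from the primary repositories are easy to separate from RefSeq ones.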

RefSeq Genome data at NCBI

Viruses may have one or more reference genomes per species, and the chosen assemblies are based on the designated exemplar(s) of the International Committee on Taxonomy of Viruses (ICTV).

Prokaryotes may have more than one reference or representative genome per species.

RefSeq reference genomes are selected based on assembly and annotation quality, existing experimental support, and recognition as a community standard (ex: Escherichia coli str. K-12 substr. MG1655) or of clinical importance (ex: Escherichia coli O157:H7 str. Sakai or Mycobacterium tuberculosis H37Rv).
RefSeq representative genomes are assigned to type strain assemblies when there is no current reference genome, or to another assembly that is scientifically significant and shows strong sequence diversity compared with the assigned reference genome(s) (ex: Mycobacterium avium subsp. paratuberculosis K-10 or Streptococcus thermophilus JIM 8232).

Eukaryotes (incl. fungi & helminths) have no more than one reference or representative genome per species.

RefSeq reference genomes are selected based on assembly and annotation quality, existing experimental support, and recognition as a community standard or of clinical importance (ex: Aspergillus fumigatus Af293).
If there are no assemblies in RefSeq for a particular eukaryotic species, then RefSeq will select a representative genome from the highest quality GenBank assembly (ex: Schistosoma mansoni ASM23792v2).

For more information: https://www.ncbi.nlm.nih.gov/assembly/help/

Key NCBI Resources for this Exercise

NCBI Genome database - a catalog of species-level genome-specific information. It includes information and links to related data as curated by the RefSeq team and generated by annotation pipelines. The RefSeq Group handles viral & bacterial genome data differently from data for eukaryotic organisms, such as fungi and humans.

NCBI (Genome) Assembly database - a repository of genome sequence assemblies with information about submitters, statistics and links to the actual sequences.

Your Turn: Learn about the pathogen's genome!

Use the name of your patient's pathogen to begin your search following the steps below.

Click below if you need a hint on what organism you found:

Identified viral isolate

Answer: Measles morbillivirus is the infectious viral isolate.

Identified bacterial isolate

Answer: Salmonella enterica (Typhimurium) is the infectious bacterial isolate.

Identified fungal isolate

Answer: Candida auris is the infectious fungal isolate.

Find the Genome record for your pathogen

Go to the Genome database homepage (https://www.ncbi.nlm.nih.gov/genome) and begin typing the name of the pathogen in the search text box.
As you type, you'll see "autocomplete" display some names to help you!
(This is based on information in our Taxonomy database - which we'll discuss in exercise 3.)
Click on the name of the pathogen you are looking for.
You may get directly to a Genome record page for an organism or to a list of possible organisms. If you get the list, click on the name of the one you'd like to focus on.
If you need it, you can click here to get to a link for the pathogen Genome record page.

At the top of this Genome record page is a box of quick links to the data that users have told us they most want.

What types of data can you get here?
For example, click on "tabular". What is this data?

The links in this box make it quick and easy to download genome and proteome sequence data for a reference genomic assembly or get a metadata table of gene information.

NOTE: There is a new resource (currently in "beta") which will eventually take over this role of providing quick access to bulk genome-related datasets for download. We'll cover this resource in the "New Resource: NCBI Datasets" section.

Learn more about the reference or representative genomic assembly

In the box at the top of the Genome record page, click on the link to the Reference Genome (or Representative, if there is no reference one). This will take you to a Genome database Assembly and Annotation report.

What can you learn about this genome?

In the Summary section, click on the hyperlinked Assembly Accession - to get to the NCBI (Genome) Assembly record page for even more information.

Find key information for relevant genomes and assemblies

Go back to that Genome summary record.....

If you need it, click here to get the link.

For your virus:

1. In the “Discovery column” on the right - click "other genomes for this species" and it will take you to the Nucleotide database.

You can download some interesting datasets by clicking on "Send to" in the upper-right hand corner of the page.

For your bacteria or fungus:

1. In the “Discovery column” on the right - click "Assembly" and it will take you to the NCBI (Genome) Assembly database again, retrieving a large number of assemblies for your pathogen!

- Use the filters on the left to refine the list to the characteristics that you want (such as Assembly level: "Complete genome").

- You can click on any of these to learn more details about the assembly.

2. A new effort at NCBI aims to let you simply and quickly extract just the datasets you want from one assembly or from many of them. Click on the blue "Download Assemblies" button to see what is available to download.

NOTE: This "blue" button was created as part of the development of a new resource (currently in "beta") which will eventually take over this role of providing quick access to bulk genome-related datasets, including groups of assemblies. We'll cover this resource in the "New Resource: NCBI Datasets" section.
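
The filtered Assembly search described above can also be written as an Entrez Programming Utilities (EUtils) query. The sketch below only builds the search URL and does not contact NCBI. The function name, the `retmax` default and the filter term `"complete genome"[filter]` are assumptions made for illustration (the term is believed to correspond to the Assembly level: "Complete genome" filter), so check them against the Assembly help page before relying on them.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base URL of the EUtils "esearch" service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_assembly_search_url(organism, complete_only=True, retmax=20):
    """Build an esearch URL that looks for assemblies of one organism."""
    term = f'"{organism}"[Organism]'
    if complete_only:
        # Assumed equivalent of the "Complete genome" filter on the web page.
        term += ' AND "complete genome"[filter]'
    params = {"db": "assembly", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_assembly_search_url("Candida auris")
print(url)
# Decode the query string again to show the search term in readable form.
print(parse_qs(urlparse(url).query)["term"][0])
```

Opening the printed URL in a browser, or fetching it from a script, should return the matching Assembly record identifiers.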

Learn about genome annotations in the graphical viewer

Okay, one last time.....Go back to that Genome summary record.....

If you need it, click here to get the link.

1. Click on the RefSeq accession to go to the Nucleotide record for this sequence.
  How can you download this sequence?

2. Click on "Graphics" to see a pictorial representation of the reference sequence.

- Viral sequences are small and bacterial genomes are larger, but both can be visualized in this "Sequence Viewer" graphical display.
- Fungi often have several chromosomes that make up their genome, and you can only see 1 chromosome at a time. Each chromosome can be explored in this display or in the newer and more functional Genome Data Viewer (GDV) tool.
Things you can do with this genome browser:
- You can zoom in to a region to see the annotations more clearly, or you can search with an annotated accession or the name of a feature, such as the name of a gene (such as H in the virus or gyrA in the bacteria).
- Clicking the colorful button next to the "ATG" button will quickly toggle display of the gene product (ex: protein) bars.
- If you move your cursor over the label above the displayed bars, a pop-up window with annotation and links to more information will appear.
- There’s a whole lot of documentation and several tutorials to help you learn how to use this viewer. To start, click on the “?” icon in the upper right-hand corner of the viewer.

What kind of information can you access about a particular gene or protein?
How can you quickly find more?
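
Besides the download options on the Nucleotide record page, a sequence can be requested as FASTA text through EUtils. The sketch below builds an efetch URL for the Nucleotide database and parses FASTA text into a dictionary. It does not contact NCBI. The function names, the placeholder accession NC_000000.1 and the sample record are made up for illustration, so replace the accession with the RefSeq accession from your own Genome record.

```python
from urllib.parse import urlencode

# Base URL of the EUtils "efetch" service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession):
    """Build an efetch URL that returns one Nucleotide record as FASTA."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def parse_fasta(text):
    """Parse FASTA text into a dict of {header line: sequence}."""
    chunks = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is not None:
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Placeholder accession; use the one shown on your Genome record page.
print(build_fasta_url("NC_000000.1"))

# Made-up FASTA text standing in for a downloaded record.
sample = ">EXAMPLE_0001.1 made-up record\nACGTAC\nGGTA\n"
records = parse_fasta(sample)
for name, seq in records.items():
    print(name, len(seq))
```

Keeping URL building separate from parsing means the parser can be tried on a file saved from the "Send to" menu before any scripted download is attempted.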

Take-away message!

Need a good quality genome sequence or dataset? Search the NCBI Genome database for a RefSeq Reference or Representative Genome record.
To gather all genomic assemblies and annotations or learn detailed information about each, try the NCBI (Genome) Assembly database.
Visualize your chosen pathogen genome assembly with its annotations!
- Viral and bacterial genomes can be visualized in the "Sequence Viewer"
- Eukaryotic chromosomes (such as fungal sequences) can be visualized either in the "Sequence Viewer" or the more fully-featured "Genome Data Viewer".

For more advanced work....

Working with genome browsers

NCBI YouTube Tutorial Playlist for Sequence Viewer
NCBI YouTube Tutorial Playlist for Genome Data Viewer

Annotate & submit your own genome data to NCBI!

Prokaryotic Annotation Pipeline (PGAP) - this is available through a web interface, can be downloaded from GitHub or used in "the cloud"!
Viral Annotation DefineR (VADR) - an application for viral annotations, downloadable from GitHub or usable in "the cloud".
NCBI Submission Portal - the best place to go to learn about and begin to submit data to NCBI!

Working with APIs or accessing data via command-line or scripting? Try these!

Entrez Programming Utilities (EUtils) - the NCBI-wide set of APIs for accessing and downloading NCBI data
Entrez Direct (EDirect) - the NCBI-wide command-line tool for accessing and downloading NCBI data
NCBI Datasets - a new, quick dataset resource with its own command-line tool as well as programming utilities (APIs and Python- & R-related resources)
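
As a small taste of scripting with EUtils: when an esearch request is made with retmode=json, the reply carries the hit count and the record identifiers under an "esearchresult" key. The sketch below reads those two fields from a made-up reply. The function name and the sample identifiers are hypothetical, and a real reply contains more fields than are shown here.

```python
import json


def extract_uids(esearch_json_text):
    """Return (total hit count, list of record UIDs) from an esearch JSON reply."""
    result = json.loads(esearch_json_text)["esearchresult"]
    return int(result["count"]), list(result["idlist"])


# Made-up reply for illustration; the UIDs are not real records.
sample_response = '{"esearchresult": {"count": "2", "retmax": "2", "idlist": ["111", "222"]}}'
total, uids = extract_uids(sample_response)
print(total, uids)
```

The identifiers returned this way can then be passed to other EUtils calls, such as esummary or efetch, to retrieve the records themselves.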

Last Reviewed: August 5, 2022