BLAST Databases

NCBI makes a large number of databases available through the BLAST web service. You'll use only a few of them today. You select databases using the pull-down lists and radio buttons in the 'Choose Search Set' section of the BLAST submission form. Both the nucleotide page and the protein page have standard and experimental databases.

The nucleotide and protein pages have default databases called nr/nt and nr, respectively. Other databases include useful subsets of these as well as separate databases with different content. Today you'll search the default databases as well as the NCBI RefSeq subsets. You'll also search a whale genome assembly that's available through a separate Genome BLAST page.

Note: The sequence entries in the BLAST databases are non-redundant. (This is what the database name nr stands for.) Non-redundant means that identical sequences are represented by a single entry in the database. In the case of protein sequences, sometimes hundreds of sequences may be collapsed into a single entry. BLAST provides a way to access all of the identical proteins for a non-redundant sequence in the BLAST database.
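The idea of a non-redundant entry can be illustrated with a minimal Python sketch. It is only an illustration of the concept, not how NCBI builds its databases; the function name `collapse_identical` and the toy identifiers and sequences are made up for this example.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (identifier, sequence) pairs so each distinct sequence appears once."""
    ids_by_sequence = defaultdict(list)
    for identifier, sequence in records:
        # Identical sequences share one key, so they collapse into one entry.
        ids_by_sequence[sequence.upper()].append(identifier)
    return dict(ids_by_sequence)


# Hypothetical records: two species carry the same protein sequence.
records = [
    ("protA_species1", "MKTAYIAKQR"),
    ("protA_species2", "MKTAYIAKQR"),
    ("protB_species1", "MSLLTEVETY"),
]

nonredundant = collapse_identical(records)
for sequence, identifiers in nonredundant.items():
    print(sequence, identifiers)
```

Three input records produce two entries here. The list of identifiers kept with each entry plays the role of the identical proteins that BLAST lets you access for a single non-redundant sequence.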

Nucleotide

Standard Nucleotide Databases

The default database (nr/nt) contains traditional GenBank and RefSeq RNA sequences.
- less comprehensive than protein nr
- no RefSeq genome sequences and no eukaryotic genome assemblies
- no high-throughput sequences (whole genome shotgun, next-gen reads from the Sequence Read Archive)
Subsets of nr/nt
- RefSeq RNA
  - predicted transcripts from the eukaryotic genome annotation pipeline
  - high quality NCBI-curated transcript sequences
- RefSeq Select
  - selected curated transcripts for mouse and human genes
  - one transcript per gene.
- RefSeq Representative Genomes
  - Highest quality NCBI-curated RefSeq genome assemblies for a taxon.
  - Example: the human GRCh38 reference genome assembly.

Experimental Nucleotide Databases: Organism-based nt databases

The nr/nt database is very large (96 million sequences and 1.3 trillion total bases) and growing rapidly. Its size makes it difficult to manage and maintain, both for NCBI and for outside users who run BLAST locally.

We're experimenting with ways to divide nr/nt into more manageable subsets. One strategy is to divide it by organism (taxonomy) into four databases:

- Eukaryota nt (77 million sequences and 836 billion bases)
- Prokaryota nt (8.4 million sequences and 207 billion bases)
- Viruses nt (10 million sequences and 217 billion bases)
- Others nt (919 thousand sequences and 1.9 billion bases)

Others nt includes metagenomic and artificial sequences.
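As a quick consistency check, the sizes quoted for the four organism-based databases can be added up and compared with the totals given for nr/nt. The short Python sketch below uses only the rounded figures from this page; the variable names are illustrative.

```python
# (sequences, bases) for each organism-based database, as quoted on this page.
subsets = {
    "Eukaryota nt": (77_000_000, 836_000_000_000),
    "Prokaryota nt": (8_400_000, 207_000_000_000),
    "Viruses nt": (10_000_000, 217_000_000_000),
    "Others nt": (919_000, 1_900_000_000),
}

total_sequences = sum(sequences for sequences, _ in subsets.values())
total_bases = sum(bases for _, bases in subsets.values())

print(f"{total_sequences / 1e6:.1f} million sequences")
print(f"{total_bases / 1e12:.2f} trillion bases")
```

The sums come to roughly 96.3 million sequences and 1.26 trillion bases, in line with the rounded nr/nt totals of 96 million sequences and 1.3 trillion bases.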

These databases are on the web and available to download. We are interested in feedback on how useful they are.

Protein

Standard Protein Databases

The default protein database nr contains nearly all protein sequences available at NCBI.
- no patent sequences
- no wgs metagenomes and no transcriptome shotgun assembly proteins
- includes proteins from outside protein-only sources that are also available as separate databases
Subsets of nr
- RefSeq Protein
  - coding region translations from RefSeq transcripts and RefSeq genome annotations
- RefSeq Select Protein
  - coding region translations from mouse and human RefSeq Select transcripts, plus coding regions from RefSeq reference and representative prokaryotic genomes

Experimental Protein Databases: ClusteredNR

The protein nr database is also large (595 million sequences and 234 billion residues) and growing rapidly. In addition, certain kinds of proteins and certain groups of organisms, such as mammals and their associated bacteria, are overrepresented and will dominate protein searches. This makes it difficult to identify more distant homologs in other species.

One strategy to collapse some of the over-representation is to form clusters at less than 100% identity.

The experimental ClusteredNR database has clusters whose members are 90% identical and within 90% of each other's sequence length. This provides access to more distant matches and reduces the computational burden of searching the entire nr, since BLAST searches only the representative sequence for each cluster. ClusteredNR currently has 286 million sequences and 94 billion total residues, about 40% of the size of the full nr by residue count. You'll use this database to expand the results for a blastx search and as a way of incorporating more distant matches in a PSI-BLAST search.
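The clustering idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not NCBI's actual clustering pipeline: the `identity` function is a crude stand-in for alignment-based percent identity (it uses the standard library's `difflib`), the length rule is interpreted as shorter length divided by longer length, and the greedy longest-first strategy, function names, and sequences are all made up for the example.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude stand-in for alignment identity: matched residues / longer length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(a), len(b))


def length_ratio(a, b):
    """Shorter length divided by longer length (assumed reading of the 90% rule)."""
    return min(len(a), len(b)) / max(len(a), len(b))


def greedy_cluster(sequences, min_identity=0.9, min_length_ratio=0.9):
    """Assign each sequence to the first representative it matches, longest first."""
    clusters = []  # each item: (representative, [members])
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if (length_ratio(representative, seq) >= min_length_ratio
                    and identity(representative, seq) >= min_identity):
                members.append(seq)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


# Hypothetical toy proteins.
toy_proteins = [
    "MKTAYIAKQRQISFVKSHFSRQ",  # becomes a representative
    "MKTAYIAKQRQISFVKSHFSRK",  # one substitution: joins the first cluster
    "MKTAYIAKQR",              # too short relative to the first representative
    "GGGGGGGGGGGGGGGGGGGGGG",  # unrelated sequence
]

clusters = greedy_cluster(toy_proteins)
representatives = [rep for rep, _ in clusters]
print(len(toy_proteins), "sequences ->", len(representatives), "representatives")
```

A search would then compare the query against the representatives only, which is why a clustered database is smaller to search. The near-identical second protein is still reachable as a member of the first cluster.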

Last Reviewed: July 10, 2023