NCBI Gene: Finding the Reference Sequences

This tutorial shows how to use the NCBI Gene database to find the reference genomic, transcript and protein sequences for a gene given the gene name or symbol and species.

Example Scenario

One of my collaborators sent me this DNA sequence, which I’ve already run through BLAST. I think they gave me the wrong sequence.

How can I get the right sequence for the MC1R gene?

Step 1: Search for the Gene and Organism

Reference Sequences (or "RefSeqs") are standard sequences, curated by NCBI. When someone asks for the right sequence, they mean a standard or reference sequence.

One way you can find a standard genomic (DNA), transcript (mRNA) or protein sequence is to use the NCBI Gene database.

In the Gene database search box, enter:

human[orgn] AND mc1r[sym]

This will take you directly to the one matching record for this gene.
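The same query can also be sent programmatically through the NCBI E-utilities `esearch` endpoint. The following is a minimal sketch, assuming Python's standard library only; the helper names `build_gene_search_url` and `search_gene` are hypothetical and not part of any NCBI tool.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(organism: str, symbol: str) -> str:
    """Build an esearch URL for the Gene database (hypothetical helper)."""
    # Same fielded query as typed into the Gene search box.
    term = f"{organism}[orgn] AND {symbol}[sym]"
    params = {"db": "gene", "term": term, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def search_gene(organism: str, symbol: str) -> List[str]:
    """Return the Gene IDs matching the query (requires network access)."""
    with urlopen(build_gene_search_url(organism, symbol)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Only prints the URL; call search_gene("human", "mc1r") to run the query.
    print(build_gene_search_url("human", "mc1r"))
```

If the query is as specific as the one above, the returned list would be expected to hold a single Gene ID.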

Step 2: Find the RefSeqs

You are now looking at a Gene record for the "MC1R melanocortin 1 receptor" gene in humans. This record organizes information about the gene in one place.

On the right side of the page is the Table of Contents. Click on the NCBI Reference Sequences (RefSeq) link.

Looking at the section of the Gene record for NCBI Reference Sequences (RefSeqs), note that there are two types of RefSeqs:

  • RefSeqs maintained independently of Annotated Genomes: Here you can view the gene, mRNA or protein sequence as isolated "snippets" of nucleotides or amino acids.
  • RefSeqs of Annotated Genomes: Here you can view the gene sequence as part of an assembly of the entire chromosome.

The genome assembly is the genome sequence produced after chromosomes have been fragmented, those fragments have been sequenced, and the resulting sequences have been put back together.

We don’t often look at records for entire chromosomes. More often we work with the smaller, independently maintained records, where the annotations can be viewed directly in a web browser.

Recall our patron's request for "the right sequence for the MC1R gene."

It’s not clear from the person’s question if they want the genomic (DNA), mRNA transcript, or protein sequence.

The DNA sequence is accessible from the GenBank link, which opens the NG_012026.1 record.

The mRNA transcript sequence can be obtained by following the NM_002386.4 link.

The protein sequence can be obtained by clicking on the NP_002377.4 link.
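For readers who prefer to retrieve these three records in a script, the sketch below uses the E-utilities `efetch` endpoint to request FASTA text. It assumes Python's standard library; the helper names and the `MC1R_REFSEQS` table are hypothetical, and `parse_fasta` handles a single-record FASTA file only.

```python
from typing import Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accessions from the MC1R Gene record, paired with the Entrez database
# that holds each record type (nucleotide records vs. protein records).
MC1R_REFSEQS = {
    "genomic": ("nuccore", "NG_012026.1"),
    "mrna": ("nuccore", "NM_002386.4"),
    "protein": ("protein", "NP_002377.4"),
}


def build_fasta_url(database: str, accession: str) -> str:
    """Build an efetch URL that requests one record as FASTA text."""
    params = {"db": database, "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(database: str, accession: str) -> str:
    """Download the FASTA text for one accession (requires network access)."""
    with urlopen(build_fasta_url(database, accession)) as response:
        return response.read().decode("utf-8")


def parse_fasta(text: str) -> Tuple[str, str]:
    """Split single-record FASTA text into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0][1:]  # drop the leading '>'
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


if __name__ == "__main__":
    for kind, (database, accession) in MC1R_REFSEQS.items():
        print(kind, build_fasta_url(database, accession))
```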

Let's explore the RefSeqGene NG_012026.1 record.

Click the GenBank link to view the reference sequence for NG_012026.1.

Strings of nucleotides don't come numbered in nature, and each genome assembly might number the sequences differently. We therefore need some kind of reference for where a sequence "begins" or "ends."

Regardless of the specific chromosomal location according to the current assembly, the RefSeqGene record is going to start in the same place in the sequence relative to the gene. The RefSeqGene record contains sequences for the annotated portion of the gene as well as 5,000 bases upstream and 2,000 bases downstream.

The RefSeqGene record is always going to start at "1." In our example for MC1R, the gene feature will start at "5984." Note the section on the right of the screen labeled "Change region shown."

This indicates that you are viewing only a portion of the RefSeqGene record.

The extra flanking sequence on either side of the gene provides access to potential regulatory regions and allows room for expansion of the gene boundaries.

Sometimes parts of other genes are included in the RefSeqGene record (if they lie within the 5,000 bases upstream or the 2,000 bases downstream).
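Because record coordinates start at 1 and include both endpoints, a position such as 5984 sits at index 5983 in a zero-based language like Python. The sketch below shows one way to extract a region using record-style coordinates; `slice_region` is a hypothetical helper and the sequence is a toy string, not real MC1R data.

```python
def slice_region(sequence: str, start: int, stop: int) -> str:
    """Return bases start..stop using 1-based, inclusive coordinates."""
    if start < 1 or stop > len(sequence) or start > stop:
        raise ValueError(f"invalid region {start}..{stop} "
                         f"for a sequence of length {len(sequence)}")
    # Shift the start by one for Python's zero-based indexing; the stop
    # needs no shift because Python slices exclude the end index.
    return sequence[start - 1:stop]


if __name__ == "__main__":
    toy_record = "AAACCCGGGTTT"  # toy data for illustration only
    print(slice_region(toy_record, 4, 9))
```

Checking the bounds up front avoids Python's silent truncation of out-of-range slices, which could otherwise hide an off-by-one mistake.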

Now use the link from More about the MC1R gene in the right column to return to the Gene record.

Return to the Reference Sequence links by using the same link in the Table of Contents, NCBI Reference Sequences (RefSeq).

We've now looked at the RefSeqGene record for the human MC1R gene (NG_012026.1).

Another option is to look at the genomic sequence in the context of the entire assembly.

On the Gene record, you'll find this after the mRNA transcript and protein accession number links. You should see "Reference GRCh38.p14 Primary Assembly."

The Reference Primary Assembly is generated and controlled by the Genome Reference Consortium (GRC).

GRCh38 is the current assembly that is descended from the original, publicly funded Human Genome Project sequence (1990-2003). The "h" in the assembly name stands for human. The number "38" is the build number.
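The naming pattern can be made concrete with a small parser. This is a sketch under stated assumptions: `parse_assembly_name` is a hypothetical helper, the species table covers only the "h" for human described above, and the reading of ".p14" as patch release 14 is standard GRC usage rather than something this tutorial states.

```python
import re
from typing import Dict, Optional, Union

# Only the species letter covered in the text; the table is deliberately
# incomplete and would need extending for other organisms.
SPECIES_LETTERS = {"h": "human"}

# Assumed pattern: "GRC" + species letter + build number + optional ".pN".
ASSEMBLY_PATTERN = re.compile(r"^GRC([a-z])(\d+)(?:\.p(\d+))?$")


def parse_assembly_name(name: str) -> Dict[str, Union[str, int, None]]:
    """Split a GRC assembly name into species, build and patch number."""
    match = ASSEMBLY_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised GRC assembly name: {name!r}")
    letter, build, patch = match.groups()
    patch_number: Optional[int] = int(patch) if patch is not None else None
    return {
        "species": SPECIES_LETTERS.get(letter, "unknown"),
        "build": int(build),
        "patch": patch_number,
    }


if __name__ == "__main__":
    print(parse_assembly_name("GRCh38.p14"))
```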

You can check the NCBI Datasets Genome page for Homo sapiens to confirm the latest GRCh assembly. Once a new genome assembly is released, it can take some time for NCBI resources to be updated with the new genome assembly. As of January 2025, the newest assembly is GRCh38.p14 (released in February 2022).

Click on GRCh38.p14 to view the NCBI data by that assembly. Scroll down to find the Chromosome RefSeq records.

Notice the RefSeq accessions for the genome assembly begin with NC_, which identifies the record as being for a chromosome. The RefSeqGene accession is NG_, which identifies the record as being for a genomic (but not chromosomal) sequence.
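The accession prefixes seen in this tutorial can be summarized in a small lookup. The sketch below is illustrative only: `classify_refseq` is a hypothetical helper, and the table lists just the four prefixes that appear in this tutorial, not the full set that RefSeq defines.

```python
# Prefixes encountered in this tutorial; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NC_": "chromosome (genome assembly)",
    "NG_": "genomic region (RefSeqGene)",
    "NM_": "mRNA transcript",
    "NP_": "protein",
}


def classify_refseq(accession: str) -> str:
    """Describe a RefSeq accession by its three-character prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


if __name__ == "__main__":
    for accession in ("NG_012026.1", "NM_002386.4", "NP_002377.4"):
        print(accession, "->", classify_refseq(accession))
```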

Conclusion

Congratulations! You should now be able to find the reference genomic, transcript and protein sequences for a gene, given the gene name or symbol and species.

For additional help, visit NCBI Gene Help Resources.
