Unified Medical Language System® (UMLS®)

2015AB UMLS NCBI Taxonomy Source Information

This page lists specific source data elements and provides information on their representation in the UMLS Metathesaurus.


VSAB: NCBI2015_03_23

Notes

Many concepts and terms from the NCBI Taxonomy are excluded during Metathesaurus source processing.  The criteria for determining which concepts and terms are excluded or retained are outlined below.  See the term type descriptions for additional information.

1.  Exclude all names that do not have one of the following name classes:
    scientific name
    synonym
    equivalent name
    common name
    authority

2.  Exclude all concepts below the "species" level in the hierarchy.  Selected concepts with a rank of "no rank" may be retained, depending on their hierarchical level.

3.  Exclude all concepts with a "division id" value of 11 (environmental samples) and their descendants.

4.  Exclude concepts and terms based on certain patterns, e.g., remove concepts that have rank = "species" and contain any of the following words in the scientific name: "uncultured," "clone," "unidentified," "uncultivated."

5.  Exclude concepts with ugly names (e.g., "xxxx", "4").

6.  Exclude concepts and their children if the name is enclosed in single or double quotes.

7.  Exclude concepts starting with "other", "unclassified", "unclassified sequences", "artificial sequences", "insertion sequences", "midivariant sequence", or "transposons", and all their children.

8.  Exclude concepts containing "?" and their children.
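
Several of the criteria above operate on a single name record and can be sketched as a filter. The following is a minimal illustration in Python, not the actual Metathesaurus processing code. The function name is_excluded and the constant names are hypothetical, and the matching details are assumptions: whole-word, case-insensitive matching for criterion 4 and case-insensitive prefix matching for criterion 7. Criteria that depend on the hierarchy (2, 3, and the "children" clauses) and criterion 5 are not modeled here.

```python
import re

# Name classes retained under criterion 1.
RETAINED_NAME_CLASSES = {
    "scientific name", "synonym", "equivalent name", "common name", "authority",
}

# Words listed in criterion 4 (checked only for species-rank scientific names).
SPECIES_STOPWORDS = ("uncultured", "clone", "unidentified", "uncultivated")

# Prefixes listed in criterion 7 ("unclassified sequences" is already
# covered by the shorter prefix "unclassified").
EXCLUDED_PREFIXES = (
    "other", "unclassified", "artificial sequences", "insertion sequences",
    "midivariant sequence", "transposons",
)


def is_excluded(name: str, name_class: str, rank: str) -> bool:
    """Return True if one name record matches criterion 1, 4, 6, 7, or 8."""
    lowered = name.lower()
    if name_class not in RETAINED_NAME_CLASSES:              # criterion 1
        return True
    if rank == "species" and name_class == "scientific name":
        words = re.findall(r"[a-z]+", lowered)
        if any(word in SPECIES_STOPWORDS for word in words):  # criterion 4
            return True
    if len(name) >= 2 and name[0] in "'\"" and name[-1] == name[0]:
        return True                                          # criterion 6
    if lowered.startswith(EXCLUDED_PREFIXES):                # criterion 7
        return True
    if "?" in name:                                          # criterion 8
        return True
    return False
```

For example, is_excluded("Homo sapiens", "scientific name", "species") returns False, while is_excluded("uncultured bacterium", "scientific name", "species") returns True under these assumptions.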

Summary of Changes:

None

Source-Provided Files: Summary

The complete NCBI release can be downloaded from the taxonomy ftp site:  ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Documentation and Reference

File Name | Description
readme.txt | README for file descriptions

Data Files

File Name | Description
citations.dmp* | Citations file (not processed)
delnodes.dmp* | Deleted nodes file (not processed)
division.dmp | Divisions file
gc.prt* | Genetic code table (not processed)
gencode.dmp* | Genetic codes file (not processed)
merged.dmp* | Merged nodes file (not processed)
names.dmp | Taxonomy names file
nodes.dmp | Taxonomy nodes file


Not included: Selected files and fields are not processed.  In addition, certain concepts and terms are not included in the Metathesaurus based on the criteria described in the "Notes" section above.

Source-Provided Files: Details

Details on format of input files and representation of source data.

file: division.dmp

Divisions

# | Field Name | Description | Representation
1 | division id | taxonomy database division id | Used to map the "division id" field of nodes.dmp to the expanded value found in the "division name" field
2 | division cde | GenBank division code (three characters) | not processed
3 | division name | division name | MRSAT.ATN = "DIV"
4 | comments | comments | not processed
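
As an illustration of the "division id" to "division name" mapping, the sketch below reads division.dmp records into a lookup table. It assumes the delimiter convention described in the NCBI readme.txt, where fields are separated by a tab, a vertical bar, and a tab, and each record ends with a tab and a vertical bar. The function names parse_dmp_line and load_division_names are hypothetical, and the sample records are illustrative rather than copied from a release.

```python
from typing import Dict, Iterable, List


def parse_dmp_line(line: str) -> List[str]:
    """Split one .dmp record into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]          # drop the record terminator
    return line.split("\t|\t")    # split on the field separator


def load_division_names(lines: Iterable[str]) -> Dict[int, str]:
    """Map division id (field 1) to division name (field 3)."""
    names: Dict[int, str] = {}
    for line in lines:
        if not line.strip():
            continue              # skip blank lines
        fields = parse_dmp_line(line)
        names[int(fields[0])] = fields[2]
    return names


# Illustrative records in the assumed format.
sample = [
    "0\t|\tBCT\t|\tBacteria\t|\t\t|\n",
    "11\t|\tENV\t|\tEnvironmental samples\t|\tillustrative comment\t|\n",
]
divisions = load_division_names(sample)
```

The same parse_dmp_line helper would apply to names.dmp and nodes.dmp if they follow the same delimiter convention.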

file: names.dmp

# | Field Name | Description | Representation
1 | tax_id | identifier of node associated with this name | MRCONSO.CODE, MRCONSO.SCUI
2 | name_txt | name itself | MRCONSO.STR
3 | unique name | unique variant of the name if not unique | MRCONSO.STR
4 | name class | type of name | Used to assign MRCONSO.TTY

Only the following name class values are included in the Metathesaurus:

scientific name
synonym
equivalent name
common name
authority

TTY values are assigned as follows:
name class | name_txt TTY | unique name TTY (if populated)
scientific name | SCN | USN
synonym | SY | USY
equivalent name | EQ | UE
common name | CMN | UCN
authority | AUN | UAUN


Atoms with other "name class" values are excluded during UMLS source processing.
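
The TTY assignment can be expressed as a lookup keyed on name class. The sketch below is only an illustration of the table above. The names TTY_BY_NAME_CLASS and atoms_for_name are hypothetical, and representing an atom as a (CODE, STR, TTY) tuple is a simplification of the MRCONSO row.

```python
from typing import List, Tuple

# name class -> (TTY for name_txt, TTY for unique name)
TTY_BY_NAME_CLASS = {
    "scientific name": ("SCN", "USN"),
    "synonym": ("SY", "USY"),
    "equivalent name": ("EQ", "UE"),
    "common name": ("CMN", "UCN"),
    "authority": ("AUN", "UAUN"),
}


def atoms_for_name(
    tax_id: str, name_txt: str, unique_name: str, name_class: str
) -> List[Tuple[str, str, str]]:
    """Return simplified (CODE, STR, TTY) tuples for one names.dmp record."""
    ttys = TTY_BY_NAME_CLASS.get(name_class)
    if ttys is None:
        return []                                   # name class not retained
    atoms = [(tax_id, name_txt, ttys[0])]
    if unique_name:                                 # only if field 3 is populated
        atoms.append((tax_id, unique_name, ttys[1]))
    return atoms
```

A record with an empty "unique name" field yields one atom; a record with a populated "unique name" yields two, and a record with any other name class yields none.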

file: nodes.dmp

# | Field Name | Description | Representation
1 | tax_id | node id in GenBank taxonomy database | Used to create the hierarchy. Also used to identify concepts to be excluded based on "rank": all concepts below the "species" level are excluded.
2 | parent_tax_id | parent node id in GenBank taxonomy database | Used to create the hierarchy. Also used to identify concepts to be excluded based on "rank": all concepts below the "species" level are excluded.
3 | rank | rank of this node (e.g., superkingdom, kingdom) | MRSAT.ATN = "RANK". Also used to identify concepts to be excluded based on "rank": all concepts below the "species" level are excluded.
4 | embl code | locus-name prefix | not processed
5 | division id | division id (see division.dmp file) | MRSAT.ATN = "DIV". The ATV is the value of the division name for this division id, from division.dmp.
6 | inherited div flag | 1 if node inherits division from parent | not processed
7 | genetic code id | see gencode.dmp file | not processed
8 | inherited GC flag | 1 if node inherits genetic code from parent | not processed
9 | mitochondrial genetic code id | see gencode.dmp file | not processed
10 | inherited MGC flag | 1 if node inherits mitochondrial gencode from parent | not processed
11 | GenBank hidden flag | 1 if name is suppressed in GenBank entry lineage | not processed
12 | hidden subtree root flag | 1 if this subtree has no sequence data yet | not processed
13 | comments | free text comments and citations | not processed
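
One way to identify concepts below the "species" level from the tax_id, parent_tax_id, and rank fields is to walk each node's ancestor chain and check for a species-rank ancestor. The sketch below is an assumption about how this could be done, not a description of the actual processing. The function name below_species and the sample ids are hypothetical, the root node is assumed to list itself as its own parent, and the special handling of "no rank" concepts mentioned in the Notes is not modeled.

```python
from typing import Dict, Set


def below_species(parent_of: Dict[str, str], rank_of: Dict[str, str]) -> Set[str]:
    """Return tax_ids that have a species-rank node among their ancestors."""
    result: Set[str] = set()
    for tax_id in parent_of:
        seen = {tax_id}
        current = parent_of[tax_id]
        # Walk upward until the root (its own parent) or a missing parent.
        while current not in seen:
            if rank_of.get(current) == "species":
                result.add(tax_id)
                break
            seen.add(current)
            current = parent_of.get(current, current)
    return result


# Hypothetical four-node hierarchy: root > genus > species > subspecies.
parent_of = {"1": "1", "10": "1", "20": "10", "30": "20"}
rank_of = {"1": "no rank", "10": "genus", "20": "species", "30": "subspecies"}
excluded = below_species(parent_of, rank_of)
```

In this hypothetical hierarchy only node "30" is flagged, because it is the only node with a species-rank ancestor; the species node itself is retained.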