
Unified Medical Language System® (UMLS®)

2013AB UMLS NCBI Taxonomy Source Information

This page lists specific source data elements and provides information on their representation in the UMLS Metathesaurus.





Notes

Many concepts and terms from the NCBI Taxonomy are excluded during Metathesaurus source processing.  The criteria for determining which concepts and terms are excluded or retained are outlined below.  See the term type descriptions for additional information.

1.  Exclude all names that do not have one of the following name classes:
    scientific name
    synonym
    equivalent name
    common name
    authority

2.  Exclude all concepts below the "species" level in the hierarchy.  Selected concepts with a rank of "no rank" may be retained, depending on their hierarchical level.

3.  Exclude all concepts with a "division id" value of 11 (environmental samples) and their descendants.

4.  Exclude concepts and terms based on certain patterns, e.g., remove concepts that have rank = "species" and whose scientific name contains any of the words "uncultured," "clone," "unidentified," or "uncultivated."

5.   Exclude concepts with ugly names (e.g., "xxxx", "4").

6.   Exclude concepts and their children if the information is enclosed in single or double quotes.

7.   Exclude concepts starting with "other", "unclassified", "unclassified sequences", "artificial sequences", "insertion sequences", "midivariant sequence", "transposons" and all their children.

8.   Exclude concepts containing "?" and their children.
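The per-record parts of these criteria can be sketched in Python. This is a minimal illustration, not the actual UMLS processing code: the function name `is_excluded`, the whole-word matching for rule 4, the reading of rule 6 as "the name is wrapped in quotes", and the reading of rule 7 as "the name equals the phrase or starts with it followed by a space" are all assumptions. Rule 2, rule 5, and the "and their children" clauses depend on the hierarchy or on editorial judgment and are not modeled here.

```python
import re

# Rule 1: name classes that are retained.
RETAINED_NAME_CLASSES = {
    "scientific name", "synonym", "equivalent name", "common name", "authority",
}
# Rule 4: words that exclude a concept with rank "species".
SPECIES_STOP_WORDS = ("uncultured", "clone", "unidentified", "uncultivated")
# Rule 7: leading phrases that exclude a concept.
EXCLUDED_PREFIXES = (
    "other", "unclassified", "unclassified sequences", "artificial sequences",
    "insertion sequences", "midivariant sequence", "transposons",
)
# Rule 3: division id for environmental samples.
ENVIRONMENTAL_SAMPLES_DIVISION_ID = 11


def is_excluded(name, name_class, rank, division_id):
    """Return True if one name record fails rule 1, 3, 4, 6, 7, or 8."""
    lowered = name.lower()
    if name_class not in RETAINED_NAME_CLASSES:           # rule 1
        return True
    if division_id == ENVIRONMENTAL_SAMPLES_DIVISION_ID:  # rule 3
        return True
    if rank == "species" and any(                         # rule 4 (whole words, assumed)
        re.search(r"\b" + re.escape(word) + r"\b", lowered)
        for word in SPECIES_STOP_WORDS
    ):
        return True
    if len(name) >= 2 and name[0] in "'\"" and name[-1] == name[0]:  # rule 6 (assumed reading)
        return True
    if any(lowered == p or lowered.startswith(p + " ") for p in EXCLUDED_PREFIXES):  # rule 7
        return True
    return "?" in name                                    # rule 8
```

A full implementation would also propagate each exclusion to the descendants of the excluded concept, which requires the parent links from nodes.dmp.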

Summary of Changes:


1) Strings containing 'environmental sample' with a division_id = 11 (environmental sample) are not being processed.

2) New name class = 'type material' is not being processed.

Source-Provided Files: Summary


The complete NCBI release can be downloaded from the taxonomy ftp site:  ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Documentation and Reference

File Name | Description
readme.txt | README for file descriptions

Data Files

File Name | Description
citations.dmp* | Citations file (not processed)
delnodes.dmp* | Deleted nodes file (not processed)
division.dmp | Divisions file
gc.prt* | Genetic code table (not processed)
gencode.dmp* | Genetic codes file (not processed)
merged.dmp* | Merged nodes file (not processed)
names.dmp | Taxonomy names file
nodes.dmp | Taxonomy nodes file
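As a starting point for reading the processed files, here is a minimal Python sketch of a line parser. It relies on the delimiter convention described in NCBI's own readme.txt rather than on this page: fields are separated by tab, pipe, tab, and each row ends with tab, pipe. The function names `parse_dmp_line` and `read_dmp` are hypothetical.

```python
def parse_dmp_line(line):
    """Split one .dmp row into a list of field strings."""
    line = line.rstrip("\r\n")
    # Drop the row terminator (tab + pipe) if present.
    if line.endswith("\t|"):
        line = line[:-2]
    # Fields are separated by tab + pipe + tab.
    return [field.strip() for field in line.split("\t|\t")]


def read_dmp(lines):
    """Yield parsed rows from an iterable of lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_dmp_line(line)
```

For example, `read_dmp(open("names.dmp", encoding="utf-8"))` would yield one four-element list per name row, in the field order given under "file: names.dmp".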


Not included:

Selected files and fields are not processed.  In addition, certain concepts and terms are not included in the Metathesaurus based on the criteria described in the "Notes" section above.

Source-Provided Files: Details


Details on format of input files and representation of source data.

file: division.dmp


Divisions

# | Field Name | Description | Representation
1 | division id | taxonomy database division id | Used to map the "division id" field of nodes.dmp to the expanded value found in the "division name"
2 | division cde | GenBank division code (three characters) | not processed
3 | division name | division name | MRSAT.ATN = "DIV"
4 | comments | comments | not processed
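The lookup described for the "division id" field can be sketched as follows. This is an illustration only: the helper names `build_division_names` and `div_attribute` are hypothetical, and the sample rows are made-up stand-ins for parsed division.dmp rows, except that division id 11 corresponds to environmental samples as stated in the Notes.

```python
def build_division_names(division_rows):
    """Map division id (field 1) to division name (field 3)."""
    return {int(row[0]): row[2] for row in division_rows}


def div_attribute(node_division_id, division_names):
    """Return the (ATN, ATV) pair for a node's division id."""
    return ("DIV", division_names[node_division_id])


# Illustrative rows in division.dmp field order:
# division id, division cde, division name, comments.
sample_rows = [
    ["0", "BCT", "Bacteria", ""],
    ["11", "ENV", "Environmental samples", ""],
]
division_names = build_division_names(sample_rows)
attribute = div_attribute(0, division_names)
```

Here `attribute` holds the attribute name "DIV" together with the expanded division name, mirroring the MRSAT.ATN = "DIV" representation in the table.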

file: names.dmp


# | Field Name | Description | Representation
1 | tax_id | identifier of node associated with this name | MRCONSO.CODE, MRCONSO.SCUI
2 | name_txt | name itself | MRCONSO.STR
3 | unique name | unique variant of the name if not unique | MRCONSO.STR
4 | name class | type of name | Used to assign MRCONSO.TTY

Only the following name class values are included in the Metathesaurus:

scientific name
synonym
equivalent name
common name
authority

TTY values are assigned as follows:

name class | name_txt TTY | unique name TTY (if populated)
scientific name | SCN | USN
synonym | SY | USY
equivalent name | EQ | UE
common name | CMN | UCN
authority | AUN | UAUN


Atoms with other "name class" values are excluded during UMLS source processing.
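The TTY assignment above can be expressed as a small lookup. The following Python sketch is an assumption about how one might model it, not the UMLS implementation; the function name `atoms_for_name_row` and the dictionary-based atom records are hypothetical, and the example values are illustrative.

```python
# name class -> (TTY for name_txt, TTY for unique name), from the table above.
TTY_BY_NAME_CLASS = {
    "scientific name": ("SCN", "USN"),
    "synonym": ("SY", "USY"),
    "equivalent name": ("EQ", "UE"),
    "common name": ("CMN", "UCN"),
    "authority": ("AUN", "UAUN"),
}


def atoms_for_name_row(tax_id, name_txt, unique_name, name_class):
    """Return MRCONSO-like atom records for one names.dmp row."""
    ttys = TTY_BY_NAME_CLASS.get(name_class)
    if ttys is None:
        # Other name classes are excluded.
        return []
    name_tty, unique_tty = ttys
    atoms = [{"CODE": tax_id, "SCUI": tax_id, "STR": name_txt, "TTY": name_tty}]
    if unique_name:
        # The unique name becomes a second atom only when the field is populated.
        atoms.append(
            {"CODE": tax_id, "SCUI": tax_id, "STR": unique_name, "TTY": unique_tty}
        )
    return atoms


# Illustrative row: a scientific name with a populated unique name.
example_atoms = atoms_for_name_row("2", "Bacteria", "Bacteria <bacteria>", "scientific name")
```

With these inputs, `example_atoms` contains two records, one with TTY "SCN" and one with TTY "USN"; a row with name class "type material" would produce an empty list.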





file: nodes.dmp


# | Field Name | Description | Representation
1 | tax_id | node id in GenBank taxonomy database | Used to create the hierarchy. Also used to identify concepts to be excluded based on "rank": all concepts below the "species" level are excluded.
2 | parent_tax_id | parent node id in GenBank taxonomy database | Used to create the hierarchy. Also used to identify concepts to be excluded based on "rank": all concepts below the "species" level are excluded.
3 | rank | rank of this node (e.g., superkingdom, kingdom, etc.) | MRSAT.ATN = "RANK". Also used to identify concepts to be excluded based on "rank": all concepts below the "species" level are excluded.
4 | embl code | locus-name prefix | not processed
5 | division id | division id (see division.dmp file) | MRSAT.ATN = "DIV". The ATV is the value of the division name for this division id, from division.dmp.
6 | inherited div flag | 1 if node inherits division from parent | not processed
7 | genetic code id | see gencode.dmp file | not processed
8 | inherited GC flag | 1 if node inherits genetic code from parent | not processed
9 | mitochondrial genetic code id | see gencode.dmp file | not processed
10 | inherited MGC flag | 1 if node inherits mitochondrial gencode from parent | not processed
11 | GenBank hidden flag | 1 if name is suppressed in GenBank entry lineage | not processed
12 | hidden subtree root flag | 1 if this subtree has no sequence data yet | not processed
13 | comments | free text comments and citations | not processed
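One way to picture how tax_id, parent_tax_id, and rank combine to exclude concepts below the "species" level is the Python sketch below. It assumes that "below species" means "has an ancestor whose rank is species", which is an interpretation rather than something this page states; it also does not model the selective retention of "no rank" concepts. The function name `below_species` and the sample nodes are hypothetical.

```python
def below_species(tax_id, parent_of, rank_of):
    """Return True if any proper ancestor of tax_id has rank 'species'."""
    seen = {tax_id}
    current = parent_of.get(tax_id)
    # Walk up the parent links; stop at a missing parent or a self-parented root.
    while current is not None and current not in seen:
        if rank_of.get(current) == "species":
            return True
        seen.add(current)
        current = parent_of.get(current)
    return False


# Illustrative nodes as (tax_id, parent_tax_id, rank); the root is its own parent.
sample_nodes = [
    (1, 1, "no rank"),
    (100, 1, "genus"),
    (200, 100, "species"),
    (300, 200, "subspecies"),
]
parent_of = {tax_id: parent for tax_id, parent, _ in sample_nodes}
rank_of = {tax_id: rank for tax_id, _, rank in sample_nodes}

excluded = [t for t, _, _ in sample_nodes if below_species(t, parent_of, rank_of)]
```

In this sample only node 300 ends up in `excluded`, because its parent (node 200) has rank "species"; the species node itself and everything above it are kept.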