April 23, 2003 [posted]
Implementation of New Guidelines for the Structure and Nomenclature of Protein Concepts in MeSH
Over the past 23 years, individual proteins appearing in the literature were indexed using supplemental concept records (SCRs). During this period, the creation of a new protein SCR was linked to the first appearance of protein sequence data in an article cited in MEDLINE. The recent increase in published sequence data and the concurrent use of short acronym names for proteins have made it necessary to revise the MeSH protein thesaurus and develop a new system to accurately index and retrieve protein-related information.
Under the new guidelines, individual proteins are represented by SCRs, while descriptor records represent protein classes. An individual protein is defined in MeSH as a unique protein from a single species. Protein subunits, alternative mRNA splice variants and polymorphic variants of the same protein may be included as subordinate concepts within the same record. To avoid confusing proteins with similar or even identical names, the name of the protein is followed by the name of the organism from which it is derived. The preferred name of each protein is the approved name found in curated genome databases, followed by the organism name. The curated genome databases have rooted out duplicated and obsolete protein names, so combining a curated protein name with the organism name yields highly specific, non-duplicated protein terminology. Specific examples of re-formatted protein names are shown below.
Examples of Organism-Specific Proteins
To handle the vast number of new organism-specific proteins discussed in the literature effectively, the creation of new protein SCRs is limited to proteins from specific model organisms and to proteins of special biomedical importance, such as proteins directly involved in pathogenesis and those used as therapeutic agents or as diagnostic reagents. All other proteins will be represented by coordination of a MeSH protein class descriptor and a MeSH organism descriptor. The list of MeSH model organisms includes: human, mouse, rat, Drosophila, Xenopus, S cerevisiae, S pombe, E coli, Zebrafish, C elegans and Arabidopsis. In the future, additional model organism categories may be added to the protein SCRs. These categories will initially be represented by the existing organism-specific SCRs that were previously considered biomedically important, and could be supplemented by new SCRs created from protein information derived from authoritative sources.
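The creation rule described above can be sketched in Python. This is only an illustration under stated assumptions, not NLM's indexing software: the names `MODEL_ORGANISMS`, `scr_name`, and `index_protein`, the `special_importance` flag, and the returned dictionary layout are all hypothetical. The "name protein, organism" pattern is inferred from the TOA1 example in this article, and the second call uses a made-up protein, class, and organism purely as placeholders.

```python
# Model organisms as listed in this article (the set name is hypothetical).
MODEL_ORGANISMS = {
    "human", "mouse", "rat", "Drosophila", "Xenopus", "S cerevisiae",
    "S pombe", "E coli", "Zebrafish", "C elegans", "Arabidopsis",
}


def scr_name(protein, organism):
    """Curated protein name followed by the organism name.

    Assumed pattern, modeled on 'TOA1 protein, S cerevisiae'.
    """
    return f"{protein} protein, {organism}"


def index_protein(protein, organism, protein_class, special_importance=False):
    """Decide how a protein is represented under the new guidelines.

    special_importance stands in for an indexer's judgment that the protein
    is, for example, involved in pathogenesis or used as a therapeutic agent
    or diagnostic reagent.
    """
    if organism in MODEL_ORGANISMS or special_importance:
        # Eligible for its own organism-specific SCR.
        return {"kind": "SCR", "name": scr_name(protein, organism)}
    # Otherwise coordinate a protein class descriptor with an organism descriptor.
    return {
        "kind": "coordination",
        "class_descriptor": protein_class,
        "organism_descriptor": organism,
    }


# Example from the article: a yeast protein gets its own SCR.
toa1 = index_protein("TOA1", "S cerevisiae", "transcription factor TFIIA")

# Placeholder example: a protein from a non-model organism is coordinated.
other = index_protein("ExampleX", "Cattle", "Transcription Factors")
```

Here `toa1` describes an SCR named "TOA1 protein, S cerevisiae", while `other` falls back to the class-plus-organism coordination.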
The existing supplemental concept records for proteins are being revised to conform to these new guidelines. Current SCRs that represent individual proteins are being reformatted to include organism-specific protein terms and official preferred terms from curated databases. Non-specific terms are being removed. SCRs that represent a class of proteins are being promoted to MeSH descriptors, while specific proteins found in those records are being demoted to organism-specific SCRs. In addition, current SCRs that represent multiple proteins, or the same protein from multiple organisms, will be broken into individual protein SCRs.
In each case, appropriate maintenance will be performed on MEDLINE citations. Some of this maintenance was done for the 2003 system as part of year-end processing, some may occur throughout calendar year 2003, and the majority should be accomplished with the successful completion of year-end processing for the 2004 system. Some situations are straightforward: the old SCR in a MEDLINE Name of Substance element is simply replaced by the new form of the name. Other situations are complex and involve breaking up a single SCR that previously referred to multiple proteins into individual, organism-specific SCRs. These require search strategies against PubMed to isolate the citations that need to be updated with one or more of the new SCRs. For example, last year the SCR for transcription factor TFIIA was promoted to a MeSH Heading and two new SCRs were created: TOA1 protein, S cerevisiae and TOA2 protein, S cerevisiae. During year-end processing, these two searches were run against PubMed:
The new SCR of TOA1 protein, S cerevisiae was added as a new name of substance to the citations that were found by Search 1 while TOA2 protein, S cerevisiae was added as a new name of substance to the citations that were found by Search 2.
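The two kinds of citation maintenance described above can be sketched as follows. This is a minimal illustration, not the actual year-end processing: the function names, the dictionary of citations, the placeholder identifiers, the "old SCR name" and "new SCR name" strings, and the search hit sets are all hypothetical. The sketch takes the search results as given inputs and does not model what happens to the earlier substance entry once an SCR has been promoted.

```python
def replace_substance(citations, old_name, new_name):
    """Straightforward case: swap the old SCR name for its new form."""
    for substances in citations.values():
        for i, name in enumerate(substances):
            if name == old_name:
                substances[i] = new_name


def add_substance(citations, hits, new_scr):
    """Complex case: add a new SCR to each citation found by a search."""
    for cid in hits:
        substances = citations[cid]
        if new_scr not in substances:  # avoid duplicate entries
            substances.append(new_scr)


# Hypothetical citations keyed by placeholder identifiers, each holding its
# Name of Substance entries.
citations = {
    "cit-A": ["old SCR name"],
    "cit-B": [],
    "cit-C": [],
}

# Straightforward replacement of an old name by its new form.
replace_substance(citations, "old SCR name", "new SCR name")

# Hypothetical hit sets standing in for the results of Search 1 and Search 2.
search_1_hits = {"cit-B", "cit-C"}
search_2_hits = {"cit-C"}

add_substance(citations, search_1_hits, "TOA1 protein, S cerevisiae")
add_substance(citations, search_2_hits, "TOA2 protein, S cerevisiae")
```

A citation found by both searches, such as `cit-C` here, ends up carrying both new SCRs.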
Thus far, we have identified and revised over 10,000 organism-specific protein SCRs, of which 5,800 are from model organisms. Based on the current rate of editing, 70% of existing protein SCRs will have been revised by the release of 2004 MeSH. Upon completion of this project, we anticipate having approximately 30,000 organism-specific proteins represented as individual protein SCRs in MeSH.
By James M. Pash, Ph.D.
Pash JM. Implementation of New Guidelines for the Structure and Nomenclature of Protein Concepts in MeSH. NLM Tech Bull. 2003 Mar-Apr;(331):e10.