Citation Maintenance tasks in XML format - 2009

 

1. The need for citation maintenance - MeSH changes

The Global Citation Maintenance (GCM) data in XML format makes available the annual changes which are made by NLM in the MeSH indexing of citations in PubMed and distributed MEDLINE. Users of other systems that use MeSH for subject indexing may also find the GCM data helpful for their indexed documents, but they must be aware of relevant differences from the NLM database. For example, the searches required by manual tasks are specific to PubMed syntax.

The MeSH vocabulary is updated annually. The primary goal of citation maintenance is to ensure that the existing MeSH indexing of the citations is consistent with the current version of the MeSH vocabulary while retaining the intent of the existing indexing. Changes in MeSH which may impact citations are: (a) deletions of MeSH headings, and (b) changes in the preferred term of a MeSH heading. Indexing terms which have been deleted or replaced in the MeSH vocabulary must themselves be removed or replaced in the citation in order to remain consistent with MeSH. Citation maintenance is concerned with how to appropriately replace the old reference.

Citations or other documents indexed with MeSH terms are usually indexed by the MeSH term or the MeSH Unique Identifiers (UIs) which refer to a MeSH vocabulary record. The GCM data are intended to provide sufficient information to allow systems using either terms or UIs to be updated correctly.

In the past, the MeSH Section has made available lists of Deleted Headings (deleted Descriptor records) and Replaced Headings (changes in Descriptor preferred terms). However, no information has been available for changes in Supplementary Concept Records (SCRs), nor has more detailed record information, such as the unique identifier, been included.

Citation maintenance is accomplished by "tasks" - database transactions which make a specific change in the indexing of a set of citations. See section 3 for types of tasks, section 6 for a detailed description of task elements. One of the essential features of executing these tasks is the relative order among tasks of different types as well as the order of required citation queries. See section 4 for a fuller account of the sequence of tasks and queries. A chart is also available which represents the maintenance procedure graphically.

2. Availability

GCM data represent annual changes in the MeSH vocabulary which are available in MEDLINE by January of each year. Annual changes in Supplementary Concept Records (SCRs), especially changes affecting Descriptors, are included in the data, though SCR changes made regularly throughout the year are not currently included.

3. Types of maintenance tasks

For a detailed explanation of the GCM file format, see sections 5 and 6, below. The format of the task data, and how the data are to be used, depend on the type of task, as explained in the following.

3.1 Updating the indexing - the MeSH preferred term and UI

Indexing with MeSH headings consists in the assignment to a citation of a reference to a MeSH Descriptor, Qualifier, or Supplementary Concept Record (SCR). The reference may be either: (a) the preferred term in the record, for example, 'Heart Arrest', or (b) an alphanumeric unique identifier (UI) for the MeSH record, for example, 'D006323'. Citations in NLM's Medline XML, for example, use the preferred term in the <MeshHeading>, <NameOfSubstance>, and <QualifierName>. Other systems may index with only the MeSH UI and not the MeSH term. To accommodate both types of indexing, the GCM data include both a MeSH UI and the corresponding preferred term for every update action.

Specific "tasks" or transactions are created to change the MeSH indexing in a citation. A task either: (a) replaces an existing MeSH reference with another, (b) adds a reference, or (c) deletes a reference.
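
As a rough illustration of the three actions, the sketch below applies a task to the indexing of one citation, modeled here as a Python set of MeSH UIs. The Task class and the apply_task function are hypothetical names introduced for this example and are not part of the GCM distribution. The UIs are those of the Heart Arrest example above and the Electrostatics example in section 3.2.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Task:
    """Hypothetical in-memory form of one maintenance task."""
    action: str                  # "Replace", "Add", or "Delete"
    existing_ui: Optional[str]   # None when action is "Add"
    new_ui: Optional[str]        # None when action is "Delete"


def apply_task(indexing: Set[str], task: Task) -> Set[str]:
    """Return a new set of MeSH UIs with the task applied."""
    if task.action not in ("Replace", "Add", "Delete"):
        raise ValueError(f"unknown action: {task.action!r}")
    result = set(indexing)
    if task.action == "Add":
        result.add(task.new_ui)
    elif task.existing_ui in result:
        # Replace and Delete only touch citations carrying the old reference.
        result.discard(task.existing_ui)
        if task.action == "Replace":
            result.add(task.new_ui)
    return result


citation = {"D019312", "D006323"}
replace = Task("Replace", "D019312", "D055672")
updated = apply_task(citation, replace)
```

A citation that does not carry the existing reference is returned unchanged, which is the behavior assumed here for Replace and Delete.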

3.2 Main types of tasks

Maintenance tasks are divided into three categories that reflect the source of the task. This affects the order in which the task is executed and its scope.

  • Preferred Term changes

    When the preferred term in a MeSH record has changed, indexing by MeSH term must be replaced by the new preferred term. For example, in 2009 MeSH the preferred term for the heading Interferon Type II was changed to Interferon-gamma. This is essentially a name change and is usually the most transparent of indexing changes.

    Preferred term tasks are applied to every citation in the database and always replace an existing preferred term with a different preferred term.

  • "Automatic" tasks - algorithmic replacements

    When a MeSH record is deleted, references to the record are usually replaced with references to a different MeSH record. For example, in 2009 MeSH the Descriptor record for Electrostatics (UI = D019312) was deleted. Existing citation references were replaced with references to another record Static Electricity (UI = D055672). These tasks are called automatic because the replacement is determined by algorithm, though the replacement is originally specified by the MeSH subject specialist when the MeSH record is deleted.

    Automatic tasks are applied to every citation in the database and either replace an existing value with a new value, or delete the old value altogether.

    Note that the result of applying Automatic tasks is that every MeSH record referenced in the citations is valid in the New MeSH year. Combined with the application of Preferred Term changes, the result is that all citation references to MeSH records are valid MeSH terms or UIs for the New MeSH year. (Assuming that citation references prior to maintenance were valid for the previous MeSH year.)

  • "Manual" tasks - case by case changes, requiring a search

    This type of task is called "Manual" because a MeSH specialist determines the proper maintenance on a case-by-case basis. Manual tasks are often used to refine the results of a previously-run Automatic task. For this reason, Manual tasks must be run after Automatic tasks. (Thus a Manual task may apply to data introduced by a previously run Automatic task.)

    While Automatic and Preferred Term tasks are applied to every citation in the database, Manual tasks apply only to citations identified by searches in GCM_SEARCH.XML. A Manual task may replace an existing value with a new value, but may also just add a value or just delete a value. Manual tasks are not essential for preserving valid MeSH references, but they are necessary for preserving the intent of the existing indexing.
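
The difference in scope between the three categories can be sketched as follows. This is only an illustration: the function name citations_in_scope, the citation identifiers, and the shape of the saved search results (a mapping from task ID to a set of citation identifiers) are assumptions made for the example.

```python
from typing import Dict, List, Set


def citations_in_scope(
    task_source_type: str,
    task_id: str,
    all_citation_ids: List[str],
    saved_search_results: Dict[str, Set[str]],
) -> List[str]:
    """Return the citations a task should be applied to."""
    if task_source_type in ("Automatic", "PrefTerm"):
        # Applied to every citation in the database.
        return list(all_citation_ids)
    if task_source_type == "Manual":
        # Applied only to citations retrieved by the task's search.
        hits = saved_search_results.get(task_id, set())
        return [cid for cid in all_citation_ids if cid in hits]
    raise ValueError(f"unknown TaskSourceType: {task_source_type!r}")


all_ids = ["101", "102", "103"]                # made-up citation identifiers
saved = {"M1107": {"102"}}                     # made-up search result
automatic_scope = citations_in_scope("Automatic", "A2", all_ids, saved)
manual_scope = citations_in_scope("Manual", "M1107", all_ids, saved)
```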

4. Order of tasks and queries

The order in which the tasks and queries must be performed can be critical because a task or query may be affected by a previous task. This is especially true when the indexing is done with MeSH terms rather than by MeSH Unique Identifiers (UIs), since terms may be changed without a change in UI.

4.1 Queries for Manual tasks are run before maintenance.

Whether indexing by MeSH term or UI, if Manual tasks are to be used, the queries for the Manual tasks must be independent of later maintenance tasks. This is because the queries used to restrict the application of Manual tasks refer to MeSH terms in the previous year's MeSH and so could be affected by either the Automatic tasks or Preferred Term changes implemented after the queries are formulated. So the queries must be independent of these changes. There are at least two ways to do this. NLM uses the first method.

  1. Save citation identifiers for later Manual tasks.

    One way to implement this is to save the citation identifiers which are retrieved by the search, mapped to a given <MTaskID>. These may range in number from a handful to hundreds. Then when Manual tasks are run, they apply to the citation identifiers associated with that <MTaskID>. NLM uses this method.
  2. Preserve parallel unmaintained citations for Manual tasks.

    An alternative is to create two copies of the citation database - the first of which is not maintained, and the second of which is maintained. Then run the search statements for Manual tasks against the first, non-maintained, database, but apply the maintenance to the second, maintained database. This obviates the need to create special storage for citation references, but requires a duplicate database.
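
The first method might be sketched as below. The function save_manual_task_hits, the toy search function, and the citation identifiers are all hypothetical; in particular, the pairing of the task ID with this query is invented for illustration, and a real system would run each search against its own unmaintained citation database.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def save_manual_task_hits(
    searches: Iterable[Tuple[str, str]],
    run_search: Callable[[str], Iterable[str]],
) -> Dict[str, Set[str]]:
    """Run every Manual-task search before maintenance and keep the hits.

    `searches` yields (MTaskID, query) pairs; `run_search` is whatever
    function the local system uses to retrieve citation identifiers.
    """
    saved: Dict[str, Set[str]] = {}
    for task_id, query in searches:
        saved.setdefault(task_id, set()).update(run_search(query))
    return saved


# Toy stand-in for a search against the unmaintained database.
TOY_INDEX = {"biota [nm] AND MEDLINE [sb]": ["11111", "22222"]}


def toy_search(query: str) -> Iterable[str]:
    return TOY_INDEX.get(query, [])


saved_hits = save_manual_task_hits(
    [("M1107", "biota [nm] AND MEDLINE [sb]")], toy_search
)
```

The saved mapping is then consulted when the Manual tasks are finally run, so the searches never see data already changed by Automatic or Preferred Term tasks.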

4.2 Automatic tasks

Automatic tasks are the principal maintenance tasks and the first tasks to be done. Manual tasks are run after the Automatic and Preferred Term tasks because the Manual tasks are written to supplement or adjust those results. The order among Automatic tasks does not matter, since one Automatic task cannot impact another Automatic task: the maintained-to Descriptor cannot be a deleted record.

4.3 Preferred Term tasks - run after Automatic tasks but before Manual tasks

When updating indexing by term, rather than indexing by UI, it is possible for a Preferred Term task to impact an Automatic task. Therefore, Preferred Term tasks must be run after Automatic tasks.

However, Manual tasks are written with the expectation that Automatic and Preferred Term tasks have already been run. Therefore, Preferred Term tasks must be completed before Manual tasks.

As noted earlier, changes in the MeSH preferred term are implemented only for systems that index by MeSH term rather than MeSH Unique Identifier (UI). However, systems that index with MeSH UI must have available a database of MeSH terms for the new MeSH year in order to display or otherwise produce the appropriate preferred term.

4.4 Manual tasks - run after Preferred Term tasks

Manual tasks are usually created to supplement Automatic tasks. They are therefore written with the assumption that the Automatic tasks have already run, and are always run against the citation database after the Automatic tasks. For similar reasons Manual tasks are run after Preferred Term tasks.

4.5 Summing up the order of processing

The following table summarizes the steps required for updating a term-indexed database. The processing will be the same for UI-indexed databases except that step (3) - PrefTerm tasks - will not be applicable. A chart is also available which represents the maintenance procedure graphically.

Process | Description | Sequence
--- | --- | ---
1. Queries for Manual tasks | Retrieve sets of citations to be used to specify the range of Manual tasks to be run later. | Query results must be obtained first since later maintenance could impact the queries, written for the previous year's MeSH.
2. Automatic tasks | Replace all references to deleted MeSH records with references to other MeSH records. | Must be run before Manual tasks since Manual tasks are written to supplement Automatic tasks.
3. PrefTerm tasks | Replace MeSH preferred term with a different preferred term. | Must be run after Automatic tasks to avoid impacting these tasks.
4. Manual tasks | Supplement Automatic tasks, usually by adding additional references. Applied to citations previously obtained by query. | Must be run after Automatic tasks, applied to citations identified earlier by queries for each Manual task.

The <Sequence> element in the GCM XML is designed to ensure this order, as well as the order among Manual tasks.
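
A minimal sketch of ordering tasks by <Sequence> before running them is shown below. It assumes each task has already been read into a dictionary keyed by element name; the function name order_tasks and the task IDs P12 and M1100 are made up for the example.

```python
from typing import Dict, List


def order_tasks(tasks: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort tasks by their <Sequence> value, lowest first.

    sorted() is stable, so tasks sharing a Sequence value keep the
    order in which they appeared in the input.
    """
    return sorted(tasks, key=lambda task: int(task["Sequence"]))


tasks = [
    {"MTaskID": "M1107", "Sequence": "4"},
    {"MTaskID": "P12", "Sequence": "2"},
    {"MTaskID": "A2", "Sequence": "1"},
    {"MTaskID": "M1100", "Sequence": "3"},
]
ordered_ids = [task["MTaskID"] for task in order_tasks(tasks)]
```

The value is converted with int() so that a sequence of 10 sorts after 9 rather than after 1, as it would if compared as text.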

5. Files

GCM data are distributed in two files.

  • GCM.XML. The main file includes a list of every maintenance task, with the old and new values, MeSH UI, etc. See below for a more detailed description of the elements.

  • GCM_SEARCH.XML. Some maintenance tasks apply only to a specified subset of the database and so they require a search description that narrows the scope of the task. This file is a list of the searches (in PubMed format) for each of the Manual tasks.

In practice the file names will reflect the MeSH year of annual changes. So, for example, for 2009 MeSH, the files will be GCM2009.XML and GCM_SEARCH2009.XML.

The XML structure for GCM.XML is relatively simple, with only two element levels and two attributes. See the GCM2009.DTD and sample GCM2009.XML file. The GCM_SEARCH.XML file is even simpler, with a task ID mapping the search to the corresponding task in the GCM.XML file. See GCM_SEARCH2009.DTD and sample GCM_SEARCH2009.XML file. See also the more detailed data element descriptions for both sets of files, below.
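
As an illustration of how simple the structure is, the following sketch reads tasks with Python's standard xml.etree.ElementTree module. The embedded sample is hand-built from the element names listed in section 6 and is not copied from a distributed file; the exact nesting should be checked against the DTD, and the pairing of task ID A2 with the Electrostatics replacement is invented. The function name read_tasks is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hand-built sample; element order and nesting are assumptions.
SAMPLE = """
<CitMaintTaskSet>
  <CitMaintTask Action="Replace" TaskSourceType="Automatic">
    <MTaskID>A2</MTaskID>
    <MeSHYear>2009</MeSHYear>
    <ExistingMeSHUI>D019312</ExistingMeSHUI>
    <NewMeSHUI>D055672</NewMeSHUI>
    <ExistingMeSHPrefTerm>Electrostatics</ExistingMeSHPrefTerm>
    <NewMeSHPrefTerm>Static Electricity</NewMeSHPrefTerm>
    <ExistingMeSHRecType>DESCRIPTOR</ExistingMeSHRecType>
    <NewMeSHRecType>DESCRIPTOR</NewMeSHRecType>
    <Sequence>1</Sequence>
  </CitMaintTask>
</CitMaintTaskSet>
"""


def read_tasks(xml_text: str) -> List[Dict[str, str]]:
    """Flatten each <CitMaintTask> into a dictionary of strings."""
    root = ET.fromstring(xml_text.strip())
    tasks = []
    for node in root.iter("CitMaintTask"):
        task = dict(node.attrib)          # Action and TaskSourceType
        for child in node:
            task[child.tag] = (child.text or "").strip()
        tasks.append(task)
    return tasks


tasks = read_tasks(SAMPLE)
```

For a file on disk, ET.parse(path).getroot() could replace the ET.fromstring call.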

Data are encoded in UTF-8 format. Currently the data are also compatible with 7-bit ASCII encoding.

Files are also available for all MeSH records in XML format. Medline and other NLM data in XML format are also available.

6. XML elements

The following two tables list each XML element and attribute for the two files, with a brief description. Following the tables, there is a more discursive description of the elements, including examples in XML format.

6.1 Synopsis of XML elements

The following is a list of GCM elements in tabular format, with a brief description of each.

GCM.XML

Element/attribute | Value Range | Description
--- | --- | ---
CitMaintTaskSet | | Set of all tasks. Root element.
CitMaintTask | | Specific task to replace, add, or delete indexing data.
/Action | Replace, Add, Delete | Nature of the change to the citation.
/TaskSourceType | Manual, Automatic, PrefTerm | Process by which task was created.
MTaskID | M..., A...., P.... | Unique identifier for the task. Leading alphabetic, remainder numeric.
MeSHYear | (YYYY) | Year when annual MeSH changes first appear in January.
ExistingMeSHUI | D......, C......, Q...... | UI of the MeSH record reference being replaced or deleted. Null when Action is Add. Same value as NewMeSHUI for PrefTerm change.
NewMeSHUI | D......, C......, Q...... | UI of the MeSH record reference replacing the old value, or being added. Null when Action is Delete. Same value as ExistingMeSHUI when only preferred term being changed. May include attached Qualifier UI.
ExistingMeSHPrefTerm | (string) | Preferred term for ExistingMeSHUI.
NewMeSHPrefTerm | (string) | Preferred term for NewMeSHUI.
ExistingMeSHRecType | DESCRIPTOR, SCR, QUALIFIER |
NewMeSHRecType | DESCRIPTOR, SCR, QUALIFIER |
MajorTopicYN | Y, N | New value may be marked as the major topic of the citation.
Sequence | (positive integer) | Order in which tasks must be run.

GCM_SEARCH.XML

Element/attribute | Value Range | Description
--- | --- | ---
CitMaintSearchSet | | Set of all searches for Manual tasks. Root element.
CitMaintSearch | | Information needed to identify the search which is needed to apply a Manual task in GCM.XML.
MTaskID | M..., A...., P.... | Maps search to Manual task in GCM.XML having the same <MTaskID>.
MeSHYear | (YYYY) | Year when annual MeSH changes first appear in January. Not the MeSH year of the MeSH terms in the search, which is one year previous to <MeSHYear>.
SearchPubMed | (free text) | Search limiting application of a Manual task. Must be run prior to any maintenance. Plus signs in the qualified portion of the search are used during automated searching of PubMed. To search in PubMed manually, the plus signs must be replaced by spaces.
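
The plus-sign substitution described for <SearchPubMed> is a plain string replacement, sketched below. The function name for_manual_search is hypothetical, and the sketch assumes that no plus sign in the search is meant literally.

```python
def for_manual_search(search_pubmed: str) -> str:
    """Replace plus signs with spaces for pasting into PubMed by hand."""
    return search_pubmed.replace("+", " ")


manual_query = for_manual_search("biota [nm] AND+MEDLINE+[sb]")
```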

6.2 Alphabetic List of XML elements

The following are the elements in the two XML files.

Action
Description: Nature of the change to the citation. One of the following: Replace, Add, Delete.
Example:

  <CitMaintTask Action="Replace" TaskSourceType="Automatic">

Subelement of: n/a; attribute of <CitMaintTask>
In file: GCM.XML.
Required element: yes

<CitMaintSearch>
Description: Information needed to apply a citation search to a given Manual task. Used to restrict the application of a Manual task to a given set of citations. The search applies to the Manual task in GCM.XML which has the same <MTaskID>.
Subelement of: <CitMaintSearchSet>.
In file: GCM_SEARCH.XML.
Required element: yes

<CitMaintSearchSet>
Description: Set of all <CitMaintSearch> elements in GCM_SEARCH.XML. Root element.
Subelement of: none; this is the root element of the GCM_SEARCH.XML.
In file: GCM_SEARCH.XML.
Required element: yes

<CitMaintTask>
Description: Transaction consisting of all the information needed to change an instance of MeSH-indexing in a citation record.
Subelement of: <CitMaintTaskSet>
In file: GCM.XML.
Required element: yes

<CitMaintTaskSet>
Description: The set of all <CitMaintTask> elements in the GCM.XML file
Subelement of: none; this is the root element of the GCM.XML.
In file: GCM.XML.
Required element: yes

<ExistingMeSHPrefTerm>
Description: Preferred term in MeSH for <ExistingMeSHUI>. Null when Action is Add. Critical for PrefTerm changes, may be redundant for Automatic and Manual changes. May be the same as <NewMeSHPrefTerm> in the same task when TaskSourceType is Manual or Automatic.
Example:

  <ExistingMeSHPrefTerm>Aborigines</ExistingMeSHPrefTerm>

Subelement of: <CitMaintTask>
In file: GCM.XML.
Required element: no

<ExistingMeSHRecType>
Description: The MeSH record type of the <ExistingMeSHUI>. One of the following: DESCRIPTOR, QUALIFIER, SCR. Null when Action is Add. Redundant in that the record type may be inferred from the initial character of <ExistingMeSHUI> (D, Q, C). Designed to make it easier for users of XML to extract actions pertaining to only one record type. May be different from <NewMeSHRecType> in the same task.
Example:

<ExistingMeSHRecType>SCR</ExistingMeSHRecType>

Subelement of: <CitMaintTask>
In file: GCM.XML.
Required element: no
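
The inference of record type from the initial character of the UI, mentioned above, can be sketched in a few lines. The names RECORD_TYPES and record_type are hypothetical.

```python
RECORD_TYPES = {"D": "DESCRIPTOR", "Q": "QUALIFIER", "C": "SCR"}


def record_type(ui: str) -> str:
    """Infer the MeSH record type from the first character of a UI."""
    try:
        return RECORD_TYPES[ui[0]]
    except (IndexError, KeyError):
        raise ValueError(f"not a recognized MeSH UI: {ui!r}") from None


scr_type = record_type("C039562")
```

A system could use such a check to confirm that the record type elements agree with the UIs in the same task.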

<ExistingMeSHUI>
Description: UI of the MeSH record reference being replaced or deleted. Matches the seven-character string in a <DescriptorUI>, <SupplementalRecordUI>, or <QualifierUI>. Null when Action is Add. Same value as <NewMeSHUI> in the same task for PrefTerm change. Not necessarily a value in the previous year of MeSH; it could be an intermediate value in the maintenance process.
Example:

<ExistingMeSHUI>C039562</ExistingMeSHUI>

Subelement of: <CitMaintTask>
In file: GCM.XML.
Required element: no

<MajorTopicYN>
Description: Medline indexing includes an optional indicator for Descriptors representing a main point of a citation. So in a maintenance task which adds a reference to a citation (Add or Replace), the new reference may be marked as a major topic of the citation by a "Y" value. (Cf. Medline MajorTopicYN, which is an attribute of the <DescriptorName>, rather than a separate element. The GCM.XML uses a separate element for the MajorTopicYN rather than making it an attribute of two elements - the <NewMeSHPrefTerm> and the <NewMeSHUI>.)
Example:

<MajorTopicYN>Y</MajorTopicYN>

Subelement of: <CitMaintTask>
In file: GCM.XML.
Required element: no

<MeSHYear>
Description: Year when annual MeSH changes first appear in January. All "new" data in the XML will be consistent with MeSH data in that <MeSHYear>. In the GCM_SEARCH.XML it has this meaning as well and does not mean the MeSH year of the MeSH terms in the <SearchPubMed> element, which will be the year prior to the <MeSHYear>.
Example:

  <MeSHYear>2005</MeSHYear>

Subelement of: <CitMaintTask>
In file: GCM.XML, GCM_SEARCH.XML.
Required element: yes

<MTaskID>
Description: Unique identifier for each <CitMaintTask>. For PrefTerm tasks the value begins with 'P', for Automatic tasks 'A', and for Manual tasks 'M'. Will be unique across years. The numeric portion has no inherent significance.
Examples:

  <MTaskID>A2</MTaskID>
  <MTaskID>M1107</MTaskID>

Subelement of: <CitMaintTask>
In file: GCM.XML; GCM_SEARCH.XML.
Required element: yes

<NewMeSHPrefTerm>
Description: Preferred term in MeSH for <NewMeSHUI>. Null when Action is Delete. Critical for PrefTerm changes, may be redundant for Automatic and Manual changes. May be the same as <ExistingMeSHPrefTerm> in the same task when TaskSourceType is Manual or Automatic.
Example:

  <NewMeSHPrefTerm>Oceanic Ancestry Group</NewMeSHPrefTerm>

Subelement of: <CitMaintTask>
In file: GCM.XML.
Required element: no

<NewMeSHRecType>
Description: The MeSH record type of the <NewMeSHUI>. One of DESCRIPTOR, QUALIFIER, SCR. Null when Action is Delete. Redundant in that the record type may be inferred from the initial character of <NewMeSHUI> (D, Q, C). Designed to make it easier for users of XML to extract actions pertaining to only one record type. May be different from <ExistingMeSHRecType> in the same task.
Example:

  <NewMeSHRecType>DESCRIPTOR</NewMeSHRecType>

Subelement of: <CitMaintTask>
In file: GCM.XML.
Required element: no

<NewMeSHUI>
Description: UI of the MeSH record reference replacing the existing value, or being added. Matches the seven-character string in a <DescriptorUI>, <SupplementalRecordUI>, or <QualifierUI>. Null when Action is Delete. Same value as <ExistingMeSHUI> in the same task for PrefTerm change. When a <DescriptorUI>, the value may include an attached <QualifierUI>. (See example.)
Examples:

  <NewMeSHUI>D043203</NewMeSHUI>
  <NewMeSHUI>D008628/Q000627</NewMeSHUI>

Subelement of: <CitMaintTask>
In file: GCM.XML.
Required element: no
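
A <NewMeSHUI> value with an attached Qualifier UI can be separated into its parts as sketched below. This assumes, based only on the example above, that a slash is the separator and that at most one Qualifier UI is attached; the function name split_new_ui is hypothetical.

```python
from typing import Optional, Tuple


def split_new_ui(value: str) -> Tuple[str, Optional[str]]:
    """Split 'D008628/Q000627' into ('D008628', 'Q000627').

    Returns None as the second item when no Qualifier UI is attached.
    """
    main_ui, separator, qualifier_ui = value.partition("/")
    return main_ui, (qualifier_ui if separator else None)


with_qualifier = split_new_ui("D008628/Q000627")
without_qualifier = split_new_ui("D043203")
```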

<Sequence>
Description: Number indicating the order in which tasks for a given year are executed. The order in which the tasks must be performed is: (a) Automatic, (b) Preferred Term, and (c) Manual. In addition, a specific order may be required within the Manual tasks. To guarantee this order, the <Sequence> values are assigned in the following way:

All Automatic tasks have a value of 1.
All PrefTerm tasks have a value of 2.
All Manual tasks have a value of 3 or greater, depending on the order specified by the analyst creating the Manual task.

Example:

  <Sequence>1</Sequence>

Subelement of: <CitMaintTask>
In file: GCM.XML.
Required element: yes

<SearchPubMed>
Description: A citation search used to restrict the application of a Manual task specified in GCM.XML. PubMed format - see http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html.

Example:

  <SearchPubMed>biota [nm] AND+MEDLINE+[sb]</SearchPubMed>

Subelement of: <CitMaintSearch>
In file: GCM_SEARCH.XML.
Required element: yes

TaskSourceType
Description: Process by which task was created. One of the following: PrefTerm, Automatic, Manual.
Example:

  <CitMaintTask Action="Replace" TaskSourceType="Automatic">

Subelement of: n/a; attribute of <CitMaintTask>
In file: GCM.XML.
Required element: yes