United States National Library of Medicine National Institutes of Health

2008 MEDLINE®/PubMed® Baseline Distribution

U.S. National Library of Medicine

December 3, 2007 [updated 12/5/07]

This documentation pertains to the 2008 MEDLINE/PubMed baseline database produced after NLM’s end-of-year maintenance in November 2007. Links for this documentation, DTDs, data element descriptions, update file information, and other resources are available from NLM’s Web page for licensees.

1. OVERVIEW

The 2008 baseline database of 16,880,015 records resides in files medline08n0001 through medline08n0563 and completely replaces all 2007 MEDLINE/PubMed data files previously distributed.

The baseline files are grouped by publication years. There are some records in the baseline files with a 2008 PubDate. A list of all baseline files providing file name, years covered, file size, and record count is available. All records in the OLDMEDLINE subset (defined as <CitationSubset> value = OM) reside in the pre-1966 files.

All baseline records are in MEDLINE, OLDMEDLINE, or PubMed-not-MEDLINE MedlineCitation Status and are completed records (quality reviewed and contain the DateCompleted element). The number of records for each status is:

MEDLINE status: 16,209,581
OLDMEDLINE status: 413,020 (see item 5 below)
PubMed-not-MEDLINE status: 257,414
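
As a quick consistency check, the three status counts above can be summed and compared with the 16,880,015 records stated in the overview. This is a minimal sketch; the variable names are illustrative only.

```python
# Record counts by MedlineCitation Status, as listed above.
status_counts = {
    "MEDLINE": 16_209_581,
    "OLDMEDLINE": 413_020,
    "PubMed-not-MEDLINE": 257_414,
}

# Total number of baseline records stated in the overview.
BASELINE_TOTAL = 16_880_015

total = sum(status_counts.values())
assert total == BASELINE_TOTAL
```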

The baseline files do not contain records in the two additional MedlineCitation Statuses distributed to licensees in update files: In-Process and In-Data-Review. Records in those statuses begin to be distributed in the first batch of update files, referred to as the 'catch-up' files and discussed in item 8c below.

Approximately 2% of records in PubMed do not reside in NLM’s Data Creation and Maintenance System (DCMS) and thus are not distributed to MEDLINE/PubMed licensees in baseline or subsequent update files. In PubMed those are the records in XML MedlineCitation Status = Publisher; they are tagged either [PubMed - as supplied by publisher] or [PubMed - author manuscript in PMC].

A DateRevised date of November 14 or November 15, 2007 was assigned to the over 1.4 million records that were maintained during year-end processing. The 2008 baseline files completely replace all previously distributed MEDLINE/PubMed records; do not attempt to replace records based on the <DateRevised> element.

2. BASELINE DATA VIA FTP

The 2008 MEDLINE/PubMed baseline database, comprising 563 data files in compressed format, is distributed from NLM's FTP server. An md5 check file accompanies each data file. FTP access information is for NLM MEDLINE/PubMed licensees only; do not share directory or file names with others. Be sure to use the IP address you registered with NLM, as all other IP addresses are blocked from accessing the files.
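
One way a licensee might verify downloaded files against their md5 check files is sketched below. It is an illustration under stated assumptions, not NLM's procedure: the layout of the check file is not described here, so the sketch simply takes the expected digest as a string, and the helper names are hypothetical. File extensions are deliberately left out because this documentation does not give them.

```python
import hashlib
from pathlib import Path


def baseline_file_stems(first: int = 1, last: int = 563) -> list[str]:
    """Return the stems medline08n0001 ... medline08n0563 (no extension)."""
    return [f"medline08n{n:04d}" for n in range(first, last + 1)]


def md5_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """Compare a file's MD5 with an expected digest (case-insensitive)."""
    return md5_hex(path) == expected_hex.strip().lower()
```

A file whose digest does not match should be downloaded again before it is loaded.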

3. 2008 DTD

NLM's MEDLINE DTD dated January 1, 2008 is now in use. This DTD references the MedlineCitation DTD which in turn references the NLMSharedCatCit DTD which in turn references the NLMCommon DTD. The MEDLINE DTD, therefore, is the "parent" DTD and the starting point for licensees. The DTD changes for this year are summarized in the Revision Notes at the top of the DTDs and in the August 23, 2007 announcement.

4. DATA CHANGES FOR 2008

The following highlights or supplements information transmitted in the August 23, 2007 announcement and in articles published in the Nov-Dec 2007 NLM Technical Bulletin.

a. ELocationID

The new ELocationID element with its attribute EIdType will house Digital Object Identifiers (DOIs) (EIdType = doi) or Publisher Item Identifiers (PIIs) (EIdType = pii) when provided by publishers for newly published (prospective) articles. The ELocationID element is not present in the baseline distribution files and is not expected to be added to existing records. It may be present on records in subsequent update files (timing is uncertain; no earlier than January 2008). If an ELocationID is wrong or is changed by the publisher, the publisher must publish an erratum notice in the journal giving the incorrect and correct numbers before NLM will edit the ELocationID data in the citation.
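
A minimal sketch of how a licensee's loader might read ELocationID values once they appear in update files. It assumes EIdType is carried as an XML attribute on the element; the enclosing element and the identifier values in the sample are invented for illustration, so consult the 2008 DTD for the actual placement.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the parent element and both identifiers are made up.
SAMPLE = """
<Article>
  <ELocationID EIdType="doi">10.1000/example.doi</ELocationID>
  <ELocationID EIdType="pii">S0000-0000(08)00000-0</ELocationID>
</Article>
"""


def elocation_ids(root):
    """Map each EIdType value (doi or pii) to the text of its ELocationID."""
    return {
        el.get("EIdType"): (el.text or "").strip()
        for el in root.iter("ELocationID")
    }


ids = elocation_ids(ET.fromstring(SAMPLE))
```

Because the element is absent from all baseline records, a loader should treat an empty result as the normal case.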

b. ISSNLinking

The new ISSNLinking element enables collocation or linking among the different media versions of a continuing resource. This element is not present in the baseline distribution; NLM is uncertain when records will begin to contain this element.

c. InvestigatorList

The existing InvestigatorList elements will begin to be used for MEDLINE/PubMed in the 2008 production year to contain personal names of individuals (e.g., collaborators and investigators) who are not authors of a paper but rather are listed in the paper as members of a collective/corporate group that is an author of the paper.

d. Grant Number Information

The list of Grant Abbreviation and Institute Acronyms used in the XML GrantList elements has been updated. Effective with release of the baseline files, "United States" precedes the names of US granting organizations in the GrantList Agency element.

5. OLDMEDLINE SUBSET STATUS

Effective with distribution of the 2008 baseline files, all but 413,020 of the 1.7 million records in the OLDMEDLINE subset are tagged as MedlineCitation Status = MEDLINE. The criterion for tagging a record in the OLDMEDLINE subset as MEDLINE status is that all of the old MeSH Headings in KeywordList have been mapped to current MeSH vocabulary. The remaining records in the OLDMEDLINE subset are in MedlineCitation Status = OLDMEDLINE.

Records in the OLDMEDLINE subset are derived from the older print indices of the Cumulated Index Medicus (CIM) and the Current List of Medical Literature (CLML). They are identified by the <CitationSubset> value OM. The original subject headings are retained in <KeywordList> and the current MeSH headings reside in <MeshHeadingList>.
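
The distinction between subset membership (<CitationSubset> value OM) and MedlineCitation Status can be illustrated with a short sketch. The sample records are invented, the second subset value is only a placeholder for "not OM", and the assumption that Status is an attribute of MedlineCitation with CitationSubset as a direct child should be checked against the DTD.

```python
import xml.etree.ElementTree as ET

# Invented sample: PMIDs and the non-OM subset value are placeholders.
SAMPLE = """
<MedlineCitationSet>
  <MedlineCitation Status="MEDLINE">
    <PMID>1</PMID>
    <CitationSubset>OM</CitationSubset>
  </MedlineCitation>
  <MedlineCitation Status="OLDMEDLINE">
    <PMID>2</PMID>
    <CitationSubset>OM</CitationSubset>
  </MedlineCitation>
  <MedlineCitation Status="MEDLINE">
    <PMID>3</PMID>
    <CitationSubset>IM</CitationSubset>
  </MedlineCitation>
</MedlineCitationSet>
"""


def oldmedline_subset(root):
    """Return {PMID: Status} for citations whose CitationSubset includes OM."""
    result = {}
    for citation in root.iter("MedlineCitation"):
        subsets = {el.text for el in citation.findall("CitationSubset")}
        if "OM" in subsets:
            result[citation.findtext("PMID")] = citation.get("Status")
    return result


subset = oldmedline_subset(ET.fromstring(SAMPLE))
```

In the sample, records 1 and 2 both belong to the OLDMEDLINE subset, but only record 2 remains in OLDMEDLINE status.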

6. DOCUMENTATION

This baseline documentation is at http://www.nlm.nih.gov/bsd/licensee/2008_stats/baseline_doc.html. A list of all baseline files providing file name, years covered, file size, and record count is available at http://www.nlm.nih.gov/bsd/licensee/2008_stats/baseline_med_filecount.html. General information about maintenance of MEDLINE/PubMed records is available at http://www.nlm.nih.gov/bsd/licensee/medline_maintenance.html. The latter page contains guidelines for processing update files, which contain new and revised records, and for the special file, new this year, containing PMIDs of records remaining in PubMed and not distributed to licensees (see item 7 below).

Articles in the September-October and November-December 2007 NLM Technical Bulletin (TB) contain details on changes for 2008, including those involving MeSH Vocabulary, and how PubMed or searching PubMed is affected.

Licensees should check http://www.nlm.nih.gov/bsd/licensee/medpmmenu.html, which contains links to pertinent documentation, resources, and announcements for licensees. After substantive e-mails have been sent directly to licensees, they are posted as announcements. Older announcements are moved to the Archive page at http://www.nlm.nih.gov/bsd/licensee/archive_doc.html. E-mail recipients should forward messages to appropriate staff in their organization.

7. ADDITIONAL RECORDS IN PUBMED AND NOT DISTRIBUTED TO LICENSEES IN UPDATE FILES

This year an additional file is available to licensees. It is a text file of PMIDs of records in MedlineCitation Status = In-Process and MedlineCitation Status = In-Data-Review which were retained in the 2008 version of PubMed at the time the 2008 baseline files were loaded and which are not exported to licensees in the first group of update files at this time. This represents a small number of records, but licensees who wish to create a database as close as possible to the record content in PubMed will want to include them. These records should eventually be exported as completed records in MedlineCitation Status = MEDLINE or MedlineCitation Status = PubMed-not-MEDLINE after NLM completes the work on them.

Licensees may use the Entrez Utilities to download the records from PubMed using the list of PMIDs. IMPORTANT: If you elect to add these records to your version of MEDLINE/PubMed, follow the steps provided in the update file access instructions. [edited 12/5/07]
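
As an illustration of the Entrez Utilities route, the sketch below turns a list of PMIDs into EFetch request URLs without issuing any network requests. The one-PMID-per-line layout of the text file, the batch size, and the helper names are assumptions; the base URL is the one currently documented for E-utilities and may differ from the address in use when this document was written.

```python
from urllib.parse import urlencode

# Current E-utilities EFetch endpoint (assumed; check the E-utilities docs).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_pmids(text):
    """Extract PMIDs from text, assuming one numeric PMID per line."""
    return [line.strip() for line in text.splitlines() if line.strip().isdigit()]


def efetch_urls(pmids, batch_size=200):
    """Build one EFetch URL per batch of PMIDs, requesting XML output."""
    urls = []
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        query = urlencode({"db": "pubmed", "id": ",".join(batch), "retmode": "xml"})
        urls.append(f"{EFETCH_URL}?{query}")
    return urls
```

Batching keeps each request URL short; an actual download should also respect the usage guidelines published for the Entrez Utilities.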

This special file resides in both the .gz and .zip directories at the top of the file name list.

8. UPDATE FILES

a. General

Files containing new and revised records and PMIDs of deleted records update the baseline database and are generally available 5 times per week during the production year. On occasion, an initial or changed <DateRevised> value does not get assigned to revised records. Large numbers of records reflecting various types of maintenance may be distributed in update files throughout the year. See background information about maintenance of MEDLINE/PubMed records at http://www.nlm.nih.gov/bsd/licensee/medline_maintenance.html.

Licensees should use the annual baseline files to replace all records previously distributed and then during the year apply all subsequent update files in ascending file name order to add new records, replace revised records, and delete records from the new baseline database.
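
The add, replace, and delete cycle described above can be sketched with an in-memory mapping keyed by PMID. This is a simplified model, not NLM's processing code: real update files are XML, and applying deletions after additions within a single file is an assumption of this sketch.

```python
def apply_update(database, new_or_revised, deleted_pmids):
    """Apply one update file to a {PMID: record} mapping, in place."""
    for pmid, record in new_or_revised.items():
        database[pmid] = record       # adds a new record or replaces a revised one
    for pmid in deleted_pmids:
        database.pop(pmid, None)      # ignore PMIDs that are not present
    return database


# Start from the baseline, then apply update files in ascending file name order.
db = {"1": "baseline version", "2": "baseline version"}
updates_in_order = [
    ({"2": "revised version", "3": "new record"}, []),
    ({}, ["1"]),
]
for records, deletions in updates_in_order:
    apply_update(db, records, deletions)
```

After both updates, record 1 is gone, record 2 holds its revised version, and record 3 has been added.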

Each update file may contain no more than 300,000 records, and no more than 10,000 for the first batch, called catch-up files (see 8c below). Multiple files will be generated if the total number of new, revised, or deleted records for a given update exceeds the maximum number of records.

b. Order of Processing, Stats Files, Notes Files

To maintain a complete and accurate database, 2008 update files should be applied after the 2008 baseline files are processed (and after records obtained from PubMed using the special text file of PMIDs discussed in item 7 above are added, if you opt to do so). It is critical that update files be processed in ascending numeric order based on filename to ensure that the most current version of each record is retained. Licensees should refer to the _stats.html file that accompanies each data file on the server for a breakdown of record categories, and should also read occasional _notes.txt files that may appear later for additional information about the data distributed in that file, e.g., retracted publications. An md5 check file accompanies each data file. The MEDLINE/PubMed Update Chart for 2008 available from the Licensee Web page summarizes update files.
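
A small sketch of ordering update files by the number in their file names before processing. The file names in the usage example are hypothetical, patterned on the baseline names with an invented extension; the helper reads the run of digits at the end of the name, before any extension.

```python
import re


def file_number(name):
    """Return the digits that end the file stem (before any extension) as an int."""
    stem = name.split(".", 1)[0]
    match = re.search(r"(\d+)$", stem)
    if match is None:
        raise ValueError(f"no trailing number in file name: {name!r}")
    return int(match.group(1))


def in_processing_order(names):
    """Sort file names in ascending numeric order of their file number."""
    return sorted(names, key=file_number)


# Hypothetical update file names, listed out of order.
ordered = in_processing_order([
    "medline08n0600.xml.gz",
    "medline08n0564.xml.gz",
    "medline08n1000.xml.gz",
])
```

Sorting on the parsed number rather than on the raw string keeps the order correct even if the numbers are not zero-padded to the same width.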

c. Catch-Up Files

The first group of update files (not including the special text file of PMIDs discussed in item 7 above) is referred to as 'catch-up' files. They include all In-Process and In-Data-Review Status records at the time the 2008 baseline files are loaded into PubMed, plus all records which have been completed or maintained (but not released) since the final 2007 update file containing new MEDLINE and PubMed-not-MEDLINE status records was made available on November 14, 2007. PMIDs for deleted records were also not exported during this time and are included in the catch-up and subsequent update files.

9. 2008 MeSH VOCABULARY

The 2008 MeSH Vocabulary may be downloaded. MEDLINE/PubMed licensees may use the MeSH Vocabulary File to take full advantage of the hierarchical nature of the controlled vocabulary in their implementations of MEDLINE.

As the authority file for MEDLINE, the MeSH Vocabulary File is used to validate MeSH indexing terms at data entry/input, and NLM's PubMed uses translation tables and explodes derived from MeSH to enhance search capabilities. These enhanced capabilities are achieved through special programming at NLM which makes possible the following techniques: a) ability to search print see references, non-print see references, and certain data form abbreviations in MEDLINE as though they were the legitimate MeSH Heading; and b) ability to search chemical synonyms in MEDLINE as though they were the legitimate chemical name.

The MeSH file is also used to generate the MeSH Tree Number. The tree number data are necessary because they are the basis of the capability whereby MeSH terms are arranged hierarchically by subject categories, with more specific terms arranged beneath broader terms. NLM uses this feature so that when MeSH terms are searched in PubMed, the program automatically includes the more specific MeSH terms. For example, a search on Hand will also include records retrieved on Fingers, Thumb, and Wrist because those terms are 'indented' under Hand in the MeSH tree structure.
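
The automatic inclusion of more specific terms can be sketched as a prefix match on tree numbers. This is an illustration of the idea rather than PubMed's implementation, and the tree numbers below are illustrative values only; a real implementation would load them from the MeSH Vocabulary File.

```python
# Illustrative tree numbers; take actual values from the MeSH Vocabulary File.
TREE_NUMBERS = {
    "Hand": ["A01.378.800.667"],
    "Fingers": ["A01.378.800.667.430"],
    "Thumb": ["A01.378.800.667.430.705"],
    "Wrist": ["A01.378.800.667.715"],
    "Foot": ["A01.378.610.250"],
}


def explode(term, tree_numbers):
    """Return the term plus every term whose tree number lies beneath it."""
    roots = tree_numbers[term]
    result = set()
    for name, numbers in tree_numbers.items():
        for number in numbers:
            # A descendant's tree number extends the ancestor's by ".<node>".
            if any(number == root or number.startswith(root + ".") for root in roots):
                result.add(name)
    return sorted(result)


hand_terms = explode("Hand", TREE_NUMBERS)
```

With these sample values, exploding Hand yields Fingers, Hand, Thumb, and Wrist, while Foot is excluded because its tree number sits on a different branch.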

10. JOURNALS CITED IN MEDLINE/PUBMED

Information about journals cited in MEDLINE/PubMed is found in:

a. The List of Serials Indexed for Online Users available in PDF and XML format and the List of Journals Indexed for MEDLINE available in PDF (these publications do not cover OLDMEDLINE journal titles)
b. LocatorPlus and NLM Catalog, the NLM's online catalog
c. Serfile, another file that may be leased from NLM
d. PubMed journals files (contains limited journal information; updated daily)
e. Entrez Journals database for basic journal information similar to data found in PubMed journals files (available in the Entrez Utilities; updated daily).

Last updated: 05 December 2007
First published: 30 November 2007
Permanence level: Permanence Not Guaranteed