United States National Library of Medicine National Institutes of Health

2007 MEDLINE®/PubMed® Baseline Distribution

U.S. National Library of Medicine
December 11, 2006

This documentation pertains to the 2007 MEDLINE/PubMed baseline database produced after NLM’s end-of-year maintenance in November 2006.  Links for this documentation, DTDs, data element descriptions, update file information, and other resources are available from NLM’s Web page for licensees.

1. OVERVIEW

The 2007 baseline database of 16,120,074 records resides in files medline07n0001 through medline07n0538 and completely replaces all 2006 MEDLINE/PubMed data files previously distributed.
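The file-name range above lends itself to a quick completeness check after download. The following is a minimal sketch, assuming the names run consecutively with a four-digit, zero-padded counter; the helper names are hypothetical, and no file extension is assumed because this documentation does not state one.

```python
# Build the expected baseline file stems medline07n0001 .. medline07n0538.
# Assumption: the counter is consecutive and zero-padded to four digits.
FIRST, LAST = 1, 538


def baseline_stems(first=FIRST, last=LAST):
    """Return the expected file stems (no extension) in ascending order."""
    return [f"medline07n{n:04d}" for n in range(first, last + 1)]


def missing_stems(present):
    """Return expected stems that are absent from an iterable of stems."""
    have = set(present)
    return [stem for stem in baseline_stems() if stem not in have]


stems = baseline_stems()
print(len(stems), stems[0], stems[-1])
```

Passing the stems actually found on disk to `missing_stems` returns any files that still need to be fetched.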

The baseline files are grouped by publication years. There are some records in the baseline files with a 2007 PubDate. A list of all baseline files providing file name, years covered, file size, and record count is available. All records in the OLDMEDLINE subset (now defined as <CitationSubset> value = OM) reside in the pre-1966 files.

All baseline records have a MedlineCitation Status of MEDLINE, OLDMEDLINE, or PubMed-not-MEDLINE and are completed records (quality reviewed and containing the DateCompleted element). The number of records for each status is:

MEDLINE status:   15,435,706
OLDMEDLINE status:   509,161 (see item 4b below)
PubMed-not-MEDLINE status:   175,207
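The three counts above sum to the 16,120,074 records stated in the overview, and a licensee can reproduce the tally after loading. The sketch below is illustrative only: it assumes the status is carried as a Status attribute on each MedlineCitation element, and the sample XML and the function name tally_status are made up for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Counts as published in this documentation.
PUBLISHED = {
    "MEDLINE": 15_435_706,
    "OLDMEDLINE": 509_161,
    "PubMed-not-MEDLINE": 175_207,
}
total = sum(PUBLISHED.values())
print(total)  # 16120074, matching the overview


def tally_status(xml_source):
    """Count MedlineCitation elements per Status value in one XML file.

    Assumption: Status is an attribute of <MedlineCitation>.
    """
    counts = Counter()
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "MedlineCitation":
            counts[elem.get("Status", "unknown")] += 1
            elem.clear()  # release memory for large files
    return counts


# Tiny made-up sample standing in for a decompressed baseline file.
sample = io.BytesIO(
    b"<MedlineCitationSet>"
    b'<MedlineCitation Status="MEDLINE"><PMID>1</PMID></MedlineCitation>'
    b'<MedlineCitation Status="OLDMEDLINE"><PMID>2</PMID></MedlineCitation>'
    b'<MedlineCitation Status="MEDLINE"><PMID>3</PMID></MedlineCitation>'
    b"</MedlineCitationSet>"
)
sample_counts = tally_status(sample)
print(dict(sample_counts))
```

Summing the per-file counters across all baseline files should reproduce the published figures.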

The baseline files do not contain records in the two additional MedlineCitation Statuses distributed to licensees in update files: In-process or In-Data-Review. Records in those statuses begin to be distributed in the first batch of update files referred to as the 'catch-up' files discussed in item 6c below.

Approximately 2% of records in PubMed do not reside in NLM’s Data Creation and Maintenance System (DCMS) and therefore are not distributed to MEDLINE/PubMed licensees in baseline or subsequent update files. In PubMed those are the records in XML MedlineCitation Status = Publisher; they are tagged either [PubMed - as supplied by publisher] or [PubMed - author manuscript in PMC].

A DateRevised date of November 15, 2006 was assigned to 7,852,603 of the records that were maintained during year-end processing and a DateRevised date of November 17, 2006 was assigned to 40 others. Some maintenance actions during year-end processing did not generate any DateRevised value. The 2007 baseline files completely replace all previously distributed MEDLINE/PubMed records; do not attempt to replace records based on the <DateRevised> element.

2. BASELINE DATA VIA FTP:

The 2007 MEDLINE/PubMed baseline database, comprising 538 compressed data files, is distributed from NLM's FTP server. An md5 check file accompanies each data file. FTP access information is for NLM MEDLINE/PubMed licensees only; do not share directory or file names with others. Be sure to use the IP address you registered with NLM, as all other IP addresses are blocked from accessing the files.

3. DTD:

NLM's MEDLINE DTD dated January 1, 2007 is now in use. This DTD references the MedlineCitation DTD which in turn references the NLMSharedCatCit DTD which in turn references the NLMCommon DTD. The MEDLINE DTD, therefore, is the "parent" DTD and the starting point for licensees. The DTD changes for this year are summarized in the Revision Notes at the top of the DTDs and in several announcements.

4. DATA CHANGES IN BASELINE FILES:

a. <Title> Element:

1). The descriptive bracketed phrase designating journal media that had appeared at the end of some complete journal names and the ending periods have been removed. For example:

Old format:   <Title>BMC bioinformatics [electronic resource].</Title>
New format:   <Title>BMC bioinformatics</Title>

Old format:   <Title>Molecular microbiology.</Title>
New format:   <Title>Molecular microbiology</Title>

2). The extra space that had appeared after the colon in titles that have subtitles has been removed. In the following example, only one space now appears between the colon and the subtitle portion of the complete journal title:

<Title>Therapeutic apheresis and dialysis : official peer-reviewed journal of the International Society for Apheresis, the Japanese Society for Apheresis, the Japanese Society for Dialysis Therapy</Title>

3). Records were revised to remove an upside-down question mark that had appeared in some complete journal titles.
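Licensees who compare 2007 titles against journal titles stored from earlier years may want to apply the first two changes to their older values. The following is a rough sketch, not NLM's own procedure: the function name and regular expressions are assumptions, and change 3 is left out because the text does not say how the affected titles now read.

```python
import re


def normalize_title(title):
    """Approximate changes 1) and 2) on an old-format complete journal title."""
    t = title.strip()
    # 1) Drop the ending period, then a trailing bracketed media phrase
    #    such as " [electronic resource]".
    if t.endswith("."):
        t = t[:-1]
    t = re.sub(r"\s*\[[^\]]*\]$", "", t)
    # 2) Collapse two or more spaces after a colon to a single space.
    t = re.sub(r":\s{2,}", ": ", t)
    return t


print(normalize_title("BMC bioinformatics [electronic resource]."))
print(normalize_title("Molecular microbiology."))
```

The sketch is meant for complete journal names only; applying it to abbreviated titles would strip a period that belongs to the abbreviation.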

b. OLDMEDLINE Subset Status Change:

Effective with distribution of the 2007 baseline files, all but 509,161 records of the 1.7 million records in the OLDMEDLINE subset are tagged as MedlineCitation Status = MEDLINE. The criterion for records in the OLDMEDLINE subset being tagged as MEDLINE status is that all of the old MeSH Headings in KeywordList have been mapped to current MeSH vocabulary. The remaining records in the OLDMEDLINE subset are in MedlineCitation Status = OLDMEDLINE.

Records in the OLDMEDLINE subset are derived from the older print indices of the Cumulated Index Medicus (CIM) and the Current List of Medical Literature (CLML). They are identified by the <CitationSubset> value OM. The original subject headings are retained in <KeywordList> and the current MeSH headings reside in <MeshHeadingList>.
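To make the element roles concrete, the sketch below pulls the status, the original headings, and the current MeSH headings from one OLDMEDLINE-subset record. The nesting used here (Keyword inside KeywordList, DescriptorName inside MeshHeading), the Status attribute, the sample record, and the function name are assumptions for illustration; the DTD remains the authority on the actual structure.

```python
import xml.etree.ElementTree as ET

# Made-up record; element nesting is an assumption, see the DTD for the real layout.
SAMPLE = """
<MedlineCitation Status="OLDMEDLINE">
  <PMID>0</PMID>
  <CitationSubset>OM</CitationSubset>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Example current heading</DescriptorName></MeshHeading>
  </MeshHeadingList>
  <KeywordList>
    <Keyword>EXAMPLE ORIGINAL HEADING</Keyword>
  </KeywordList>
</MedlineCitation>
"""


def describe_om_record(citation):
    """Return heading details for an OLDMEDLINE-subset record, else None."""
    subsets = [e.text for e in citation.findall("CitationSubset")]
    if "OM" not in subsets:
        return None
    return {
        "status": citation.get("Status"),
        "original_headings": [k.text for k in citation.findall("KeywordList/Keyword")],
        "current_mesh": [
            d.text for d in citation.findall("MeshHeadingList/MeshHeading/DescriptorName")
        ],
    }


info = describe_om_record(ET.fromstring(SAMPLE))
print(info)
```

A record in the subset may carry either MEDLINE or OLDMEDLINE status, so the subset test relies on the CitationSubset value rather than on the status.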

c. Some MeSH headings have changed to Publication Types (see item 3 in the September 20th announcement).

5. DOCUMENTATION

This baseline documentation is at http://www.nlm.nih.gov/bsd/licensee/2007_stats/baseline_doc.html. A list of all baseline files providing file name, years covered, file size, and record count is available. Information about maintenance of MEDLINE/PubMed records is also available.

Several articles in the September-October and November-December 2006 NLM Technical Bulletin (TB) contain details on changes for 2007, including those involving MeSH Vocabulary, and how PubMed or searching PubMed is affected.

Licensees should check NLM’s main Web page for licensees, which contains links to pertinent resources and announcements. Substantive e-mails are posted as announcements after they have been sent directly to licensees. Older announcements are moved to the Archive page. E-mail recipients should forward messages to appropriate staff in their organization.

6. UPDATE FILES

a. General

Update files containing new, revised, and deleted records are generally made available 5 times per week during the production year; these files update the baseline database. Large numbers of records reflecting various types of maintenance may be distributed in update files throughout the year. See background information about maintenance of MEDLINE records.

On occasion, an initial or changed <DateRevised> value does not get assigned to changed records. Licensees should use the annual baseline files to replace all records previously distributed and then apply all subsequent update files in ascending file name order to add new records, replace revised records, and delete records from the baseline database.

Each update file may contain no more than 300,000 records; files in the first batch, called catch-up files (see 6c below), may contain no more than 10,000. Multiple files will be generated if the total number of new, revised, or deleted records for a given update exceeds the maximum number of records.
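The number of files generated for one update follows from a ceiling division of the record total by the per-file maximum. A small sketch, with the function name and the record totals chosen only for illustration:

```python
def files_needed(record_count, max_per_file):
    """Smallest number of files that can hold record_count records."""
    if max_per_file <= 0:
        raise ValueError("max_per_file must be positive")
    return -(-record_count // max_per_file)  # ceiling division on integers


# Illustrative totals, using the per-file limit stated above.
print(files_needed(650_000, 300_000))
print(files_needed(300_000, 300_000))
```

The first call needs three files because two full files leave 50,000 records over; the second fits exactly into one.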

b. Order of Processing, Stats Files, and Notes Files

To maintain a complete and accurate database, all 2007 update files should be applied after the 2007 baseline files. It is critical that update files be processed in ascending numeric order based on filename to ensure that the most current version of each record is retained. Licensees should refer to the _stats.html file that accompanies each data file on the server for a breakdown of record categories and should also read occasional _notes.txt files that may appear later for additional information about the data distributed in that file, e.g., retracted publications. The MEDLINE/PubMed Update Chart for 2007 available from the Licensee Web page summarizes update files.
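The processing order can be sketched as follows. This is a simplified in-memory model, not a loader: each update file is represented as a hypothetical dictionary of new or revised records keyed by PMID plus a list of deleted PMIDs, the file names are invented on the assumption that update files continue the baseline naming pattern, and applying deletions after the records within one file is likewise an assumption.

```python
import re


def file_number(name):
    """Extract the numeric counter that follows 'n' in a name like medline07n0539."""
    match = re.search(r"n(\d+)", name)
    if match is None:
        raise ValueError(f"no file number in {name!r}")
    return int(match.group(1))


def apply_updates(database, updates):
    """Apply update batches to database (PMID -> record) in ascending file order."""
    for name in sorted(updates, key=file_number):
        batch = updates[name]
        # New records are added and revised records replace the stored version.
        database.update(batch.get("records", {}))
        # Deleted PMIDs are removed; an unknown PMID is ignored.
        for pmid in batch.get("deleted", []):
            database.pop(pmid, None)
    return database


baseline = {1: "record 1, baseline", 2: "record 2, baseline"}
updates = {
    "medline07n0540": {"records": {1: "record 1, second revision"}},
    "medline07n0539": {
        "records": {1: "record 1, first revision", 3: "record 3, new"},
        "deleted": [2],
    },
}
result = apply_updates(dict(baseline), updates)
print(result)
```

Because the later file is applied last, record 1 ends up as its second revision even though the files were supplied out of order.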

c. Catch-Up Files

The first group of update files is referred to as 'catch-up' files. They include all In-process and In-Data-Review Status records at the time the 2007 baseline files are loaded into PubMed plus all records which have been completed or maintained since availability of the final 2006 update file containing new MEDLINE and PubMed-not-MEDLINE status records on November 15, 2006. PMIDs for deleted records had also not been exported during this time and are included in the catch-up and subsequent update files.

d. MD5 Checksum File

An md5 check file accompanies each data file. This year the md5 file accompanying the update files will be in the same format as used for the baseline files: "MD5 (filename) = checksum". In past years the md5 format for update files was "checksum followed by filename".
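A check file in the "MD5 (filename) = checksum" format can be verified with a few lines of standard-library code. The sketch below is one possible approach; the function names, the regular expression, and the demo file names are assumptions, and the demo writes throwaway files to a temporary directory rather than touching real data.

```python
import hashlib
import os
import re
import tempfile

# Matches lines of the form: MD5 (filename) = 32 hex digits
MD5_LINE = re.compile(r"^MD5\s*\((?P<name>.+)\)\s*=\s*(?P<digest>[0-9a-fA-F]{32})$")


def parse_md5_line(line):
    """Return (filename, lowercase checksum) from one check-file line."""
    match = MD5_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized md5 line: {line!r}")
    return match.group("name"), match.group("digest").lower()


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(data_path, md5_path):
    """True if the data file's digest matches the one in its check file."""
    with open(md5_path, encoding="ascii") as handle:
        _, expected = parse_md5_line(handle.read())
    return md5_of_file(data_path) == expected


# Demo with throwaway files (names are made up for the example).
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "example_data_file")
    md5_path = data_path + ".md5"
    with open(data_path, "wb") as handle:
        handle.write(b"sample payload")
    with open(md5_path, "w", encoding="ascii") as handle:
        handle.write(f"MD5 (example_data_file) = {md5_of_file(data_path)}\n")
    demo_ok = verify(data_path, md5_path)
print(demo_ok)
```

Reading in chunks keeps memory use flat for large compressed files; a mismatch usually indicates an incomplete or corrupted transfer, and the file should be downloaded again.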

7. 2007 MeSH VOCABULARY:

The 2007 MeSH Vocabulary may be downloaded. MEDLINE/PubMed licensees may use the MeSH Vocabulary File to take full advantage of the hierarchical nature of the controlled vocabulary in their implementations of MEDLINE.

As the authority file for MEDLINE, the MeSH Vocabulary File is used to validate MeSH indexing terms at data entry/input. NLM's PubMed also uses translation tables and explodes derived from MeSH to enhance search capabilities. These enhanced capabilities are achieved through special programming at NLM, which makes possible the following techniques: a) the ability to search print see references, non-print see references, and certain data form abbreviations in MEDLINE as though they were the legitimate MeSH Heading; and b) the ability to search chemical synonyms in MEDLINE as though they were the legitimate chemical name.

The MeSH file is also used to generate the MeSH Tree Number. The tree number data are necessary because they are the basis for arranging MeSH terms hierarchically by subject category, with more specific terms arranged beneath broader terms. NLM uses this feature so that when MeSH terms are searched in PubMed, the program automatically includes the more specific MeSH terms. For example, a search on Hand will also retrieve records indexed with Fingers, Thumb, and Wrist because those terms are 'indented' under Hand in the MeSH tree structure.
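A licensee implementation can approximate this explode behavior by prefix-matching tree numbers. The sketch below is not NLM's programming: the function name is hypothetical, and the tree numbers in the table are illustrative placeholders that should be replaced with values read from the 2007 MeSH Vocabulary File.

```python
# Illustrative tree numbers only; load real values from the MeSH Vocabulary File.
TREE_NUMBERS = {
    "Hand": ["A01.378.800.667"],
    "Fingers": ["A01.378.800.667.430"],
    "Thumb": ["A01.378.800.667.430.705"],
    "Wrist": ["A01.378.800.667.715"],
    "Foot": ["A01.378.610.250"],
}


def explode(term, tree_numbers=TREE_NUMBERS):
    """Return the term plus every heading indented beneath it in the tree."""
    roots = tree_numbers[term]
    matched = set()
    for heading, numbers in tree_numbers.items():
        for number in numbers:
            # A heading is at or below a root when its tree number equals the
            # root or extends it with further dot-separated levels.
            if any(number == root or number.startswith(root + ".") for root in roots):
                matched.add(heading)
    return sorted(matched)


print(explode("Hand"))     # ['Fingers', 'Hand', 'Thumb', 'Wrist']
print(explode("Fingers"))  # ['Fingers', 'Thumb']
```

Matching on the root followed by a dot, rather than on the bare prefix, prevents a number such as 667 from also matching a sibling numbered 6670. Each heading maps to a list because a MeSH heading can appear at more than one place in the tree.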

8. JOURNALS CITED IN MEDLINE/PUBMED

Information about journals cited in MEDLINE/PubMed is found in:

  1. The List of Serials Indexed for Online Users available in PDF and XML format and the List of Journals Indexed for MEDLINE available in PDF (these publications do not cover OLDMEDLINE journal titles)
  2. LocatorPlus and NLM Catalog, the NLM's online catalog
  3. Serfile, another file that may be leased from NLM (see http://www.nlm.nih.gov/databases/leased.html)
  4. PubMed journals files (contains limited journal information; updated daily)
  5. Entrez Journals database for basic journal information similar to data found in PubMed journals files (available in the Entrez Utilities; updated daily).

Last updated: 11 December 2006
First published: 06 December 2006
Permanence level: Permanence Not Guaranteed