2012 MEDLINE®/PubMed® Baseline Distribution
December 14, 2011
This documentation pertains to the 2012 MEDLINE/PubMed baseline database produced after NLM’s end-of-year maintenance in November 2011. Links for this documentation, DTDs, data element descriptions, update file information, announcements, and other resources are available from the NLM Web page for MEDLINE/PubMed licensees.
The 2012 baseline database of 20,494,848 records resides in files medline12n0001 through medline12n0684 and completely replaces all 2011 MEDLINE/PubMed data files previously distributed.
The baseline files are grouped by publication years. There are some records in the baseline files with a 2012 PubDate. A list of all baseline files providing file name, years covered, file size, and record count is available. All records in the OLDMEDLINE subset (defined as <CitationSubset> value = OM) reside in the pre-1966 files.
All baseline records are in MEDLINE, OLDMEDLINE, or PubMed-not-MEDLINE MedlineCitation Status and are completed records (quality reviewed and contain the DateCompleted element).
The baseline files do not contain records in the two additional MedlineCitation Statuses distributed to licensees in update files: In-Process and In-Data-Review. Records in those statuses are first distributed in the initial batch of update files, referred to as the 'catch-up' files and discussed in item 4c below.
Approximately 2% of records in PubMed do not reside in NLM’s Data Creation and Maintenance System (DCMS) and thus are not distributed to MEDLINE/PubMed licensees in baseline or subsequent update files. In PubMed, those records have MedlineCitation Status = Publisher and are retrieved by the search strategy: publisher [sb].
A DateRevised date of November 17, 2011 was assigned to the nearly 455,000 records that were maintained during year-end processing. The 2012 baseline files completely replace all previously distributed MEDLINE/PubMed records. If you have loaded previously distributed MEDLINE/PubMed data files, do not attempt to replace records based on the DateRevised element; a complete re-load is necessary.
2. BASELINE DATA VIA FTP
The 2012 MEDLINE/PubMed baseline database, comprising 684 data files in compressed format, is distributed from an NLM FTP server. An md5 check file accompanies each data file. FTP access information provided to licensees is for NLM MEDLINE/PubMed licensees only; do not share directory or file names with others. Be sure to use the IP address you registered with NLM, as all other IP addresses are blocked from accessing the files. A listing of all baseline files providing file name, years covered, file size, and record count is available.
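The md5 check can be automated after download. The following is a minimal sketch, not NLM's procedure. It assumes that the check file contains the 32-character hexadecimal digest somewhere in its text; the exact layout of the check file is not described in this documentation, and the function names are illustrative.

```python
import hashlib
import re
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(check_file_text):
    """Extract the first 32-character hex string from the check file text.

    Assumption: the check file carries the digest as plain hex somewhere
    in its text; the exact layout is not specified in this documentation.
    """
    match = re.search(r"\b[0-9a-fA-F]{32}\b", check_file_text)
    if match is None:
        raise ValueError("no MD5 digest found in check file")
    return match.group(0).lower()


def verify(data_path, check_path):
    """True when the data file's digest matches the one in its check file."""
    return md5_of_file(data_path) == expected_md5(Path(check_path).read_text())
```

A file that fails this check was most likely truncated or corrupted in transfer and should be downloaded again before loading.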
3. DTD AND DATA CHANGES FOR 2012
See the announcement dated 8/31/11 for information about DTD and XML changes for 2012. The DTD changes for this year are also summarized in the Revision Notes at the top of the DTD. Also see the MEDLINE Data Changes article in the Nov-Dec 2011 NLM Technical Bulletin.
4. UPDATE FILES
Information about maintenance of MEDLINE/PubMed records is available. This page contains guidelines for processing update files which contain new and revised records and PMIDs of deleted records.
Files containing new and revised records and PMIDs of deleted records update the baseline database and are generally available 5 times per week, Tuesday through Saturday, during the production year. On occasion, an initial or changed <DateRevised> value does not get assigned to revised records. Large numbers of records reflecting various types of maintenance may be distributed in update files throughout the year.
After the annual baseline files are loaded, licensees should apply all subsequent update files in ascending file name order to add new records, replace revised records, and delete records.
Each update file may contain no more than 30,000 records. Multiple files will be generated if the total number of new, revised, or deleted records for a given update exceeds the maximum number of records.
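As a worked illustration of the 30,000-record limit, the number of files for one update is the record total divided by the limit, rounded up. This sketch shows the arithmetic only; how NLM actually splits records across files is not described here, and the function name is illustrative.

```python
import math

MAX_RECORDS_PER_FILE = 30_000  # per-file limit stated above


def update_file_count(total_records):
    """Number of update files needed to carry total_records records."""
    return math.ceil(total_records / MAX_RECORDS_PER_FILE)
```

For example, an update of 75,000 records would need three files under this rule.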
b. Order of Processing, Stats Files, Notes Files
To maintain a complete and accurate database, 2012 update files should be applied after the 2012 baseline files are processed (and after records obtained from PubMed using the special text file of PMIDs discussed in item 5 below are added, if you opt to do so). It is critical that update files be processed in ascending numeric order based on filename to ensure that the most current version of each record is retained. Licensees should refer to the _stats.html file that accompanies each data file on the server for a breakdown of record categories, and should also read occasional _notes.txt files that may appear later for additional information about the data distributed in that file, e.g., retracted publications. An md5 check file accompanies each data file. The MEDLINE/PubMed Update Chart is also available from the Web page for MEDLINE/PubMed licensees.
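The ordering rule can be sketched as follows. This is an illustration under stated assumptions, not a reference loader: it assumes each record is a <MedlineCitation> element whose <PMID> child holds the identifier, that PMIDs of deleted records are listed under a <DeleteCitation> element, and that files are gzip-compressed with names such as medline12n0685.xml.gz. Verify the element names against the DTD before relying on them. The in-memory dict stands in for whatever database a licensee actually uses.

```python
import gzip
import re
import xml.etree.ElementTree as ET


def file_sequence(name):
    """Numeric sequence from a name such as 'medline12n0685.xml.gz' (685)."""
    match = re.search(r"medline\d{2}n(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name}")
    return int(match.group(1))


def apply_update(xml_bytes, store):
    """Apply one update file's XML to store, a dict keyed by PMID.

    Assumed element names: MedlineCitation, PMID, DeleteCitation.
    """
    root = ET.fromstring(xml_bytes)
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID")  # direct child only
        store[pmid] = ET.tostring(citation)  # add new or replace revised
    for deleted in root.iter("DeleteCitation"):
        for pmid in deleted.iter("PMID"):
            store.pop(pmid.text, None)  # delete if present
    return store


def apply_all(paths, store):
    """Process gzip-compressed update files in ascending numeric order."""
    for path in sorted(paths, key=file_sequence):
        with gzip.open(path, "rb") as handle:
            apply_update(handle.read(), store)
    return store
```

Sorting on the parsed number rather than on the raw string keeps the order correct even if file names are ever not zero-padded to the same width.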
c. Catch-Up Files
The first group of update files (not including the special text file of PMIDs discussed in item 5 below) is referred to as 'catch-up' files. They include all In-Process and In-Data-Review Status records at the time the 2012 baseline files are loaded into PubMed, plus all records which have been completed or maintained (but not released) since the final 2011 update file containing new MEDLINE and PubMed-not-MEDLINE status records was available on November 17, 2011. PMIDs for deleted records were also not exported during this time and are included in the catch-up and subsequent update files. This year the catch-up files are medline12n0685 through medline12n0702.
5. ADDITIONAL RECORDS IN PUBMED AND NOT DISTRIBUTED TO LICENSEES IN UPDATE FILES
An additional file is available to licensees. It is a text file of PMIDs of records in MedlineCitation Status = In-Process and MedlineCitation Status = In-Data-Review which have been retained in the 2012 version of PubMed at the time the 2012 baseline files were loaded and which are not exported to licensees in the first group of update files at this time. This represents a small number of records, but licensees who wish to create a database as close as possible to the record content in PubMed at this time will want to include them. These records should eventually be re-exported as completed records in MedlineCitation Status = MEDLINE or MedlineCitation Status = PubMed-not-MEDLINE after NLM completes the work on them.
Licensees may use the Entrez Utilities to download the records from PubMed using the list of PMIDs. IMPORTANT: If you elect to add these records to your version of MEDLINE/PubMed, they must be added to the baseline files prior to processing any of the routine update files to ensure retaining the most current version of those records. See the MEDLINE/PubMed access instructions for further instruction.
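One way to turn the PMID list into Entrez Utilities requests is sketched below. It assumes the text file holds one PMID per line (its layout is not described here), uses the public EFetch endpoint with the db, id, and retmode parameters, and picks a batch size of 200 purely for illustration. The XML returned by EFetch may be wrapped differently from the licensee data files, so check it against your loader before merging.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_pmids(text):
    """Parse PMIDs from text, assuming one PMID per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def efetch_urls(pmids, batch_size=200):
    """Yield one EFetch URL per batch of PMIDs, requesting PubMed XML."""
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        query = urlencode({"db": "pubmed", "id": ",".join(batch), "retmode": "xml"})
        yield f"{EFETCH_URL}?{query}"
```

Each URL can then be fetched with any HTTP client, pausing between requests in line with the Entrez Utilities usage guidelines.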
This special file resides in both the .gz and .zip directories in the directory containing update files.
Again, keep in mind that records in PubMed in MedlineCitation Status = Publisher, retrieved by the PubMed search strategy: publisher [sb], are never exported to licensees.
6. 2012 MeSH VOCABULARY
The 2012 MeSH Vocabulary may be downloaded. MEDLINE/PubMed licensees may use the MeSH Vocabulary File to take full advantage of the hierarchical nature of the controlled vocabulary in their implementations of MEDLINE.
As the authority file for MEDLINE, the MeSH Vocabulary File is used to validate MeSH indexing terms at data entry/input, and NLM's PubMed uses translation tables and explodes derived from MeSH to enhance search capabilities. These enhanced capabilities are achieved through special programming at NLM which makes possible the following techniques: a) ability to search print see references, non-print see references, and certain data from abbreviations in MEDLINE as though they were the legitimate MeSH Heading; and b) ability to search chemical synonyms in MEDLINE as though they were the legitimate chemical name.
The MeSH file is also used to generate the MeSH Tree Number. The tree number data are necessary because they are the basis of the capability whereby MeSH terms are arranged hierarchically by subject categories, with more specific terms arranged beneath broader terms. NLM uses this feature so that when MeSH terms are searched in PubMed, the program automatically includes the more specific MeSH terms. For example, a search on Hand will also include records retrieved on Fingers, Thumb, and Wrist, because those terms are 'indented' under Hand in the MeSH tree structure.
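A licensee implementation could reproduce this kind of explosion with a prefix test on tree numbers. The sketch below is an illustration, not NLM's PubMed implementation. It assumes tree numbers are dot-separated, so that a more specific term's number begins with the broader term's number followed by a dot. The tree numbers in the sample are invented for the example; take real values from the MeSH Vocabulary File.

```python
def is_descendant(tree_number, ancestor):
    """True when tree_number sits beneath ancestor in the hierarchy."""
    return tree_number.startswith(ancestor + ".")


def explode(heading, tree_numbers):
    """Return heading plus every heading indented beneath it.

    tree_numbers maps each heading to a list of its tree numbers, since a
    heading may appear in more than one place in the hierarchy.
    """
    roots = tree_numbers[heading]
    matches = {heading}
    for name, numbers in tree_numbers.items():
        if any(is_descendant(n, root) for n in numbers for root in roots):
            matches.add(name)
    return sorted(matches)


# Invented tree numbers for illustration only.
SAMPLE_TREE = {
    "Hand": ["X01.100"],
    "Fingers": ["X01.100.200"],
    "Thumb": ["X01.100.200.300"],
    "Wrist": ["X01.100.400"],
    "Foot": ["X01.500"],
}

print(explode("Hand", SAMPLE_TREE))  # ['Fingers', 'Hand', 'Thumb', 'Wrist']
```

Appending the dot before the prefix test prevents a false match between sibling numbers that merely share leading digits, such as X01.10 and X01.100.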
7. JOURNALS CITED IN MEDLINE/PUBMED
For information about journals cited in MEDLINE/PubMed see the NLM FAQ: Finding NLM Serials Data and MEDLINE Indexed Journals.