Skip Navigation Bar
 

Unified Medical Language System® (UMLS®)

Original Release Format Oracle Load Script

This document is a guide to loading an ORF subset of the UMLS Metathesaurus into an Oracle database by using the scripts and files created by MetamorphoSys. To generate the necessary files and scripts, select the Oracle database in the “Write Database Load Scripts” section of the "Output Options" tab.  

Recommended Software Version: Oracle 10g Release 2 (Standard Edition, Enterprise Edition), Oracle 11g (Standard Edition, Enterprise Edition)

Creating a Database

To load your data into an Oracle database, you will have to either use an existing database or create a new one.  The full complexity of creating an Oracle database is outside the scope of this document.  However, we have included some discussion regarding this process below and we do have a couple of specific recommendations.  In particular, recommend that you specify UTF8 as the CHARACTER SET and NATIONAL CHARACTER SET and configure the database to use "character" instead of "byte" semantics by default when interpreting character field lengths.  These settings are configured through three Oracle parameters:

  1. NLS_CHARACTER_SET
  2. NLS_NCHAR_CHARACTERSET
  3. NLS_LENGTH_SEMANTICS

Guided Database Creation

Creation of an Oracle database involves many configurable parameters and the full complexity of it is outside the scope of this document. We strongly recommend using the Database Configuration Assistant ("dbca") program provided as part of an Oracle installation to guide you through the process. You should find this program in the "bin" directory under your ORACLE_HOME.  This GUI-based wizard tool will guide you through the configuration process.  There is a "Character Sets" tab you will encounter on the "Initialization Parameters" configuration screen.  Here, you can program the character set and the NCHAR character set.  You should also see an "All Initialization Parameters" button that will let you configure NLS_LENGTH_SEMANTICS.  You may have to click on the "Show Advanced Parameters" button to see this particular parameter in the list.   We also recommend that you create an additional tablespace, called something like META, as the default tablespace for the database.  This is where we recommend storing the UMLS data when it is loaded.

Manual Database Creation

For detailed information on creating a database manually in Oracle, please see the "Creating an Oracle Database" section of the Oracle Database Administrator's Guide.  You may find the following samples helpful if you attempt to manually create a database.  Note that commands are in the UNIX/Linux style.  If you are creating a database under Windows, you will have to adjust these samples accordingly.  The first step is to specify an ORACLE_SID and then log into SQL*Plus to create an initial Oracle initialization parameter file.  For example:

  %  setenv ORACLE_SID UMLSDB
% sqlplus /nolog
SQL> CONNECT SYS/<sys pwd> AS SYSDBA 
SQL> -- create the server parameter file
SQL> CREATE SPFILE='/d1/oracle/dbs/spfileUMLSDB.ora' FROM
SQL> PFILE='/d1/oracle/dbs/init.ora';
SQL> SHUTDOWN
SQL> STARTUP NOMOUNT
SQL> ALTER SYSTEM SET nls_length_semantics='CHAR' scope=BOTH;

Note: the SYS password is typically configured when Oracle is installed.  Now, you are ready to issue the CREATE DATABASE command. Here is a sample create database command that addresses the character set  issues described above.

CREATE DATABASE umlsdb
USER SYS IDENTIFIED BY sys_p4ssw0rd
USER SYSTEM IDENTIFIED BY system_p4ssw0rd
CHARACTER SET UTF8
NATIONAL CHARACTER SET UTF8
LOGFILE GROUP 1 ('/d1/..../redo01.log) SIZE 500M,
        GROUP 2 ('/d1/..../redo02.log) SIZE 500M,
        GROUP 3 ('/d1/..../redo02.log) SIZE 500M
MAXLOGFILES 5
MAXLOGMEMBERS 5
MAXLOGHISTORY 1
MAXDATAFILES 100
MAXINSTANCES 1
ARCHIVELOG
EXTENT MANAGEMENT LOCAL
DATAFILE '/d2/....' SIZE 500M REUSE
SYSAUX DATAFILE '/d2/....' SIZE 500M REUSE
DEFAULT TABLESPACE META
 DATAFILE '/d1/..../meta01.dbf' SIZE 15000M REUSE
 EXTENT MANAGEMENT LOCAL
 UNIFORM SIZE 10M
DEFAULT TEMPORARY TABLESPACE TEMP
TEMPFILE '/d2/..../temp01.dbf' SIZE 5000M REUSE
 EXTENT MANAGEMENT LOCAL
 UNIFORM SIZE 10M
UNDO TABLESPACE UNDO_TBS
 DATAFILE '/d3/..../undotbs01.dbf' SIZE 2000M
REUSE AUTOEXTEND ON MAXSIZE UNLIMITED
SET TIME_ZONE = 'PST';

One consideration in the create table statement is the size of the main data tablespace datafile.  In the example above, we create a META tablespace with an initial data file called "meta01.dbf" whose starting size is 15 GB.  We recommend that you either create a very large data file like this one or configure the tablespace to use AUTOEXTEND.  Ensure your system has enough disk space to accommodate the full volume of data.

Once the database is created, you should run the admin scripts to create the data dictionary tables and the structures needed for PL/SQL procedures.  For example:

  %  setenv ORACLE_SID UMLSDB
% sqlplus /nolog
SQL> CONNECT SYS/<sys pwd>AS SYSDBA 
SQL> @$ORACLE_HOME/rdbms/admin/catalog.sql
SQL> @$ORACLE_HOME/rdbms/admin/catproc.sql

With this done, you should now be ready to create a user.

Creating a User

Whether you create a database manually or with the configuration assistant, the next step is to create your own user. It is very important that you not load data as the SYS or SYSTEM user.  We recommend that you set the default tablespace for this user to the META tablespace you created and set the user quota to UNLIMITED. After creating a user, the next step is to grant this user the appropriate appropriate privileges. For example:

  %  setenv ORACLE_SID UMLSDB
% sqlplus /nolog
SQL> CONNECT SYS/<sys pwd> AS SYSDBA 
SQL> CREATE USER umls
SQL> IDENTIFIED BY umls_p4ssw0rd
SQL> DEFAULT TABLESPACE META
SQL> QUOTA UNLIMITED ON META;
SQL> GRANT CONNECT TO umls;
SQL> GRANT RESOURCE TO umls;

Note: this process can also be performed through a GUI interface with a tool like Oracle Enterprise Manager.

Configuring Parameters for Performance

Following are a number of initialization parameter recommendations designed for an Oracle 10g system to handle the performance load of your database. This guide is not intended to be exhaustive and is not guaranteed to increase performance of your specific system.  Please take the time to understand the individual Oracle parameters to maximize the individual performance of your system.

  • Automatic Shared Memory Management (ASSM) enables you to specify a total memory amount to be used for all System Global Area (SGA) components.  The Oracle database periodically redistributes memory between components according to workload requirements. Enable ASSM by using the SGA_TARGET initialization parameter. Values in the range 1G to 4G are recommended. By specifying a nonzero value for SGA_TARGET, the following five memory pools are automatically sized:
    1. database buffer cache (Default Pool)
    2. shared pool
    3. large pool
    4. Java pool
    5. Streams pool
  • If using ASMM, you must also set the parameter STATISTICS_LEVEL to TYPICAL or ALL. Note: When SGA_TARGET is not set or is equal to zero, auto tuned SGA parameters behave as in previous releases.
  • Ensure COMPATIBLE parameter is set to 10.0.0 or higher.
  • Ensure PGA_AGGREGATE_TARGET is explicitly set to a non-zero value.  PGA_AGGREGATE_TARGET is defaulted to 20 percent of the SGA size, unless explicitly set.
  • Set SORT_AREA_SIZE and HASH_AREA_SIZE parameters.  Increasing SORT_AREA_SIZE size improves the efficiency of large sorts. Each sort in a query can consume up to the amount specified by the parameter.  Note that there can be multiple sorts in a single query. Increasing HASH_AREA_SIZE improves the performance of hash join operations. Values in the 100M to 200M are recommended.

Configuring and Running the Load Script

Windows

If you are running MetamorphoSys in a Windows environment, in your output directory (after running MetamorphoSys with Oracle selected in the “Write Database Load Scripts” section of the "Output Options" tab) you should see a batch file script called "populate_oracle_db.bat".  This script can be used to load your subset into an Oracle database.   The directory will also contain a SQL script for creating the tables, a SQL script for creating the indexes, a SQL*Loader control file (.ctl) for each of the data files to be loaded, and a script for creating the join of MRCON and MRSO (MRCONSO). 

SQL*Loader is the Oracle tool for loading text files into a database and has a variety of parameters and configuration options (this too is outside the scope of this document).  The control files used by the script to load each file into an Oracle database table are designed to be relatively simple and relatively efficient.  If you are interested in more information, you are encouraged to look up the SQL*Loader Control File Reference in the Oracle manual.

Edit the script and configure the four parameters at the top of the script.  These are:

set ORACLE_HOME=<path to ORACLE_HOME>
set user=<username>
set password=<password>
set tns_name=<db_name>

ORACLE_HOME should be set to the installation directory of your local Oracle client or server.  This may be something like "D:\oracle\product\10.2.0\db_1".  After configuring these four parameters, this section of your script may look something like the following:

set ORACLE_HOME=D:\oracle\product\10.2.0\db_1
set user=myuser
set password=p4ssw0rd
set tns_name=UMLSDB

You are now ready to run the script.  To do so, navigate to your subset directory using the Windows explorer and double-click on the "populate_oracle_db.bat" file you just finished editing.

Here are some things to keep in mind when using this script:

  1. Your Oracle installation needs to be able to find the database associated with the "tns_name" you specified.  This is typically done through a "tnsnames.ora" file in the network/admin sub-directory of ORACLE_HOME.  If you created a database using the Database Configuration Assistant, this was probably done for you already.  If not, you may have to configure it yourself using a tool like Net Manager. To test the connection, you can use "tnsping" in the bin directory of ORACLE_HOME.
  2. The script will automatically create a log file called "oracle.log" in the same directory as the script.  If you are planning on running the script more than once, you should consider renaming the log file before your second run, otherwise the contents of the first run will be lost.
  3. While the script is running, you can monitor the process from within the SQL*Plus command shell by using "SQL> select sid, serial#, username, program  from v$session;"
  4. The load may take many hours (especially if you have selected a large subset). When it completes, each of the tables will be loaded, and an initial set of indexes will have been created.
  5. If you are running Oracle 10g and have the recyclebin parameter to ON, drop table statements in the load scripts do not purge the recyclebin area.  If you are using scripts to reload the UMLS, be aware that at the conclusion of the load, it may be desirable to "purge recyclebin" to ensure you are not keeping unwanted data around.

Linux, Macintosh, or Solaris

If you are running MetamorphoSys in a Linux, Macintosh, or Solaris environment, in your output directory (after running MetamorphoSys with Oracle selected in the “Write Database Load Scripts” section of the "Output Options" tab) you should see a .sh shell script called "populate_oracle_db.sh".  This script can be used to load your subset into an Oracle database.   The directory will also contain a SQL script for creating the tables, a SQL script for creating the indexes, a SQL*Loader control file (.ctl) for each of the data files to be loaded, and a script for creating the join of MRCON and MRSO (MRCONSO).

SQL*Loader is the Oracle tool for loading text files into a database and has a variety of parameters and configuration options (this too is outside the scope of this document).  The control files used by the script to load each file into an Oracle database table are designed to be relatively simple and relatively efficient.  If you are interested in more information, you are encouraged to look up the SQL*Loader Control File Reference in the Oracle manual.

Edit the script and configure the four parameters at the top of the script.  These are:

export ORACLE_HOME=<path to ORACLE_HOME>
user=<username>
password=<password>
tns_name=<tns_name>

ORACLE_HOME should be set to the installation directory of your local Oracle client or server.  This may be something like "/u01/app/oracle/product/10.2.0".  After configuring these four parameters, this section of your script may look something like the following:

export ORACLE_HOME=/u01/app/oracle/product/10.2.0
user=myuser
password=p4ssw0rd
tns_name=umls

Note: passwords containing $ or # characters may cause problems as these have special meaning within the shell environment.  In this case, you may want to enclose the password in quotes in the script itself (e.g. password="pass#word"). You are now ready to run the script.  To do so, navigate to your subset directory and invoke the script from the command line.  It may be necessary to change the file permissions to make the script executable before it can be started. For example:

% cd <subset directory>
% chmod 775 populate_oracle_db.sh
% populate_oracle_db.sh &

Here are some things to keep in mind when using this script:

  1. Your Oracle installation needs to be able to find the database associated with the "tns_name" you specified.  This is typically done through a "tnsnames.ora" file in the network/admin sub-directory of ORACLE_HOME.  If you created a database using the Database Configuration Assistant, this was probably done for you already.   If not, you may have to configure it yourself using a tool like Net Manager. To test the connection, you can use "tnsping" in the bin directory of ORACLE_HOME.
  2. The script will automatically create a log file called "oracle.log" in the same directory as the script.  If you are planning on running the script more than once, you should consider renaming the log file before your second run, otherwise the contents of the first run will be lost.
  3. While the script is running, you can monitor the process from within the SQL*Plus command shell by using "SQL> select sid, serial#, username, program  from v$session;"
  4. The load may take many hours (especially if you have selected a large subset). When it completes, each of the tables will be loaded, and an initial set of indexes will have been created.

Sample Queries

We have provided a number of sample queries to demonstrate how to access certain kinds of data.

Get the English preferred name for a UMLS CUI.
SELECT str FROM MRCON
WHERE cui = ? AND stt = 'PF'
AND ts = 'P' AND lat = 'ENG';

Get MRCONSO rows (atoms) for a SAB and CODE.
SELECT * FROM MRCONSO
WHERE sab = ? AND code = ?
ORDER BY cui, lat, lower(ts), lui, stt;

Get all MRCONSO rows (atoms) for a CUI, in precedence order.
SELECT * FROM MRCONSO
WHERE cui = ? ORDER BY cui, lat, lower(ts), lui, stt;

Get all MRCONSO rows (atoms), with STYs, in precedence order.
SELECT a.*, sty FROM MRCONSO a. MRSTY b
WHERE cui = ?
ORDER BY cui, lat, lower(ts), lui, stt;

Perform a norm string search in MRXNS_ENG to get CUI.
SELECT a.* FROM MRCONSO a, MRXNS_ENG b
WHERE a.cui = b.cui and nstr = ? and a.sui = b.sui;

Find all Semantic Types for a particular string.
SELECT sty FROM MRCON
a, MRSTY b WHERE a.cui = b.cui AND str = ?;

Find all contexts for a particular "atom". (Note: this requires creating an ORF subset with a MRCXT file).
SELECT * FROM MRCXT
WHERE cui = ? AND sui = ?
ORDER BY cui, sui, code, cxn, cxl, rnk;

Find all definitions for a particular string.
SELECT def FROM MRCON a, MRDEF b
WHERE a.cui = b.cui AND str = ?;

Find preferred MUI of a MeSH D#.
SELECT atv AS mui FROM MRSAT a, MRSO b
WHERE a.cui = b.cui
      AND a.sui = b.sui
      AND b.sab = 'MSH'
      AND b.tty = 'MH'
      AND atn = 'MUI'
      AND b.code = ?;