2019 NLM Informatics Training Conference
Welcome to the NLM Informatics Training Conference!
The National Library of Medicine supports research training in biomedical informatics and data science at sixteen (16) educational institutions in the United States. These training programs offer graduate education and postdoctoral research experiences in a wide range of areas, including health care informatics, translational bioinformatics, clinical research informatics, and public health informatics. In all of these areas, biomedical data science concepts and methods are part of the core curriculum. Seven programs also offer special tracks in environmental exposure informatics.
Each year an Informatics Training Conference is convened to bring NLM trainees together to showcase their work, to evaluate the full scope of current work in the field, and to meet their peers. The 2019 NLM Informatics Training Conference will be held at Indiana University in Indianapolis, IN.
Trainees appointed at the Veterans Administration (VA) sponsored training programs, NLM’s intramural trainees, and informatics trainees at the NIH Clinical Center are also invited to attend and make presentations.
2019 NLM Training Conference Participants. Photo credit: Phil Lofton, Regenstrief Institute
Best Plenary (Clinical Informatics Applications & Quality Plenary): Rachel Stemerman
Title: Using Interactive Data Visualization to Drive Quality Improvement in Pre-Hospital Emergency Medical Services
Best Plenary (Bioinformatics & Computational Biology Plenary): Lauren Baker
Title: Exploiting Biological Priors to Enhance GWAS in a Dog Model of ACL Rupture
Abstract: Anterior cruciate ligament rupture (ACLR) is a common condition that disproportionately affects young people, 50% of whom will develop knee osteoarthritis (OA) within 10 years of rupture. ACLR has both genetic and environmental risk factors. The genetic basis of ACLR remains unexplained. Spontaneous ACLR in the dog has a similar disease presentation and progression. Breed predisposition supports a genetic influence. The dog is a valuable genomic model for ACLR, as extensive linkage disequilibrium in dogs facilitates genome-wide association study (GWAS). Biologically-relevant priors can be assigned in Bayesian mixture model (BMM) analysis to aid locus discovery. RNA sequencing was performed on ACL and synovium tissues from four ACLR affected and four control dogs. After correction for multiple testing, 186 and 374 differentially expressed genes (DEGs) were identified between ACLR case and control samples in ACL and synovium tissue respectively. Biological priors were incorporated into GWAS analysis by assigning SNPs within differentially expressed genes to separate mixture classes using the BMM algorithm BayesRC. Moderate effect SNPs were identified within genes that have roles in extracellular matrix homeostasis, inflammatory disease, gene transcription, and DNA replication. While these results are consistent with previous work, they are potentially explainable by mechanical inflammation as a result of the spontaneous rupture event rather than disease process. RNA sequencing of synovial biopsies from 8 dogs with spontaneous knee injury not associated with progressive disease (luxated patella) will be incorporated into differential expression analysis to control for the effect of mechanical inflammation on results. Repeat BMM GWAS analysis will then be performed. 
A better understanding of how gene expression profiles are altered in ACLR affected dogs, and identification of associated variants within DEGs through BMM analysis will help clarify the underlying genetic basis of ACLR. Genetic discovery in this model may help explain ACLR in human beings.
Best Plenary (Precision Medicine Plenary): Michael Ding
Title: Predicting Cancer Drug Effectiveness with Deep Learning
Abstract: Introduction. The future of precision medicine in cancer depends on the continued development and proper application of molecular, biologic, and immunological treatments. Traditional nonspecific therapies typically lack a precision component and are often administered on a trial and error basis. Meanwhile, targeted therapies are prescribed based on imperfect single-gene biomarkers. There exists a critical need for accurate companion diagnostic tests to guide the application of new and existing treatments. In this study, we develop a novel deep learning-based approach utilizing genomic and transcriptomic data for predicting tumor response to cancer medication. Methods. Using data from the Genomics of Drug Sensitivity in Cancer (GDSC) and The Cancer Genome Atlas (TCGA), we created a computational model for predicting drug sensitivity from integrated omics data. In a semi-supervised fashion, the model utilizes deep learning to construct and incorporate significant features from TCGA and leverages drug screening information from GDSC to learn and generate predictions. Results. This approach has successfully predicted drug sensitivity with high accuracy for 260 different nonspecific and targeted compounds in a variety of preclinical and clinical models. To date, external validation has been performed in immortalized pan-cancer cell lines, patient-derived pancreatic cancer primary cells, and patient-derived bladder, liver, and colorectal cancer organoids. Evaluation in clinical tumors is currently in progress. Discussion. To our knowledge, these are the first models to successfully integrate several classes of clinical omics features with cell line drug response data in the task of predicting drug sensitivity. The results of this study demonstrate the power of deep learning for modeling complex interactions in high-dimensional biomedical datasets.
Best Focus Talk Day 1: Andrew King
Title: Using Machine Learning to Highlight Relevant Data in a Learning EMR
Abstract: The vast amount of data stored in a patient’s Electronic Medical Record (EMR) makes it difficult for clinicians to get the information they need. We use machine learning to predict which patient data a clinician is likely to seek and we highlight those data in the EMR. In this talk, I report how our “Learning” EMR (LEMR; pronounced lemur) system is trained and what happened when clinicians used it. To collect training data, eleven critical care clinicians (intensivists) used a custom-made EMR display interface to prepare a rounding presentation for 178 de-identified intensive care unit (ICU) cases. We recorded what information clinicians needed for each patient using a combination of eye-tracking and user provided annotations. The LEMR machine learning model takes as input a vector representation of a patient case (patient demographics, test results, etc.) and it provides as output a list of patient data that a clinician is likely to seek (e.g. the patient’s heart rate, temperature, and the antibiotic dosing regimen). To evaluate the models, we conducted an evaluation study featuring 12 intensivists who each prepared a rounding presentation for 18 new patient cases. The intensivists were assigned to control and experimental groups where the interface for the experimental group highlighted data based on model output. We used the information seeking behavior of the control group as a gold standard for evaluating model performance. When considering 25 different pieces of patient data, the LEMR model achieved precision of 0.52 (95% CI 0.49, 0.54) and recall of up to 0.77 (95% CI 0.75, 0.80). For data not highlighted by the system, clinicians rated the effect of not seeing those data as no or minor impact in 81.9% of the cases. Data-driven approaches for adaptive EMR systems show promise for supporting clinical decision making and enhancing user experience with EMRs.
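For readers unfamiliar with the reported metrics, the following is a minimal sketch of how precision and recall could be computed for one patient case, treating the data items the control group sought as the gold standard. The item names and the `precision_recall` helper are hypothetical illustrations and are not taken from the LEMR system.

```python
def precision_recall(highlighted, sought):
    """Precision and recall of highlighted items against gold-standard sought items."""
    highlighted, sought = set(highlighted), set(sought)
    true_positives = len(highlighted & sought)
    # Precision: share of highlighted items that clinicians actually sought.
    precision = true_positives / len(highlighted) if highlighted else 0.0
    # Recall: share of sought items that the system highlighted.
    recall = true_positives / len(sought) if sought else 0.0
    return precision, recall


# Hypothetical example for a single patient case.
highlighted = {"heart_rate", "temperature", "antibiotic_regimen", "sodium"}
sought = {"heart_rate", "temperature", "creatinine"}
precision, recall = precision_recall(highlighted, sought)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of four highlighted items were sought, and two of three sought items were highlighted.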
Best Focus Talk Day 2: Lisa Grossman
Title: A Method for Harmonization of Clinical Abbreviation and Acronym Sense Inventories
Abstract: Previous research has developed methods to construct acronym sense inventories from a single institutional corpus. Although beneficial, a sense inventory constructed from a single institutional corpus is not generalizable, because acronyms from different geographic regions and medical specialties vary greatly. Here, we developed an automated method to harmonize sense inventories from different regions and specialties towards the development of a comprehensive inventory. The method involves integrating multiple source sense inventories into one centralized inventory and cross-mapping redundant entries to establish synonymy. To evaluate our method, we integrated 8 well-known source inventories into one comprehensive inventory (or metathesaurus). For both the metathesaurus and its sources, we evaluated the coverage of acronyms and their senses on a corpus of 1 million clinical notes. The corpus came from a different institution, region, and specialty than the source inventories. In the evaluation using clinical notes, the metathesaurus demonstrated an acronym (short form) micro-coverage of 94.3%, representing a substantial increase over the two next largest source inventories, the UMLS LRABR (74.8%) and ADAM (68.0%). The metathesaurus demonstrated a sense (long form) micro-coverage of 99.6%, again a substantial increase compared to the UMLS LRABR (82.5%) and ADAM (55.4%). Given the high coverage, harmonizing acronym sense inventories is a promising methodology to improve their comprehensiveness. Our method is automated, leverages the extensive resources already devoted to developing institution-specific inventories in the United States, and may help generalize sense inventories to institutions who lack the resources to develop them. Future work should address quality issues in source inventories, explore additional approaches to establishing synonymy, and evaluate the metathesaurus' utility for acronym sense disambiguation.
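The idea of merging inventories and measuring coverage can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it assumes that micro-coverage weights every acronym occurrence in the corpus equally, and it uses simple lowercasing as a stand-in for the cross-mapping step, which the abstract does not specify. All inventory contents and function names are hypothetical.

```python
from collections import Counter


def harmonize(*inventories):
    """Merge source inventories (acronym -> set of long forms) into one inventory.

    Assumption: long forms that differ only in case or surrounding whitespace
    are treated as the same sense, a naive stand-in for synonymy cross-mapping.
    """
    merged = {}
    for inventory in inventories:
        for acronym, senses in inventory.items():
            bucket = merged.setdefault(acronym, set())
            bucket.update(sense.strip().lower() for sense in senses)
    return merged


def micro_coverage(corpus_acronyms, inventory):
    """Fraction of acronym occurrences in the corpus that appear in the inventory."""
    counts = Counter(corpus_acronyms)
    total = sum(counts.values())
    covered = sum(n for acronym, n in counts.items() if acronym in inventory)
    return covered / total if total else 0.0


# Hypothetical source inventories and corpus.
inv_a = {"MI": {"myocardial infarction"}, "CHF": {"congestive heart failure"}}
inv_b = {"MI": {"Myocardial Infarction", "mitral insufficiency"}, "PT": {"physical therapy"}}
merged = harmonize(inv_a, inv_b)
corpus = ["MI", "MI", "PT", "CHF", "RA"]
print(micro_coverage(corpus, inv_a), micro_coverage(corpus, merged))
```

The merged inventory covers more corpus occurrences than either source alone, which is the effect the reported coverage figures describe.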
Best Open Mic: Pavan Kota
Title: Generalized Microbial Sensing with Algorithmically Designed DNA Probes
Abstract: In time-sensitive clinical infections, clinicians must often prescribe prophylactic broad-spectrum antibiotics before confirmatory diagnosis because of the limitations of existing pathogen identification systems. The pathogen-specific sensors found in most molecular diagnostic tests necessitate a unique sensor for every possible microbial target. In contrast, direct DNA sequencing can identify any known pathogen but suffers from high costs and turnaround times. We are developing an unconventional alternative in which nonspecific DNA probes can identify bacterial targets through compressed sensing where the number of required probes scales logarithmically with the number of targets. The probes bind with partial complementarity at multiple locations along whole bacterial genomes. Each species is given a characteristic fingerprint, or affinity vector, based on the expected number of binding events of each probe to its genome.
While previous work utilized random probe sequences, we now present a new approach for the application-driven design of optimal probe sequences to strategically separate bacterial affinity vectors. With methods from natural language processing and machine learning, we are developing an approximation of affinity vectors that is several orders of magnitude faster than standard thermodynamic simulations, enabling rapid iteration in a genetic algorithm. Importantly, any optimality criteria can be easily substituted into our algorithm. With our probe design framework, we can maximize the angular spacing of affinity vectors to enable greater accuracy in recovering individual bacterial concentrations from mixed samples, highlighting the potential for diagnosing polymicrobial infections. We are also currently investigating new metrics to design DNA probes such that species’ fingerprints preserve the taxonomic relationships between them. If successful, entirely novel pathogens could still be immediately mapped to their closest relatives to provide first responders with clinically relevant information to make informed decisions.
Acknowledgement: NLM Training Program in Biomedical Informatics and Data Science T15LM007093, Director Dr. Lydia Kavraki.
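The angular spacing of affinity vectors mentioned in the abstract can be illustrated with a short sketch. It assumes, as one possible optimality criterion, that a probe set is scored by the smallest pairwise angle between species' affinity vectors; the abstract does not state the exact criterion, and the species names and binding counts below are hypothetical.

```python
import math
from itertools import combinations


def angle_between(u, v):
    """Angle in radians between two nonzero affinity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def min_pairwise_angle(affinity_vectors):
    """Smallest angle between any two species' vectors (larger is better separated)."""
    return min(
        angle_between(u, v)
        for u, v in combinations(affinity_vectors.values(), 2)
    )


# Hypothetical expected binding counts for three probes per species.
affinity_vectors = {
    "species_a": [10.0, 0.0, 0.0],
    "species_b": [0.0, 5.0, 0.0],
    "species_c": [3.0, 3.0, 0.0],
}
worst_case = min_pairwise_angle(affinity_vectors)
print(math.degrees(worst_case))
```

A design loop such as a genetic algorithm could use a score like `min_pairwise_angle` as its fitness function, preferring probe sets that increase it.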
Best Poster Day 1: Garrett Eickelberg
Title: Novel MIMIC-III Data Pipeline for Patients with Serious Bacterial Infections
Abstract: The overall goal of this project is to develop data-driven methods to predict patient-centered serious bacterial infection (SBI) risk. In order to produce a generalizable SBI risk model from electronic health record (EHR) data, we first had to develop a flexible data pipeline capable of extracting raw clinical data and outputting relevant features for model building. This poster will detail the methods and challenges associated with creating this pipeline using the MIMIC-III dataset. The MIMIC-III dataset contains de-identified EHR data from over 15,000 adult patients receiving antibiotic therapy in the ICU at the Beth Israel Deaconess Medical Center between 2001 and 2012. The MIMIC dataset has been used extensively to develop and demonstrate the feasibility and utility of predictive analytics to aid in solving clinical problems (1). Working towards our goal of developing patient-centered SBI risk quantification methods, we developed a novel data pipeline for MIMIC-III data that can identify a cohort of qualifying patients; wrangle sparse longitudinal clinical data across specified clinical time windows; perform data cleaning tasks (standardizing units, removing erroneous values, transforming data, etc.); join all data into a single table; and flatten predictors into single vectors per patient via clinically guided aggregation. The assembly of this pipeline provides us with a flexible framework to assemble and model datasets that can be tailored to accommodate different types of relevant time windows, aggregation methods, response variables, and predictor variables. Constructing this pipeline has furthered my understanding of clinical data structures, data science project architecture, and common clinical data quality issues, such as the complexity of aggregating data from multiple sources and dealing with data entry errors.
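The cleaning, windowing, and flattening steps described above can be sketched in a few lines. This is a toy illustration, not the poster's pipeline: the row layout, plausibility ranges, 24-hour window, and per-variable aggregators are all assumptions chosen for the example, and none of the names correspond to actual MIMIC-III tables or columns.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format rows: (patient_id, hours_since_admission, variable, value).
rows = [
    (1, 2.0, "heart_rate", 88.0),
    (1, 10.0, "heart_rate", 104.0),
    (1, 30.0, "heart_rate", 95.0),     # falls outside the 24-hour window
    (1, 5.0, "temperature_c", 38.4),
    (2, 1.0, "heart_rate", 72.0),
    (2, 3.0, "temperature_c", -1.0),   # implausible value, removed in cleaning
]

# Assumed plausibility ranges and clinically guided aggregators per variable.
PLAUSIBLE = {"heart_rate": (20.0, 300.0), "temperature_c": (25.0, 45.0)}
AGGREGATORS = {"heart_rate": max, "temperature_c": mean}


def flatten(rows, window_hours=24.0):
    """Clean, window, and aggregate measurements into one feature dict per patient."""
    grouped = defaultdict(list)
    for patient_id, hours, variable, value in rows:
        low, high = PLAUSIBLE[variable]
        in_window = 0.0 <= hours <= window_hours
        if in_window and low <= value <= high:
            grouped[(patient_id, variable)].append(value)
    features = defaultdict(dict)
    for (patient_id, variable), values in grouped.items():
        features[patient_id][variable] = AGGREGATORS[variable](values)
    return {patient_id: dict(feats) for patient_id, feats in features.items()}


features = flatten(rows)
print(features)
```

Variables with no valid measurement in the window are simply absent from a patient's feature dict here; a real pipeline would need an explicit strategy for such missing values.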
Best Poster Day 2: Ramiz Iqbal
Title: Clinical Phenotyping Using Semantic Knowledge Graphs
Abstract: Clinically phenotyping a patient as having a particular disease is often a difficult task to conduct at scale and a persistent problem for healthcare systems. Automating the task of phenotyping is confounded by the need for deep expertise in particular domains and the terminology itself. The Unified Medical Language System (UMLS), a controlled vocabulary developed for the task of extracting biomedical concepts, has been utilized to overcome some of the aforementioned issues. However, the UMLS is known to be fraught with inconsistencies and errors, which have affected the ability of researchers to use this vocabulary to automate tasks.
We propose to construct a knowledge graph where the relatedness of UMLS terms is determined by their semantic relationship in biomedical literature rather than where they are in the UMLS hierarchy. Basing the relationship between the terms on the literature filters the UMLS for the most utilized terms for a specific disease. The approach we suggest would help overcome both the variability inherent to the UMLS terminology and the lack of deep domain knowledge needed to phenotype a patient. We will also construct the knowledge graph so that it learns from new biomedical literature, allowing the task of phenotyping patients to improve over time and adapt to changes in the state of knowledge.
A patient would be phenotyped by comparing the UMLS terms in a patient’s EHR to the knowledge graph to yield a probability of the patient having a phenotype, e.g., breast cancer. This process can be conducted across thousands of records in far less time than the gold standard of having a physician review the patient’s record.
Lastly, our method is based on literature and can be expanded to phenotype other diseases; it is therefore widely applicable to any healthcare system and disease.
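The comparison of a patient's terms against the knowledge graph might look like the sketch below. The abstract does not specify how the probability is derived, so this example assumes a simple normalized weighted overlap as a stand-in score. The graph contents, relatedness weights, and plain-text term names (used in place of real UMLS concept identifiers) are all hypothetical.

```python
# Hypothetical literature-derived relatedness: phenotype -> {term: weight}.
knowledge_graph = {
    "breast cancer": {
        "mastectomy": 0.9,
        "tamoxifen": 0.8,
        "mammography": 0.5,
        "fatigue": 0.1,
    },
}


def phenotype_score(patient_terms, phenotype, graph):
    """Share of the phenotype's total relatedness weight found in the patient's record.

    Assumption: this normalized overlap stands in for the probability mentioned
    in the abstract; it is not a calibrated probability.
    """
    weights = graph[phenotype]
    total = sum(weights.values())
    matched = sum(w for term, w in weights.items() if term in patient_terms)
    return matched / total if total else 0.0


# Terms extracted from one hypothetical patient's EHR.
patient_terms = {"tamoxifen", "mammography", "cough"}
score = phenotype_score(patient_terms, "breast cancer", knowledge_graph)
print(round(score, 3))
```

Because the score is a lookup and a sum, it can be run across thousands of records quickly, which matches the scaling argument made above.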
Last Reviewed: July 13, 2020