Good afternoon. I’m Tom Rindflesch. I do research in natural language processing at the Lister Hill Center, the research and development division of the Library. I would like to introduce an application we’re working on, which is not yet in production but is under development, and which we are hoping to bring out at some point. It is called Semantic MEDLINE. The idea is that this is an information management application that manipulates information in addition to documents. The particular way we’ll use it here is to demonstrate a way of managing the results of PubMed searches. So, in a way, it helps the user decide what to read from the results of a large PubMed search, which as you well know can easily be hundreds or even tens of thousands of citations. In addition, this connects knowledge from various sources and integrates application interfaces.

So a rough overview of how it might work: it sits totally on top of PubMed/MEDLINE. The idea ultimately will be to have all the resources available, but in addition, after retrieval, you will summarize the results with natural language processing applications that produce a visual graphic network of relationships. This is specifically the aspect that we call Semantic MEDLINE. Once you have done that, you can then choose some particular relationship that you are interested in, and this maintains links to the documents that produced that information, in addition to other information that I’ll go into in greater detail in just a second.

So this is seamless integration of NLM’s technologies, including information retrieval (the familiar PubMed/MEDLINE) and natural language processing, which is under development: a program called SemRep that I have been responsible for for a number of years, which is now beginning to come to fruition and to be applicable in a practical way. It represents the content of text with semantic predications. The term semantic predications is loved by linguists (of which I am one), and I’ll explain it in greater detail and specificity in just a minute. On top of that there is a process called abstraction summarization, which takes the information that is extracted from text and boils it down to what the system considers the most important, most salient information, which you can then choose and navigate around in. And finally this information is visualized with indicative links to source text and additional information.

So to give you a rough overview of how this works, the columns here are the NLM technologies. So we have off on the left MEDLINE/PubMed. Then we are going to look at the natural language processing, the summarization and the visualization.

So in a bit more detail we might do a familiar search, say breast cancer. We get some text, things like this, stuck inside of MEDLINE citations. Aromatase inhibitor provides mortality benefit in early breast carcinoma. Some other citation might talk about the genetics of this. Determined the spectrum and frequency of ATM missense variants in 443 breast cancer patients. So SemRep, the semantic interpretation aspect, takes the information in the text and turns it into a normalized representation, and this is what is called the predication. So this aromatase inhibitors treats breast carcinoma. That is known as a predication. It is kind of a clean, normal, formalized way of representing the complexity and the nastiness of natural language text. So the system does this for as much as it can in all the text that it sees -- this is typical of what it gets. It gets things in the clinical area, such as the first predication here and also in the genetics area, such as the ATM gene is associated with breast carcinoma. So it does that then for all of the text. It doesn’t get every single thing that is asserted in the text, but it gets as much as it can, and then it summarizes that.
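
As a rough illustration of the predication idea just described, a predication can be thought of as a subject-predicate-object triple that keeps a pointer back to the sentence it came from. The sketch below is only that: the class name, field names, predicate spellings and placeholder citation IDs are assumptions made for illustration and are not SemRep's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    """One normalized assertion extracted from a sentence (illustrative only)."""
    subject: str
    predicate: str
    obj: str
    pmid: str       # placeholder citation ID, not a real PubMed ID
    sentence: str   # the source sentence the assertion was read from


# The two example sentences from the talk, paired with the predications
# the talk says are produced from them.
examples = [
    Predication(
        "Aromatase Inhibitors", "TREATS", "Breast Carcinoma",
        "PMID-PLACEHOLDER-1",
        "Aromatase inhibitor provides mortality benefit in early breast carcinoma.",
    ),
    Predication(
        "ATM gene", "ASSOCIATED_WITH", "Breast Carcinoma",
        "PMID-PLACEHOLDER-2",
        "Determined the spectrum and frequency of ATM missense variants "
        "in 443 breast cancer patients.",
    ),
]


def render(p: Predication) -> str:
    """Return the predication in the 'subject PREDICATE object' reading."""
    return f"{p.subject} {p.predicate} {p.obj}"


for p in examples:
    print(render(p), "<-", p.pmid)
```

Keeping the citation ID and sentence on each record is what later makes it possible to go from a relationship back to the text that produced it.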

So what it does is it pinpoints or marks what it considers the information to be most salient to the text of this search. And what it does with that is it takes a list of those predications on one side and runs them through this process and boils them down and eliminates the ones that it doesn’t consider to be as important as others. So it does that. The user specifies a topic. In this case we say breast cancer, and then the system retains predications on that topic, eliminates uninformative predications, and retains the most frequent predications, which you can either retain or not, as we will see. So in this case it would keep things like Tamoxifen treats breast carcinoma, which is specific and potentially interesting, but it would eliminate those at the bottom, like tamoxifen treats patients. This is something, that if we want a summary, we don't need to be told that. Nor that breast carcinoma is a process of an individual. We don't need to be told that if we are looking for a summary. Something has to go. So those are eliminated. And then, finally, it represents this information as a graph with links to other information. So we link back to the text itself, also to the UMLS, and structured biomedical data. So far we link to the Genetic Home Reference, which Sonya just showed you, and also to Entrez Gene, NCBI’s huge representation of genetic information. So remember we have these predications. We have tamoxifen treats breast carcinoma, aromatase inhibitors treats..., ATM Gene associated with... And we maintain links.
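
The three summarization steps just described (keep predications on the topic, drop uninformative ones, optionally keep only the most frequent) can be sketched as a simple filter. This is a minimal sketch under stated assumptions, not the actual summarization algorithm: the function name, the small set of "generic" arguments and the made-up counts in the sample data are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical stand-in for "uninformative" arguments. The real system's
# criteria are not spelled out in the talk.
GENERIC_ARGUMENTS = {"Patients", "Individual"}


def summarize(triples: List[Triple], topic: str,
              keep_top: Optional[int] = None) -> List[Tuple[Triple, int]]:
    """Filter predications for a topic and rank them by frequency."""
    # 1. Retain predications that mention the user's topic.
    on_topic = [t for t in triples if topic in (t[0], t[2])]
    # 2. Eliminate predications with an overly general argument.
    informative = [t for t in on_topic
                   if t[0] not in GENERIC_ARGUMENTS
                   and t[2] not in GENERIC_ARGUMENTS]
    # 3. Rank by frequency; keep_top=None keeps everything.
    return Counter(informative).most_common(keep_top)


# Sample data: the repetition counts are invented for the example.
extracted = (
    [("Tamoxifen", "TREATS", "Breast Carcinoma")] * 3
    + [("Aromatase Inhibitors", "TREATS", "Breast Carcinoma")] * 2
    + [("ATM gene", "ASSOCIATED_WITH", "Breast Carcinoma"),
       ("Tamoxifen", "TREATS", "Patients"),
       ("Breast Carcinoma", "PROCESS_OF", "Individual")]
)

for triple, count in summarize(extracted, "Breast Carcinoma", keep_top=2):
    print(count, *triple)
```

Passing `keep_top=None` corresponds to the option of keeping all predications rather than only the most frequent ones.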

So now I’d like to give you a demo of this live to show you how it works. This is a demo, and I am showing it to you live, but the MEDLINE searches are not live, because for this to truly work we have to preprocess all of the 15 million or so citations in MEDLINE, and we don't have that done yet. We also don’t have this implemented and ready for the general public. But this demo is available to anyone and you can play around with this much. What it amounts to, the demo part of it, is that the MEDLINE searches have been preprocessed. We have done searches of 500 citations, the most recent citations for certain dates. And let's see. I am going to demonstrate with Alzheimer's disease. We do what would be the search. In this case it just goes and actually gets the citations. And they are here. The titles are listed here. The PubMed IDs are listed here, and if you want to actually go to PubMed and look at the real citation, it brings up -- I think this is the abstract plus, and so you have access to full text if you want it. But for now what we are going to do is actually go through the Semantic MEDLINE processing, and we are going to summarize this and represent the predications as a graph. So we have various options here. We can summarize from various points of view. So if you have text on breast cancer, you might say I am interested in treatment of breast cancer, or I am interested in substance interactions involved with breast cancer, perhaps the genes, proteins and drugs that are involved, or you might be interested in diagnosing breast cancer, or in the pharmacogenomics, that is, the kind of cutting-edge genetic research which indicates how a person's individual genetic makeup interacts with drugs. So for now we are going to start with looking at treatment of disease, and we can also pick the topic. 
In this case I searched on Alzheimer’s disease, so I am actually going to summarize this on Alzheimer’s as well. Now we can also keep only the most frequent predications or we can decide to keep all of them that SemRep retrieved. Initially, to show you how the system works, I will say keep only the most frequent. So I click here, summarize and visualize, and I get here a -- let's see -- (This is a laptop. I shouldn’t say it, but really, a desktop would be better, to have more real estate. But we work with what we’ve got.)

We get a graph which represents all of the predications that SemRep extracted, and it shows – we’re to read these, we’re to look at a node. And all of these nodes (that is the boxes) are color-coordinated by semantic group from the UMLS. These light yellow ones are substances or drugs. These pinkish ones (light mauve, say) are disorders. And let's see. The blue here are groups of individuals. So we have participants, African Americans, and these darker ones are genes. So you look at one of the nodes and then you read that by saying, “what does this say?” by looking at the direction of the arrow and the color off on the right. So here I see the APP Gene. This is red. So, causes Alzheimer's disease. If I am interested in treatment I may say -- let's see. The antipsychotic agents treat Alzheimer's disease. Something else, depressive disorder coexists with Alzheimer's disease. And I can also see that if I click on one of the nodes I can see down in the right how many citations this occurred in, and how many predications or relationships were extracted with this. So I can just glance at this and get an overview of the information that was in all 500 of those citations on Alzheimer's disease, kind of the most important information in just one glance. This, in the research, is called the informative aspect of this summary. There is something called the indicative aspect, if I wanted to drill down and get more information. I could say, evidently, depressive disorder, co-occurring with Alzheimer’s, is a rather important aspect of current research in Alzheimer’s disease. So if I want to look at some of the citations on that, I go here. I right click on that line and then when I say display citations it brings up the citations down here and the sentence from which the -- that relationship -- depressive disorder co-occurs with Alzheimer’s, occurs. And I can see that I can now look at the research that is involved. Let's say that I have to – there were like three citations. 
Oops -- let me. So here I can go down and see what kind of -- you know, what is actually being said in these papers. I see that depression is a big aspect of Alzheimer's.

Now, this kind of gave me a rough overview. But, let's say I want a little bit more information about Alzheimer’s, the treatment. What are some of the cutting edge latest research in the treatment of Alzheimer's disease? What I can do in order to get that, is I can go back and resummarize, but I can say, this time, turn this, don't filter out just the most frequent predications, but rather, give me all of them. So if I go back and summarize that, I will get a rather dense, a rather large graph. You will see in a minute, but let me show you how to (-- our Internet connection here is not absolutely the top notch. It is taking a minute to reload.) I get this graph which is now rather dense and rather large and I said, this is too much. What I can do is I can go -- I am only really interested in Treats. So I can get rid of the other predications. I can click these off. Causes, coexist with, isa, location of, process of, and now I am going to have left some -- quite a few predications about the treatment of Alzheimer's disease. And let me just -- I can move back here. Some of the -- the thing is, what I want to point out to you is how this allows me – now, I am not -- I have a PhD in linguistics, I actually am not a medical person at all, but I can see here that there are iron chelating agents. It says iron chelating agents treats Alzheimer's disease. This is something I have never heard of and was not aware. I am not exactly sure what an iron chelating agent is. I have heard of it with lead poisoning. It has to do with an agent that goes in and pulls out the iron. So let's take a look at this and see what the citation has to say. So here we see that (-- this connection is a little bit slower and it takes awhile to refresh). We see that this is an article published a few months ago last fall talking about nanoparticle iron chelators: a new therapeutic approach in Alzheimer's disease and other neurologic disorders associated with trace mineral imbalance. 
So the thing is, this sort of shows us, this is the latest research that says that accumulating evidence suggests that oxidative stress may be a major etiologic factor in initiating Alzheimer's disease, which is something I think sort of cutting edge research in Alzheimer's disease. This would be difficult to find for the normal person unless you were an expert in Alzheimer's disease therapy. The system sort of hands this to you on a silver platter, as it were.
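
The two interactions shown in the demo, switching off every relationship type except Treats and then pulling up the citations behind one line of the graph, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not how the demo is implemented: the function names, predicate spellings and citation IDs are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, str]          # (subject, predicate, object)
Record = Tuple[str, str, str, str]   # edge plus a citation ID


def build_graph(records: Iterable[Record]) -> Dict[Edge, Set[str]]:
    """Group citation IDs under the relationship they support."""
    edges: Dict[Edge, Set[str]] = defaultdict(set)
    for subject, predicate, obj, pmid in records:
        edges[(subject, predicate, obj)].add(pmid)
    return dict(edges)


def keep_predicates(edges: Dict[Edge, Set[str]],
                    wanted: Set[str]) -> Dict[Edge, Set[str]]:
    """Keep only edges whose predicate is switched on, like the checkboxes."""
    return {edge: pmids for edge, pmids in edges.items() if edge[1] in wanted}


# Relationships mentioned in the demo; the citation IDs are placeholders.
records = [
    ("Iron Chelating Agents", "TREATS", "Alzheimer's Disease", "cit-1"),
    ("Antipsychotic Agents", "TREATS", "Alzheimer's Disease", "cit-2"),
    ("APP gene", "CAUSES", "Alzheimer's Disease", "cit-3"),
    ("Depressive Disorder", "COEXISTS_WITH", "Alzheimer's Disease", "cit-4"),
    ("Depressive Disorder", "COEXISTS_WITH", "Alzheimer's Disease", "cit-5"),
]

graph = build_graph(records)
treats_only = keep_predicates(graph, {"TREATS"})

# "Display citations" for one edge: look up the IDs stored with it.
edge = ("Iron Chelating Agents", "TREATS", "Alzheimer's Disease")
print(sorted(treats_only[edge]))
```

Because every edge carries its own set of citation IDs, filtering the graph never loses the link back to the source text.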

Before I let you go, I want to show you one more aspect of the system and looking at -- let's look at -- I talked about substance interactions. Let's re-summarize the documents on substance interactions. Notice this list of topics now changes to focus not on diseases or drugs, but rather on actual genes. So now I have at the top, the APP gene, which is the most frequent, it is well known that this is associated with Alzheimer's disease, but the thing I want to point out is that we’re really not using this top-level. We may know that the APP gene is associated with Alzheimer’s disease, but we want to find out, what is the latest research on the APP gene and Alzheimer’s disease. So when we summarize for that, so we find down here -- let's see. We say that this -- here is the APP gene that is associated with the CDK5 gene. So we want to take a look at the citation associated with that, but furthermore, remember I said we could link to other NLM information. So this CDK5 gene, we can look in Entrez Gene and see what information is associated with this gene in Entrez Gene.

(So let's see -- it is going to pull it up? Let's try that again. The connection is just a little bit slow. Oh, I see what my problem is. I have the pop-ups blocked. Unfortunately -- options? Okay. Oh, yes. That is it. Thanks, Kevin. I always tell people that I am a theoretician and not really a technical specialist.)

If you go to Entrez gene and we see CDK5, the gene, I think what is interesting here is that when we go to GeneRIFs which as you know, these are the -- sort of the indexers add this, reference into function, snippets of text from citations associated with this gene, giving some information as to this gene’s function, and what’s sort of interesting is that when we search here, we don't find this CDK gene. I don't think it is associated yet with -- let me actually search to see if it talks about -- okay. There is something about Alzheimer’s disease, which, I guess, is good. The thing is, we can go back and say, well, we have as much information as we could about this CDK5 gene from Entrez Gene. Let's see what the latest research has to say about this, with its interaction with APP, and I think what is interesting is that, this research here is about – TNFalpha involved with inflammation, which is one of the hottest new topics about neurodegenerative diseases in inflammation, in causing these, and so this TNFalpha, which is somehow related with this APP gene and the CDK5 gene, and in the bottom I think that these authors suggest that this is research leading toward new pathways toward treatment of Alzheimer's disease.

So due to time I am going to stop there with the overview and take questions, and if you have any further – and you can come up and get the URL from me if you want to play around with this, and we would give you where we are headed with this and let me help you to make it available. So thanks. I think we are probably getting toward the end. Janet is probably about ready to have -- questions? Okay. Let me put it back up. Sorry. Yes, I should have -- the easiest will be if I type it up on the screen.

[http://skr3.nlm.nih.gov/SemMedDemo/]

If you can get one up there.

Okay. And there it is at the top. Let me just quickly go back to this. I will adjust to another. Those are all of the arguments extracted from the predications. We are taking all the arguments and sorting them in frequency of occurrence. Any other questions? Let's see. Are we ready to draw? Yes, that is a good question. It is not of course really up to me. It is library policy. So I really cannot say for sure. This you can use right now, but as far as when you can really do a search, I just can't say. We hope, I don't know, a year or two maybe. But okay. We have number 142466.