Event ID: 967291
Event Started: 4/2/2008 8:19:37 AM ET

She is right but the drawbridge is coming up and you guys are on the other side. We want you to take advantage of that fact and get to know each other really well. This is your support network for molecular biology resources. We are here also, and there is a large group of librarians now forming out there. If you go to meetings you see more and more of this occurring in the libraries. This was a big push from the National Library of Medicine at one point to get medical libraries to be the centers for bioinformatics support on medical school campuses. This is where this whole thing came about from. Unfortunately, due to the budget cuts, it might not go on for much longer. I think we have enough people out there who have a connection and the knowledge to keep it going within that growing cadre of people. If you keep those connections alive and get in touch with each other and with us, we should be able to do it. We are going to have you introduce yourself so you get to know each other from the beginning. We will introduce ourselves as well. My name is David Osterbur. I am the head of access and public services, which means circulation, interlibrary loan and reference. I do a lot of things at the library. The [ indiscernible ]-There are a lot of exciting things going on and we will talk more about that, probably on Wednesday when we talk about librarian involvement in this whole thing. Countway has its own--We are really pushing in that direction. I have a great job. It is very exciting, the things that are happening there. The Countway is a great place to work. I have a Ph.D. in genetics from the University of California, Berkeley. I was a practicing geneticist for some time and then decided to switch because of frustration with grant funding, another funding problem with the government, and became a librarian because I always liked libraries and always spent a lot of time in libraries as a researcher. 
I have been able to put those two things together and it has really worked for me. It keeps me involved in the science. It is not a 24 hour a day job. That is good for me.


You may have to remind us that it is time for a break. I tend to forget and just keep on going.


If we get your questions wrong, please feel free to correct us. The other thing is that I will be sitting back there in the corner. If you have a problem and do not want to bring it to the attention of the front of the class,--See if I can figure it out. It is something that everyone is having a problem with--Tomorrow he is doing [ indiscernible ] modules-[ AUDIO/SPEAKER NOT CLEAR] it is going to be informative. I am thinking it is pretty warm. I will talk to them about that. If there is anything like that--for three days, you will want to be comfortable. Are there any questions? I teach, and I am sure that [ indiscernible ] also, we teach Ph.D.-level classes on how to do this. I tend to go fast because I am used to them understanding what I am saying. If you do not understand or have a question, or I click on something too fast or go somewhere and do something that you did not see me do, raise your hand and let me know that you did not see that. Either someone in the back will help you or I will go back through it and go slower.


I am terrible with names. As long as you sit in the same place every day.



I would like to know a little bit about what your biology backgrounds are, too.

[ Class Participant Speaking ].

We don't actually offer a structured set of objectives for this 3-day course. For some people it's way too overwhelming, for others it's not enough. Check in with yourself on how you're doing. If there's something we can help you with, or you can help each other with [ Speaker Faint/Audio Unclear ] I will say it's actually easier in my opinion to develop a basic teaching program after these three days than it is to become a [ Speaker Faint/Audio Unclear ] in three days. Because the diversity of what you could be asked to interpret in terms of [ Indiscernible ] is a much broader scheme of things. On day three we'll show you some of the workshops that your colleagues have already developed. [ Speaker Faint/Audio Unclear ]. You will find it easy to see what you feel comfortable with after three days. [ Speaker Faint/Audio Unclear ]. You need to have a basic [ Indiscernible ]. I think after the three days you will be able to pick pieces out of what you have learned to do that. Um, so the teaching piece I feel we can help you with. The reference, that's hard, even now. Even after I've been in this program for seven years, people will still bring me a question and I will have no clue what they just tried to do in this experiment and how I'm to help them. [ Speaker Faint/Audio Unclear ]. Remember that you are experts, you have databases to search. [ Speaker Faint/Audio Unclear ]. All of that kind of stuff that we actually tend to do. The tool is probably capable of doing what they wanted it to do, but they only used the basic [ Indiscernible ] of the tool. Feel good about your expertise, you will invest the time in those things that your scientists might not.

Exactly. Librarians like to search and everybody else likes to find. Most scientists use a tool like BLAST as if it's Google. And they do. They go in and put their thing in and push the I'm Feeling Lucky button. They don't pursue it. Probably 90% of the people at your facility, if they're research faculty using these tools, will not know what you will find out about these tools. You can feel confident that you will be able to show them new things, I can guarantee that.

[ Class Participant Speaking ].

I think we are recording, I could see the mic responding on the screen. Okay. Um, this class is set up online as web pages. If you have done the prerequisite stuff, you've seen the course. You can go to PubMed and then go to NCBI, you can scroll down to education, here. And we have the Medical Library Association course. That's where we will start. It is built in modules. What is that?

[ Class Participant Speaking ].

Go to PubMed, pubmed.gov. Then go up in the left-hand corner and click on the NCBI logo. Okay? Scroll down on the left, you will see education and click on that link there. Okay? And then, um, on the left side is the course. And we're now in the modules. Okay? Everybody here? We're going to start with the molecular biology review. Just like with any other field, a lot of what you learn when you are in a new field is the vocabulary. It's a new language that you learn to speak so you don't have to spend so much time trying to explain what you are talking about. There is a video, let's see. Here's your reading, which you all of course did, right? Okay. It's been a while since I've looked at this. [ Speaker Faint/Audio Unclear ]. I don't know how to get back. Sorry about this. I think we won't --

[ Class Participant Speaking ].

Yeah. Try to keep your sound down, there's going to be a lot of the same thing being broadcast here. It's, um, go to the next. And then just click on one or the other. I guess I could have clicked on windows media player, that might have worked.

We'll keep sharing. All sharing means is that the people in the room can see what you are doing, so.

Right. It still says that you need Adobe flash to play it.

Should you be downloading it?

It should be opening up, actually. Yeah. Let's just skip the video for now. It's just a general introduction to what I'll talk about, anyway. Okay. These are the terms we will talk about. The central dogma of biology is -- does everybody know how to navigate the slides? The central dogma states that DNA makes RNA, makes protein. Um, Francis Crick coined this back in 1958. In 1953 they discovered the structure of DNA. In 1958 they were beginning to look at how DNA is responsible for what is happening in the cell. And so, lost my [ Indiscernible ], um, one of the things that Francis Crick has said is that he regrets naming it the central dogma. At the time he didn't understand what a dogma was when he said that. Um, this is all basically true. But that's not all there is to it. Each one of these steps, I have another slide, when I teach from my own PowerPoint slides I have another slide that shows [ Speaker Faint/Audio Unclear ], thanks. I have another slide that shows all of the things that have been added in here since Francis Crick made this basic statement. In 1970 there was an article published in Nature that said that the central dogma may be a slight oversimplification of the actual state of things in molecular biology. Basically, what we will learn about in this set of things here are these three steps. Okay?

We're going to learn about the structure of DNA, the structure of RNA and how it works. And the structure of proteins and how they work. We will talk about transcription and translation, all right?

It's good to know where all of these pieces fit together. Here's a typical eukaryotic cell. I will explain the difference between prokaryotic and eukaryotic in just a second. This cell has a nucleus. Outside in the cell are other organelles. There's the Golgi, the mitochondrion, the ribosome. Some of these components that are outside of the nucleus also contain DNA. They code for their own DNA. There are interesting evolutionary reasons that we'll talk about later on.

Basically we have two different types of cells, eukaryotic and prokaryotic. Eukaryotic cells are cells that have a nucleus. Prokaryotic cells do not have a nucleus. They're bacteria. They generally have a single chromosome, which is a large circular chromosome within them. Eukaryotic cells may have hundreds of chromosomes, all within this nucleus. A cell was originally described as a sack of liquid. Essentially it is; 95% of you is essentially water. But a lot of the materials within the cell [ Indiscernible ] and each one of those things that you saw in the last slide, the endoplasmic reticulum -- a lot of this machinery that you see in the cell is devoted to that central dogma function, DNA makes RNA, makes protein. One of the differences between a eukaryotic and prokaryotic cell that is somewhat important is that in a eukaryotic cell you make the RNA within the nucleus. The RNA is produced from the DNA in the nucleus and transported out into this area to be translated into protein. You don't have to do that within a prokaryotic cell. So when you are producing your proteins it happens much quicker. You don't have to do this transport. There's not a lot of processing of the RNA. It's a fast process. That's why an average prokaryotic cell under optimal conditions will reproduce every 20 minutes. Whereas a eukaryotic cell, most of the time, the average human cell, will take 24 hours to replicate. Okay?

You hear a lot of talk about the genome. What is our genome? The human genome is composed of all of the DNA within a cell. All of the chromosomes and the DNA in the mitochondrion. That is our genome. Plant cells have another organelle that also has DNA. Their genome is composed of the chromosomes plus the mitochondrion, plus the chloroplast's DNA. Humans have 22 pairs of autosomes and 2 sex chromosomes. Altogether we have 46 chromosomes when they're in the diploid state, one pair of each chromosome. Okay? That means two of each, is what diploid means.

Getting into finer detail here. Within the nucleus most of the time you don't see chromosomes looking like this. They tend to not be so compact until they're ready for cell division. You see a chromosome kind of uncoiled to a certain degree within the nucleus. That's because the function of that chromosome is such that it has to be somewhat uncoiled in order for transcription to take place. Okay?

Here's a chromosome. At one end, at each end, actually, you have a telomere. They're important because of the way that DNA replicates. One of the things that I left out, or that was left out of the central dogma, was DNA replication. If you've got DNA makes RNA, makes protein, you also have DNA makes DNA. So when a cell replicates it has to reproduce its own DNA. It has to reproduce that double strand, okay? Maybe I should go on a little bit before I get into details like this.

The function of the telomere is very important. Especially in states like cancer. Where you have a cancer cell, because there is a protein that is necessary for a chromosome to replicate. If you don't have a protein called telomerase the replication of the telomere won't occur properly. You keep cutting shorter and shorter on that chromosome. You don't replicate all the way to the end. One of the things that has to happen in a cancer cell is that this protein called telomerase comes back on in a cancer state. It allows the telomere to be replicated properly. Otherwise you would chop down until you got into necessary coding regions and the cell would die. There's a structure in the center, or it can be anywhere along the chromosome, called a centromere. This is highly repetitive DNA. This is where, when a cell is undergoing replication, there are microtubules that attach to the centromere area. The two [ Indiscernible ] are pulled apart when the cell divides. Okay? The reason that cells can hold -- if you were to string out all of the DNA in a human cell you would have over six feet of DNA. At the molecular level that's a lot of stuff. The reason that cells can do that is because they have proteins called histones that are responsible for compaction of the DNA into these, um, structures. So DNA winds around histones to compact and be able to fit into the cell nucleus. Okay?

This is what everybody thinks of, this is the double helix. When you think of DNA you think of the double helix. You have two complementary strands. They're each composed of a sugar phosphate backbone and [ Indiscernible ] that pairs together in a certain way. That rule makes it very easy now for us to do matching. It makes it easy for the cell to do matching as well, in order to replicate that DNA. A always pairs with T. C always pairs with G. Let me move on before I tell you about different strands. This is more detail on what DNA looks like. Here you have the phosphate. Here you have the sugar. This is ribose. Here you have the [ Indiscernible ], okay? If you look at the structure here, you can see that there's a difference in orientation. This point on the sugar is pointing down on this strand, it's pointing up on this strand. That gives the directionality to each of these strands. Enzymes that are responsible for transcription and for translation and for DNA replication all pay close attention to this orientation. You can code for proteins with either one of these strands. That enzyme that, um, transcribes the DNA will be going in this direction on this strand for one that codes on this side. Okay? That's a general rule. Five prime to three prime. You also notice here that the AT connection has two hydrogen bonds. The GC has three. That means that DNA that has a high proportion of GC is going to be harder to melt apart than DNA that has a higher proportion of AT. That plays a role in a lot of the molecular biology that is done because, for instance, does everybody know what PCR is? It's a way of replicating large amounts of DNA in a test tube. You have to know some of the conditions for melting the DNA apart to get PCR to work. You need to know about the composition of the DNA to know how to set the conditions for melting.
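As a rough illustration of the pairing rule and the hydrogen bond counts just described, here is a minimal Python sketch. The function names and the example sequence are made up for illustration and are not part of the course materials; counting hydrogen bonds is only a crude stand-in for real melting conditions.

```python
# Pairing rule from the lecture: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Return the partner strand, written in its own 5' to 3' direction."""
    return "".join(PAIRS[base] for base in reversed(strand))


def gc_fraction(strand):
    """Fraction of positions that are G or C."""
    return (strand.count("G") + strand.count("C")) / len(strand)


def hydrogen_bonds(strand):
    """Hydrogen bonds holding the duplex together: 2 per A-T pair, 3 per G-C pair."""
    return sum(3 if base in "GC" else 2 for base in strand)


example = "ATGCGC"  # hypothetical sequence, for illustration only
print(reverse_complement(example))
print(gc_fraction(example))
print(hydrogen_bonds(example))
```

A sequence with a higher GC fraction has more hydrogen bonds per base pair, which is the reason given above for it being harder to melt apart.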

Here's an even more detailed view. Now you can see why it's called deoxyribose. Here's an oxygen attached to this phosphate group. This is how the [ Indiscernible ] always go. In RNA there's another oxygen that comes off here. Okay? One of the differences between RNA and DNA is this. That oxygen is there in RNA, it's missing in DNA. That oxygen plays a big role in how stable DNA is versus RNA. Because that oxygen is sticking up here in RNA, the RNA can't compact as readily. It leaves itself open. RNA is much, much less stable. RNA is hard to work with in the lab; there's an enzyme called RNase that you can boil and it will still function. That is prevalent in many cells. There's a reason for that. RNA is responsible for a lot of what happens in the cell; as a cell is going through differentiation it produces different RNAs as it goes through. If the RNAs were persistent the cell wouldn't be able to move on to the next step, you would still be producing the same protein from that RNA. It's easily gotten rid of in the cell.

Here's where the numbering, five prime to three prime comes from. The numbering is based on the carbons in this sugar molecule. The fifth carbon is the one that the phosphate attaches to up here. The five prime to this three prime direction here. Okay? Any questions?

Okay, let's move on. I have no idea, either. [ Speaker Faint/Audio Unclear ]. [ Laughter ] There is a standard for nucleotide base codes, that's important. You are all librarians here. You know how important it is to have metadata that is standardized. This is for nucleotides. ACGT add in [ Indiscernible ], the cell uses [ Indiscernible ]. Everybody knows these four. There are others; this comes about because of techniques that have been used to look at sequence. Sometimes it's hard to distinguish between different nucleotides when you are running a sequencing [ Indiscernible ]. You have to make a call, you can call it, if you can distinguish between A or C [ Speaker Faint/Audio Unclear ]. That video that we missed shows you what a sequencing gel looks like. We'll talk about why these things happen.
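The standard being referred to here appears to be the IUPAC nucleotide code, in which extra letters stand for "one of several bases". A small Python sketch of that table follows; the helper name `code_for` is made up for this example.

```python
# IUPAC nucleotide codes: each letter stands for a set of possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def code_for(candidates):
    """Return the one-letter code that covers exactly the candidate bases."""
    wanted = "".join(sorted(set(candidates)))
    for letter, members in IUPAC.items():
        if members == wanted:
            return letter
    raise ValueError(f"no code for {candidates!r}")


print(code_for("AC"))    # a base call that could be A or C
print(code_for("ACGT"))  # a base call that could be anything
```

So a position where the sequencing read cannot be resolved between A and C is recorded as M rather than being guessed.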

Okay. We've talked about chromosomes and genomes, what is a gene? A functional and physical unit of heredity passed from parent to offspring. In eukaryotes it's composed of a stretch of DNA that includes things called introns and exons. We think of a promoter region which is five prime to that gene, we have the start of transcription, which is a signal within the DNA, and then we have what are called exons that are coding regions within the DNA. Within eukaryotic cells and certain types of bacteria called [ Indiscernible ] bacteria you also have introns. Introns are sections of the messenger RNA, the RNA that is produced, that are spliced out before that [ Indiscernible ] can be turned into a protein. Okay? It's an interesting evolutionary, um, question, why there are introns in genes. If you look at the archaebacteria it's interesting, they are the only group of prokaryotes that have introns. It's thought they've been put into a separate kingdom completely because of this fact.

Go back one. It says here at the bottom the function of introns is not yet known. That's true. Um, but there are some interesting things about introns. You can actually have a gene that transcribes in this direction that contains this entire piece of DNA. Within this you can have another gene that transcribes in that direction. Structurally it's very complicated, where things can be with relation to one another.

Okay. Here are the regulatory regions. This is the promoter region. There has to be some way of turning it on and off for a gene to be transcribed. It happens in this five prime area of the gene, called the promoter region. These play several different roles. They can open up the structure of the DNA to allow an enzyme called RNA polymerase to come in and start to transcribe. It's downstream from the promoter region. Then you have a downstream region, the three prime end. Sometimes it's complicated here too because you can sometimes have, um, transcriptional activity. You can have transcription factors that bind at the five prime end and the three prime end. This is what is thought of as the structure of the gene, typically. Okay? Once this is produced, once you transcribe that RNA, it's called messenger RNA. So transcription is the process of making RNA from DNA. Many scientists have devoted their entire careers to looking at that one piece. How is this particular gene regulated? What are the signals that cause it to be transcribed? We're getting to the point now where we can start to physically understand what is occurring for a lot of the gene transcription processes and control of the transcription starts and why proteins are expressed differentially in cells. Now we can start to look at how we can drive a particular cell down a particular differentiation pathway. Because we can understand what is happening with these controls of gene regulation. How to turn a gene on and off is important for how a particular cell differentiates. If we have undifferentiated cells that we can direct down any path that we want we'll be able to do a lot of important medical work. Okay?

The messenger RNA has a structure. Within the genomic DNA you have this area where you started transcribing until you hit that first exon. You have this area that is not translated. Then you hit the exon where it is translated. Then you have a variety of different structures for different genes. And the three prime area. The primary transcript of the messenger RNA looks exactly the same as the genomic DNA, in terms of sequence structure. But that structure gets processed. As I said before, these introns get spliced out, they get cut out. The two ends are joined back together to make this one mature transcript. Okay? In the, um, records that you look at in NCBI this mature mRNA is called CDS.

Cells can be very efficient at how they put things together and how they use their energy. One of the ways that they've been very efficient is what is called alternative splicing. So a single gene may produce more than one product. Because you can splice out pieces differently. If I was to splice this piece out completely I would have this transcript down here. If I was to splice this piece out between 2 and 4 I would have this transcript here. Which has left out part of the coding region and part of the functional piece of that protein. This protein that is produced from this messenger RNA may have a completely different function than the protein that is produced from the other messenger RNA. Okay?
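Splicing and alternative splicing can be pictured as keeping some intervals of the primary transcript and dropping the rest. The Python sketch below uses an invented three-exon transcript (uppercase for exons, lowercase for introns); the sequence, the coordinates and the `splice` helper are assumptions for illustration, not anything shown on the slide.

```python
def splice(primary_transcript, exons):
    """Join the listed (start, end) exon intervals, end exclusive, dropping the rest."""
    return "".join(primary_transcript[start:end] for start, end in exons)


# Hypothetical primary transcript: uppercase = exon, lowercase = intron.
primary = "AUGGCC" + "guaag" + "UUUGAA" + "guccag" + "CCCUAA"
exon1, exon2, exon3 = (0, 6), (11, 17), (23, 29)

# Ordinary splicing keeps every exon.
full_transcript = splice(primary, [exon1, exon2, exon3])

# Alternative splicing: exon 2 is removed along with the introns around it.
skipped_transcript = splice(primary, [exon1, exon3])

print(full_transcript)
print(skipped_transcript)
```

Both mature transcripts come from the same gene, but the second one is missing the stretch of coding sequence carried by the middle exon, which is how one gene can give more than one product.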

This is normal function. Many different, I always like to think, if you look at proteins you've got lots of what are called conserved domains. There are functional domains that do certain things within proteins, and they do it within lots of different proteins. It may be a DNA binding domain. There's a particular DNA binding domain, the homeobox, that is used in lots and lots of proteins. If you put that homeobox into a protein with an enzyme here or a particular charge on the protein over here, it will do different things than it does without that charge. So sometimes a DNA binding protein can be a transcription factor that starts transcription. Sometimes it can be something that stops transcription because it's blocking it. Okay? Those functions can be traded back and forth. If you look, especially in a bacterial cell, they're under a lot of pressure to keep their genomes small so they can replicate quickly. In bacteria you will often find four or five different functions that every cell needs to be alive, functions that we have a single gene each for, all in one protein in a bacterial cell.

It's inside of the nucleus. All of the processing for the RNA takes place inside of the nucleus. It does get touched. You have the mature RNA, but there are things that are done to the five prime and the three prime ends of that transcript. There are caps put on the end of a transcript. You add, um, a poly-A tail on the end. There are things done to help to make it more or less stable as it's released out into the cytoplasm.

[ Class Participant Speaking ].

Yes. Yes. It doesn't have to be. That processing could start as it's being transcribed. There are some, um, [ Indiscernible ] genome is relatively small. There are some genes that have introns that are 70,000 bases long. So, because there is a physical amount of time that has to take place for that transcription to occur, it's thought that these long introns may play a part in the timing. If you start production of an enzyme at a particular point, the processing it takes, it's about 60 nucleotides a second that it can transcribe. It can take up to an hour to do 70,000, just to cut out. It seems like a big waste. But it actually is timing. Setting a timing pattern for production of a particular protein that is necessary for one of the steps that has to occur later on.

[ Class Participant Speaking ].

Yes. Yes. It's called processive. It means base by base. Okay, once we've produced our message. Here you see we have our RNA, which is not as tightly coiled because of the extra oxygen in there. Keep in mind that these things have physical structure. They really do need to fit into spaces. They need to have relationships to each other. And they do physically interact with each other. That single oxygen atom plays a big role in the structure of this RNA. That's very important when we start to think about protein structure. These things really do occupy physical space. That's important for what they do. Once we've produced our mature transcript and put it out into the cytoplasm we can translate it. Change it from RNA into protein. That's done according to a code. Every three nucleotides make up one codon, and each codon codes for one amino acid. Okay? So your messenger RNA will have three times as many nucleotides as the protein it produces will have amino acids. Translation is outside of the nucleus, in the cytoplasm of the cell. A bacterial cell can transcribe and produce protein on that same [ Indiscernible ] as it's being produced.

[ Class Participant Speaking ].

The codon is three nucleotides that will be translated into one amino acid in the protein. I think we may have a codon table. Might make it a little easier to understand. Oops, sorry. [ Laughter ] Here's the genetic code, okay? You can see it's a fairly small and simple table. It took a lot of work to put this together. Here we have our nucleotides. Here's the uracil instead of thymine. If we were to read a messenger RNA starting here. If I said I had the UUG. That particular combination would produce tryptophan. Okay? UUG, leucine, I'm sorry. Here is UUG. There are three, um, nucleotides that go into producing that one amino acid. Okay?
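Reading the codon table can be written as a lookup, three nucleotides at a time. The Python sketch below builds the standard genetic code from a compact string of one-letter amino acid codes ("*" marks a stop); the `translate` helper and the example message are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(CODON_TABLE["UUG"])        # L, leucine
print(CODON_TABLE["UGG"])        # W, tryptophan
print(translate("AUGUUGUGGUAA"))  # hypothetical message
```

The lookup confirms the correction made above: UUG is leucine, and tryptophan is the single codon UGG.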

We have what are called start and stop codons. Every protein starts with methionine. At the start of every translation start site there's an AUG. Methionine is always first. There are stop codons. They don't code for any amino acid. Because they don't code for any, no more protein is produced, that's the end of the protein.

[ Class Participant Speaking ].

No. Actually, um, there's been lots of stuff done looking at transcription in the electron microscope. It's difficult to capture because those are delicate processes. The binding of the proteins and things like that, in order to get them down on to a thing that you can see, you disrupt a lot of it. You can capture the images of a messenger RNA being produced from a piece of DNA.

[ Class Participant Speaking ].

Yeah. Well, there was the hypothesis first. When people knew about the structure of DNA they said, okay, how does this play a role in proteins? There had to be a way to do it. Everybody knew there was lots of RNA in the cell. There were small pieces of RNA always present in every cell. I will show you, um, let's go back. We'll talk more about the physical process in just a few minutes. So every one of these transcripts has a start codon. And a stop codon. So you are producing this string of amino acids from this codon set. I always use the single letter amino acid code, it's easier to think about three to one. Each one of these three letters stands for one amino acid. If you look at this you will see that there are some combinations like UCA or UCG, there's redundancy in this coding table. You will also see that oftentimes, if I have, um, CG followed by any of these I will produce an arginine. Oftentimes you will see what is called wobble at the three prime nucleotide in the codon. Because they can code for so many different things. So, the specification of that third base in a codon is not as important as the first two, usually. You can have a lot of mutations within a string of DNA in that third base position that don't change what the protein that it produces is. A lot of, go ahead.
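The redundancy and the third-base wobble described here can be counted directly from the standard codon table. This Python sketch is only an illustration; the helper name `third_base_is_free` is invented for the example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons spell each amino acid ("*" is the stop signal).
codons_per_amino_acid = Counter(CODON_TABLE.values())


def third_base_is_free(first_two):
    """True if all four codons starting with these two bases give the same amino acid."""
    return len({CODON_TABLE[first_two + third] for third in BASES}) == 1


# First-two-base combinations where the third position does not matter at all.
free_boxes = [a + b for a, b in product(BASES, repeat=2) if third_base_is_free(a + b)]

print(codons_per_amino_acid["R"])  # codons for arginine
print(third_base_is_free("CG"))    # CG plus anything is arginine
print(free_boxes)
```

A mutation in the third position of any codon in `free_boxes` leaves the protein unchanged, which is the point being made about mutations that do not alter the product.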

[ Class Participant Speaking ].

Um, it's thought it might have been the other way. RNA is thought to maybe have been the first nucleic acid that was used for life. Thymine simply, I think, matches with the double helix better than the uracil does. It fits in the combinations better.

Okay, so this redundancy means that we have, every cell normally codes for 20 amino acids. Okay? There's a standard set of amino acids that every cell uses. You have 64 codons. That's the redundancy. Here's the wobble hypothesis. I didn't know this slide was here. So, it says here if any of the codons are read backwards most will code for different amino acids. You have to know which strand you are reading from to know what you are going to produce as far as a protein is concerned. When you sequence a piece of DNA you have no idea, first, whether it codes for a protein; second, which of the strands is coding for the protein; and third, which of the reading frames is used. You have lots of different possibilities. You have six different reading frames that you can read from on any given messenger RNA. We could start our reading here. At ACT and produce [ Indiscernible ] or leucine, depending on which frame we were in. Okay? If we're coming from the other direction, the same thing, you're going to produce different amino acids in that chain depending on what reading frame you use and which direction you are going. Okay? So there are programs that will do translation for you, you can put in a sequence of DNA and ask it to translate for you. Generally it will start with all six frames. You have to look for what the most likely possibility from all of the six frames is, based on what you know about the function of the protein, and what you know about how many different codons you may see. Generally with any given piece of DNA the longest region of coding is the correct one. Because there are 3 out of 64 stop codons, you get one about every, on average, you get one about every 18 base pairs. So if it's not selected against you will see a stop codon. If you are out of frame you will get a stop codon more often than you get an extension of a long open reading frame. It's just kind of statistics. 
Out of that group of coding, you have three different stop codons. If you are looking for an open reading frame it's easy to first discard reading frames because they have stop codons that won't allow you to get a long enough reading frame to produce a protein, because they're not long enough. On average one every 21 amino acids.
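The six-frame translation and the "longest open reading frame" rule of thumb can be sketched in a few lines of Python. This is a toy version under stated assumptions: the function names and the example sequence are invented, frames are labelled +1 to +3 and -1 to -3 by offset, and an open reading frame is taken to run from the first M in a stretch up to the next stop or the end of the sequence. The last lines work out the "one stop every 21 or so codons" figure, assuming all 64 codons are equally likely.

```python
from itertools import product

# Standard genetic code in DNA letters, codons ordered TTT, TTC, TTA, TTG, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frames(dna):
    """Translate three offsets on the given strand and three on its reverse complement."""
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    frames = {}
    for sign, seq in strands.items():
        for offset in range(3):
            codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
            frames[f"{sign}{offset + 1}"] = "".join(CODON_TABLE[c] for c in codons)
    return frames


def longest_orf(dna):
    """Return (frame, peptide) for the longest run starting at M with no stop inside."""
    best = ("", "")
    for name, protein in six_frames(dna).items():
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (name, segment[start:])
    return best


example = "GGATGGCCAAATTTTGACC"  # hypothetical sequence, for illustration only
for name, protein in six_frames(example).items():
    print(name, protein)
print(longest_orf(example))

# With 3 stop codons out of 64, a random frame hits a stop about every 64/3 codons.
stop_codons = AMINO_ACIDS.count("*")
print(round(64 / stop_codons, 1))
```

Real tools apply more checks than this, but the idea is the same: frames that are interrupted by stops every few codons are discarded, and the long uninterrupted one is the likely coding frame.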

[ Class Participant Speaking ].

If you are just taking a new piece of DNA and trying to figure out if it codes for something, first look for the first open reading frame, okay?

[ Class Participant Speaking ].


[ Class Participant Speaking ].

There are mutations called frame shift mutations. One nucleotide is deleted. You have shifted the frame, you have shifted that reading frame by one nucleotide. Therefore what that protein will be after that point is different than what it would have been.
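A frame shift is easy to see with a toy example: delete one nucleotide and every codon after that point changes. The Python sketch below uses an invented sequence and an invented `translate` helper, purely for illustration.

```python
from itertools import product

# Standard genetic code in DNA letters, codons ordered TTT, TTC, TTA, TTG, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


original = "ATGGCTCATCGTTAA"             # hypothetical coding sequence
shifted = original[:3] + original[4:]    # delete the fourth nucleotide

print(translate(original))
print(translate(shifted))
```

The first amino acid is untouched because the deletion comes after it, but everything downstream is read in a different frame, and in this example the original stop codon is no longer read as a stop.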

[ Class Participant Speaking ].

It can, definitely.

[ Class Participant Speaking ].

Generally, um, generally not that often. Simply because of the fact, if you've got good sequence you will not see frame shift because you have the entire sequence. If you pick out the longest open reading frame from that you are generally safe. That's all bench work. This is all computer work. So --

Okay. Are there any questions about transcription and translation?

[ Class Participant Speaking ].

There are 20 amino acids in every cell that are used, yeah.

[ Class Participant Speaking ].

Yes, every cell needs all of those 20 amino acids.

[ Class Participant Speaking ].


[ Class Participant Speaking ].

It's not that you have to remember the 20 codes, or that you have to remember how the shift would work. Just when somebody comes to talk to you, you can kind of go -- oh, that's some of the vocabulary [ Speaker Faint/Audio Unclear ]. Not that you have to memorize any of this. [ Speaker Faint/Audio Unclear ].

[ Class Participant Speaking ].

Messenger RNA. Chris mentioned there are different coding tables. Generally those are for mitochondrial DNA. The chromosomes of almost every organism use almost the same code for producing their proteins. Mitochondria sometimes use a different code for producing protein; it's not a major difference. Generally, one of the stop codons has been turned into a coding codon.
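As a concrete case of a different coding table, the sketch below compares the standard code with the vertebrate mitochondrial code, in which TGA codes for tryptophan instead of stop, ATA codes for methionine, and AGA and AGG are stops. The variable names and the short test sequence are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Vertebrate mitochondrial code: start from the standard table and change four codons.
VERTEBRATE_MITO = dict(STANDARD)
VERTEBRATE_MITO.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(dna, table):
    """Translate complete codons from the start of dna using the given table."""
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


dna = "ATGTGAATAAGA"  # hypothetical sequence containing the codons that differ
print("standard:     ", translate(dna, STANDARD))
print("mitochondrial:", translate(dna, VERTEBRATE_MITO))
```

The same DNA gives a different protein depending on the table, which is why translation tools ask which genetic code to use.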

Okay. We've gone through that work to get to this point. DNA makes RNA, makes protein. The same with chloroplasts. It's only mitochondria that seem to be different.

[ Class Participant Speaking ].

Yeah. Here's our goal, to produce a protein. Proteins are the functional units of the cell. They do a lot of the work. They make the structure. They're the enzymes that do the physical labor in the cell. So what are they? How do they get put together? A protein is a polymer of amino acids. It's a string of amino acids grouped together. What is an amino acid? It is a chemical structure that looks like this. Each amino acid has an amino group, an acidic group and an R group. It's the R group that determines the properties of the amino acid. It's the R groups, as they're aligned, that determine the essential properties of any protein. Those R groups can vary a great deal, in terms of size and properties.

Here's our standard nomenclature for amino acids. You don't have to memorize this. Amino acids have different properties. They are usually based on several things. One is whether they're basic or acidic. Another is whether they're nonpolar or uncharged polar. Um, another is simply size. Some R groups are quite large.

Here you can see these are actually not to scale. Here you can see some of the differences: tryptophan has this large R group sticking out here, versus something like alanine which is here. It's a CH3, simply. Okay? [ Indiscernible ] and [ Indiscernible ] are the acidic. Again, you don't have to learn this. Just wanted you to see the variety of R groups here. Some of these things have the properties of actually affecting the structure of the entire protein. Something like proline is an amino acid that causes the protein to be shifted in structure. Those kinds of properties are good to know a little bit about, because, if you are looking at certain functions, proline can disrupt some structures. If you've got a substitution of a proline for a leucine, for instance, there's a leucine zipper, it fits into DNA as a binding protein. If you stick a P in the place of an L it makes a kink, it can no longer fit into the DNA groove, it doesn't bind anymore. You can predict when you look at a sequence of amino acids, if you've got a SNP, you know that it codes for something. You know that the SNP has changed the amino acid that would be there. You can predict some of the differences and predict why you might see a [ Indiscernible ] mutation within.

I told you a little bit about conserved domains. A domain is a functional subunit of a protein. We're going to talk some during this whole course about a protein called MutL. This is a protein that is implicated in colon cancer. It is a DNA mismatch repair protein. If you get a mutation in this protein you tend to accumulate mutations in that cell a lot more than you would otherwise. One of the conserved domains is this mismatch repair domain here.

We'll go into some of these. The Cn3D program is a way of looking at these. Here we have the DNA mismatch repair domain from a variety of organisms. So human, then [ Indiscernible ], mouse, fruit fly, [ Indiscernible ], the mustard plant, [ Indiscernible ] the bacterium that causes Lyme disease, you can see the alignment between those. This top one I believe, I can't tell, it's alignment. You can see the conserved regions within these proteins. That conservation is not based on identity, it's based on amino acid properties. Okay? Sometimes conservation is identity. Sometimes it's properties.
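The distinction between conservation by identity and conservation by property can be sketched in Python. The grouping below (nonpolar, uncharged polar, acidic, basic) is one common simplified textbook classification, not the scoring used by any alignment tool, and the three aligned fragments are invented for the example.

```python
# Simplified amino acid property groups (an assumption for illustration).
PROPERTY_GROUPS = {
    "nonpolar": set("GAVLIMFWP"),
    "polar": set("STCYNQ"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def group_of(aa):
    """Return the property group name for a one-letter amino acid code."""
    for name, members in PROPERTY_GROUPS.items():
        if aa in members:
            return name
    raise ValueError(f"unknown amino acid: {aa!r}")


def classify_column(column):
    """Label one alignment column as identical, similar (same group), or variable."""
    residues = set(column)
    if len(residues) == 1:
        return "identical"
    if len({group_of(aa) for aa in residues}) == 1:
        return "similar"
    return "variable"


def conservation_line(aligned):
    """One symbol per column: '*' identical, ':' same property group, ' ' variable."""
    symbols = {"identical": "*", "similar": ":", "variable": " "}
    return "".join(symbols[classify_column(col)] for col in zip(*aligned))


# Hypothetical aligned fragments from three species (no gaps, equal length).
aligned = ["RKDLEG", "RRDIEK", "RKELDS"]
for row in aligned:
    print(row)
print(conservation_line(aligned))
```

Only the first column is identical across all three rows, but the next four still hold residues with the same kind of R group, which is the sense in which a domain can be conserved without the sequences matching letter for letter.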

[ Class Participant Speaking ].

Okay. Um, let me see if I can pick one out that's -- here you have, so, um, at this position you've got --

[ Class Participant Speaking ].

This is arginine here. Except for this [ Indiscernible ] here. Those are, this is a conserved domain. You're going all the way from human to bacteria here. This alignment is covering a huge amount of evolutionary distance. Each line there is a different species. Yeah. Anytime you see an [ Indiscernible ] number or a GI number there's a new sequence. We'll talk a little bit about how to look at the structure of a record and how to look at the results that you see; we'll discuss that later.

Proteins have structure. That structure is very important. They have four different types of structure. The primary structure, which is simply the sequence in that polymer. You have secondary structure. Secondary structure is the difference between two different types of structure, beta pleated sheets or [ Indiscernible ], simple structures within the protein. Tertiary structure is how that protein is folded in upon itself. Quaternary is how two proteins that are folded interact with each other. Oftentimes a functional unit may be composed of two or more different strings of amino acids, different proteins. Some enzymes will have four or five different strings of amino acids put together to make a functional unit. Hemoglobin is composed of four different globin proteins based around an iron center.

[ Class Participant Speaking ].

Some would call it a single protein. Basically, some will call this a peptide and this or this the protein. It can vary a little bit.

[ Class Participant Speaking ].

Um, both of those, actually. And another one. Because of the physical properties of the amino acids, some are charged, some are not. They will fold in relation to each of the properties and the environment that they're in at the time they're translated. Some get translated into the endoplasmic reticulum and are going out to be in the membrane of a cell. That environment for making the protein has more polar charges in it. It will cause the protein to fold differently than if it were just in the cytoplasm. You can't just take a string of amino acids in a peptide and predict how it will fold. We're getting close. There's a contest every year: somebody solves a structure of a protein. They don't publish it. They give that amino acid string out to groups who want to compete for this prize. They make a prediction of what that structure should look like. Then they bring out the structure and compare the results. Most of the time they're up to about 60% right. But I think as time has gone on we're getting closer and closer to being able to solve the structures from the amino acid sequence itself, but not yet.

You can .

Okay, so the human genome includes about 30,000 genes, and as we've seen, it can produce more proteins than 30,000, because of alternative splicing. Gene expression, turning on and off of those genes, is a very specific process that occurs in each cell, and that's what drives cell differentiation. Here you can see in a brain cell, we've got three different proteins, epinephrine, myosin and amylase. Epinephrine is expressed in the brain cell, myosin and amylase are not. How many of you know, epinephrine is a hormone, myosin is a protein that's found in your muscles generally, and amylase is an enzyme that breaks down sugar.

It makes sense that you're going to produce this, epinephrine is a hormone that causes, actually is epinephrine the same as what? Anybody know? What is epinephrine? I've forgotten! It's a stimulant of some kind. I've forgotten what it is! But it makes sense that it goes to the brain, if you knew what it was, I'm sure that it would make sense. Myosin is produced in the muscle cell, and amylase, so your salivary glands produce the protein amylase, it's secreted in your saliva and that's what breaks down the sugars; sugars are broken down already in your saliva, before you swallow, and NCBI has some very nice tools for looking at gene expression, we'll show you those a little bit later on.

So genes can be turned on or off. Genes can be expressed highly or a little, and each of those things can be controlled in certain ways to cause a cell to differentiate into a particular cell type or to cause it to function in a particular way. Here you see Gene A may be expressed in both muscle cells and nerve cells, higher in muscle. Gene B may not be expressed in muscle or nerve but be expressed in skin and Gene C may be expressed in all three. You've got different combinations of these 30,000 genes that are going to determine the unique properties of each cell.

You can also have different gene expression at different times. Now, as I said, oftentimes, cancer cells express genes much differently than a normal cell does, and this particular gene may be a polymerase here. You may have turned it on as one of the properties of that particular type of cancer to allow that cell to go through cell division.

It may be one of two things. It may be that because it's expressing in a cancer cell, it may be that there's another gene expressing that it interacts with, that it normally wouldn't interact with in a normal cell, or it may be that you spliced that particular protein so it's got a different [INAUDIBLE] in it.

Are we going to take a break? You guys ready for a break? Okay, what else do we have to do this morning? I've forgotten. It's time for a break!


Okay, let's take, um, 10 minutes? Is that a good break?

Okay, I think we're going to resume here.

Okay, let's get started. Chris pointed out to me during the break that because we didn't watch the video, there are some terms that were used on some of the other slides that you may not have known anything about. If you see things on the slide that you don't understand what they're referring to, ask, okay? Don't hesitate to ask. If I don't explain it, make me explain it.

I think we should, hopefully we'll be able to watch the video after lunch, so you'll see at least see what you should have seen earlier.

Okay, we're going to talk a little bit more about sequence, but from a different way. We're going to talk about variation. Variation in sequence. Genetic variations are differences in the DNA sequence due to any number of things. Mutation is what causes genetic variation. Genetic variations that occur in more than 1% of a population would be considered useful polymorphisms for genetic linkage analysis. Linkage means what chromosome a particular trait maps to.

There are different types of genetic variation. There are single nucleotide polymorphisms, generally referred to as SNPs, and there are full databases of SNPs out there. There are small-scale insertions and deletions, and there are a great number of those. The work on looking at the relationship between chimpanzees and humans showed we were about 98% identical, but that was based upon DNA-to-DNA hybridization, so taking human DNA and chimpanzee DNA and putting them together and looking at what percent stayed single-stranded and did not hybridize. However, those small-scale insertions and deletions, there are a lot of those in both of the genomes, and the hybridization did not show them, so we're actually not quite as related to chimpanzees as we thought. We're about 97%. {LAUGHTER}. So it's a little bit lower but it's not too bad.
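The effect of insertions and deletions on a similarity figure can be illustrated with a short Python sketch. The two aligned strings are invented and the function name is hypothetical; this is not how the chimpanzee and human comparison was actually computed, only a demonstration of why counting gap positions lowers the percentage.

```python
def percent_identity(a, b, count_gaps):
    """Percent of aligned columns that match.

    a and b are aligned sequences of equal length with '-' marking gaps.
    If count_gaps is False, columns containing a gap are skipped entirely;
    if True, they are counted as compared columns that do not match.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        has_gap = x == "-" or y == "-"
        if has_gap and not count_gaps:
            continue
        compared += 1
        if x == y and not has_gap:
            matches += 1
    return 100.0 * matches / compared


# Hypothetical alignment: one substitution and one two-base insertion/deletion.
seq_a = "ACGTACGT--AC"
seq_b = "ACGTTCGTGGAC"

print("ignoring indels:", percent_identity(seq_a, seq_b, count_gaps=False))
print("counting indels:", percent_identity(seq_a, seq_b, count_gaps=True))
```

With the gap columns left out the two sequences look 90% identical; once the inserted bases are counted as differences the figure drops to 75%.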

There's polymorphic repetitive elements, and that's what's used in many of the court cases that you hear about. Where they're looking for, this is used for forensic evidence a great deal. And there is what's called microsatellite variation. We'll talk a little bit more about that later.

Mutation is a permanent transmissible change in the DNA sequence. It can be an insertion or a deletion, of a single nucleotide or a large chunk, or it can be an alteration, and there are different types of alterations that can occur, and I'll show you some of those.

Here are some of the different types of mutations. You can get deletions of whole pieces of DNA, so it simply shortens that chromosome. That's a pretty major mutation. You can get duplications of entire pieces of chromosome, and you can get what are called inversions, where this piece of chromosome comes out and gets twisted around and reinserts in the opposite orientation, so it simply flips over. That changes the relationship of the genes on either side of those insertion points.

You have insertions where you take a piece from one chromosome and insert it into another chromosome, and you can get translocation where you take an end piece from one chromosome and insert it into the end piece of another, and oftentimes you'll get, this is called a reciprocal translocation. You've exchanged the two ends of these pieces of the chromosome.

So, alleles refers to genes, not whole chromosomes.

[Class Participant Speaking]

Yeah, that term is often used loosely, because sometimes scientists will refer to a translocation as an allele. It is not really that, it is simply a mutation. Allele refers to variation within a gene.

[Class Participant Speaking]

Yes, large pieces of the chromosome can be taken out, lost, reinserted in different orientations. This happens because of a process called recombination. These two strands of DNA, the two pairs of chromosomes that you have, actually exchange pieces on a somewhat regular basis, so every time a new generation of people is born, there are new combinations even within a single chromosome because of what's called recombination between those chromosomes, but in order for recombination to occur, you have to make breaks in the chromosome to exchange those pieces, and sometimes if those breaks are not repaired properly or they occur at a time when the cell is undergoing a process, like pulling apart, you lose pieces of chromosome, or you can exchange because you have broken one here and broken one up here, and then it comes together in this way instead of reinserting as an exchange. So recombination is generally one of the big reasons that you get those kinds of big chromosome mutations. Single nucleotide polymorphisms and the smaller mutations can be due to any number of things: you could be exposed to a chemical that could cause mutations, if you smoke it can cause mutations. Sunlight can cause mutations, and you can see that with basal cell carcinoma.

So, let's take a look at sequence mutations. If we look at what can happen in a sequence, so think of these as your codons, as they line up, "The fat cat saw the red dog". If we insert a letter in front but we keep our codon reading, you see that you're going to throw off that reading frame a great deal. It becomes a nonsense sentence. And those are actually called nonsense mutations.

If we take the F out, you see that you also change the reading frame here, and it's another nonsense mutation. Or if you translocate the E and the T, it only affects that one codon, but it does change that sentence.
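The sentence analogy can be reproduced with a few lines of Python. This is only a toy: the helper name is made up, and the letters stand in for nucleotides read three at a time.

```python
def read_in_triplets(letters):
    """Split a string into consecutive groups of three, like codons in one frame."""
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


sentence = "THEFATCATSAWTHEREDDOG"

inserted = sentence[:3] + "A" + sentence[3:]   # one extra letter after the first word
deleted = sentence[:3] + sentence[4:]          # the F removed

print(read_in_triplets(sentence))
print(read_in_triplets(inserted))
print(read_in_triplets(deleted))
```

After the insertion or the deletion, every group from that point on is shifted by one letter, which is what happens to the codons downstream of a single-nucleotide insertion or deletion.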

Down here, it talks about how the most common mutation in cystic fibrosis is the loss of a single codon, so it's a loss of a single amino acid out of a gene that contains 250 thousand nucleotides.

This is the expanded list of terms. We aren't going to go through these today. But you can use these to go out and explore on your own, if you want to learn more. And the P and Q arms of a chromosome, how many of you know what that is? The centromere is not exactly in the middle of most chromosomes, so you have a short and a long arm of a chromosome. One arm is called P and one arm is called Q, for short and long, and the banding on a chromosome is numbered, so P21 would be the 21st band on the P arm of whatever chromosome they're looking at. I don't remember exactly, I don't use it anymore. It's not used that often. If you look at the map viewer, you'll be able to see which one is which. P is short?

[Class Participant Speaking].

Oh, right, right. I think they're counting from the centromere out.

[Class Participant Speaking].

Those are the bands, so if you look at human chromosomes generally they're stained with a stain and that stain is actually staining the DNA where it's compacted differentially on those chromosomes, and so you're picking up, it's different proteins that are aligned on that DNA and you're picking up the differential staining because of the proteins that are aligned there, and it's characteristic for each one, so it's characteristic for each species, so you've got banding patterns that you can identify. I don't know if I'll be able to show this. I guess not.

Okay, so here is your first hands-on experiment. Try using NCBI's ORF Finder to translate the following sequence into protein. So open up your computers, get to this slide.

[Class Participant Speaking].

Can you translate what that means?


Something between a person?

So this is the human version of a Drosophila gene. Okay?

Why would you call it that?

Because when you do a sequence alignment, it was originally found in the fruit fly, and then they found it in humans based upon the sequence that was cloned in Drosophila, so the name stands for, it's a circadian rhythm gene, so this is a gene that's responsible for a 24-hour cycle in the fruit fly. It's also responsible for the frequency of their wing beat in their courtship mating. We have the same thing in humans and the mutations in the human genes do cause circadian rhythm problems, so people who don't sleep well can have these mutations.

I'm going to PubMed.gov first.

[Class Participant Speaking].

That's all that it's supposed to do. Go to NCBI up here. Click on the logo. Let's turn this over, I think. And then click on NCBI. Okay, so now go down to education on this left-hand side.

It's under education. And what we want is the Medical Library Association. And then we want the modules and then we'll go over here to the slides and we'll do --

[Class Participant Speaking]. What was that?


There you go. So now what you want to do is open this up, it will open in a different tab, or a different window and you want to cut and paste that sequence into that tool .

Okay, everybody got a result?

Which one do you think is likely to be the right frame for this coding sequence? You can see how this is the longest open reading frame that was found, but there are stop codons that come up in the reading frames here and here and probably quite early. You have to have at least a hundred amino acids to draw something. This is actually, especially when I'm being recorded, {LAUGHTER}, I hate to criticize NCBI for their tools like this, but this is actually not the best one of these out there. There are lots of these; the European Molecular Biology Laboratory has a better translation program that would actually give you the amino acid sequence so you could see, you get to pick which is the starting translation and it will give you the sequence right out of that for the message.

[Class Participant Speaking].

Right. That's because you're simply picking out that open reading frame here, okay? This is the -- we haven't put anything in here to identify this particular sequence, so that's why you're not seeing anything up here. It simply gives you back the sequence that you put in, but it puts it back in the standard NCBI format. So it doesn't identify it because we haven't done a sequence alignment with it.

Now --

[Class Participant Speaking].

You'll see why they don't do that. That's tomorrow, actually.

So, these are the strands that you're reading on, so you're reading from both strands, so it's plus and minus strands, or the Watson and Crick strands, whatever you want to call it, plus or minus, coding and non-coding.

[Class Participant Speaking].

It's looking for codons, so all it's doing is translating those codons in every frame into amino acids and seeing which one gives you the longest open reading frame.

[Class Participant Speaking].

Well, the database is what I showed you, that little table of codons.

[Class Participant Speaking].

That's right.

I don't understand --

Those are blank because up here, it specifies that you have to have at least 100 amino acids in a contiguous sequence in order to get it to list.

And so the first letter, this link represents the whole link --

What you put in and so the first reading frame here would start with the codon with 1, next one would start with 2 and the next one would start with 3 and in those first two, it didn't find open reading frames that were at least 100 amino acids long. Maybe it's not 100 amino acids, maybe it's a hundred nucleotides long, but that's not a very long sequence.

Okay, any other questions about the open reading frame finder? So we're looking to answer, first you want to answer , does this piece of DNA code for a protein. Right?

[Class Participant Speaking].

Yeah. She just wants to know exactly what we did and why we did it, okay? So we're looking to see if this particular piece of DNA can code for a protein, okay? That's the first thing that we want to know. This says yes. Okay? So when you stick it in there, then it gives you that graphic back. What this means is, okay, so in these first two, let's see if we get anything different if we change that, so in the first two, you had stop codons that kept you from reading across all the way, so there were no contiguous sequences that gave you that length that was required for the minimum to put it in. I changed it down here to 50 and so now I can see more. The reason that we think that this one is the best is because it's the longest out of those six. So this is a continuous length of amino acids starting with this nucleotide which would be probably, I have to show you, which would be this right here and reading in this direction. Okay?
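The effect of the minimum-length setting can be sketched in Python. This is a simplified illustration and not the algorithm NCBI's ORF Finder uses: it assumes an open reading frame runs from the first ATG after a stop to the next in-frame stop, it ignores reading frames that run off the end of the sequence without a stop, and the function names and the short example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; leftover bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_aa=100):
    """List (strand, frame, protein) for ORFs of at least min_aa amino acids.

    An ORF here starts at the first M after a stop and ends at the next stop.
    Results are sorted longest first.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            start = None  # position of the first M in the current open stretch
            for i, aa in enumerate(protein):
                if aa == "M" and start is None:
                    start = i
                elif aa == "*":
                    if start is not None and i - start >= min_aa:
                        orfs.append((strand, offset + 1, protein[start:i]))
                    start = None
    return sorted(orfs, key=lambda orf: len(orf[2]), reverse=True)


# Hypothetical short sequence; real inputs would be far longer.
example = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print("min 100 aa:", find_orfs(example, min_aa=100))
print("min 5 aa:  ", find_orfs(example, min_aa=5))
```

With the threshold at 100 nothing is reported for this short input, and lowering it lets the shorter reading frame appear, the same effect as changing the minimum length in the tool.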

[Class Participant Speaking].

Right. And that's an important point. All of this, the work that you do on the computer, is simply going to help you determine what experiments to do in the lab. These are not answers. These are not things that you can say, okay, this is true and this is what I wanted. You have to go back to the lab and verify it. No matter what you do in silico, you've got to go back to the lab and verify it.

[Class Participant Speaking].

Right. So this would be the first nucleotide, the second nucleotide , the third, where you start the reading frame or from this end, okay?

[Class Participant Speaking].

It's the right one? Then you'd take that, your sequence and you can take that amino acid sequence, you now have an amino acid sequence that's that length. You can run it in lots of different tools to determine the properties of those, of that peptide sequence. Okay? I'll show you, we'll show you some of those tools as we go along, and you'll see the kind of things that you can do with it.

So, there are three fields of study that have come out of a lot of this recent work. Genomics, proteomics, and bioinformatics. Genetics has been around a lot longer, it's one of the basic fields of biology, but these other three are fairly recent and fairly new. Genomics is the study not just of single genes but of the whole function and interaction, and out of this kind of capability has now come a new field called Systems Biology, where you're starting to put together all of the gene interactions into pathways that you can look at, and how those pathways are regulated, so Systems Biology, you may have a Department of Systems Biology now at your place of work.

This is the definition of genetics, the study of single genes. Okay, proteomics is the study of the full set of proteins that are encoded by a genome. Bioinformatics has lots of definitions and it can be an extremely broad term because it's an extremely broad field, and a lot of people now refer to bioinformatics, not just the molecular biology part but they're bringing in medical as well and calling it biomedical informatics. We now have a Center for Biomedical Informatics.


[Class Participant Speaking].

The Director of the library talks about the bibliome -- bibliomics.


Because he does a lot in the literature. So if you look at bioinformatics, it's an interesting mix of things because there are the people who are on the hard side. They build those tools that everybody is using. They do the analysis to put together those tools and put them out there for people. Bioinformaticians are actually probably one of the most open-source groups of people that you'll find anywhere, because almost all of them build their tools and just stick them out there on the web for anybody to use, so there are now probably close to 4,000 different databases out there with different things in them that you can use, and all of these different tools to analyze those databases. People are just sticking things up all the time, and it's really cool, but it's also very complex, knowing which of those tools are good and how to use them. I think librarians will be very good at that, if we can get people up to speed in learning how and why people would want to use these tools, so we're good at analyzing things. We're good at looking at how databases work and why they work the way they do, and I think that's one of the areas that could really help people, to put up a set of these tools on your websites that you recommend for people to use, and I'm starting to do that. It's hard because they change so fast, but I think it's important, and I was actually thinking about doing a survey of my faculty just to see the variety of tools that they are using in their own labs, and put up a list that says, "Here are the tools that Harvard faculty use".

And also the other side of things, where the majority of people who use these tools have no idea how they were built. It's like most of us driving a car. You don't necessarily know how that car runs, you don't know what all of the computer pieces are that go into making that car run. You don't know how the carburetor works particularly, but you still can drive it, so I think what librarians kind of have to act as is the mechanic. We have to be there when they have to understand what the tool is doing because they're running into a problem with it, but we have to be able to interpret what those builders meant for that to do and what the people who are using it are using it as. So we're kind of the intermediary. That's how I place myself in this area.

Okay. This is the end of this section. Remember, we covered this this morning? Yes, okay. So we're going to do quickly, types of databases. We're running behind.

At NCBI, and out there on the web, there are lots of different types of databases and if you think about it in one way, you can put them into these data domains. So they're based upon what they contain, and so it's the same kind of thing that we've already looked at, nucleotide sequence, protein sequence, 3D protein structures, so we've looked at all three of those things, and complete genomes and maps. So these are our domains and you now also have gene expression which is becoming much bigger. I'm sure most of you have heard of microarrays at some point, so lots of labs are running microarrays for thousands of different reasons. A lot of the gene expression data is coming from those types of analysis with microarrays, and they're looking at SNPs, genetic variation, polymorphism, SNP databases.

You can also think of them by the scope of what they cover. So these three databases, GenBank, EMBL, and DDBJ, are part of an international consortium that pool their data on a daily basis, so they synchronize their databases daily to make sure that everyone contains the same information. They do different things with that information, but the basic database is the same across all three of them. And then they build their tools to do the analysis of what's in those databases, so GenBank this year is going to celebrate its 25th anniversary. It's been around for 25 years.

You also have protein databases, so Protein is NCBI's tool. If you go to NCBI, you can see here on the drop-down list, you've got protein, nucleotide and core nucleotide, okay? The core nucleotide database is the one that you probably would use if you were looking for a messenger RNA for something, although I have to tell you, we're going to spend a lot of time on these different databases, but there's one database that you should know and that's Entrez Gene. That brings all of this other stuff together, and I think if you don't know any other databases, if I go to -- I misspelled it.

The reason that you want to concentrate on Entrez Gene, and direct the people that are asking the questions to it, is because this is where NCBI is really doing its work in curation, in putting in the metadata that people need. If you look at the history of GenBank, GenBank started off as a place where people could deposit their sequence data. So that meant that anybody could put sequence in there. There was no control on what metadata was added to those sequences, and over time, it's kind of become a mess. If you do a search in GenBank for something, you're going to find lots and lots of redundancy, lots of sequence that may not be as good as you want it to be, so there are problems with GenBank that NCBI is now in the process of fixing, and the way they're doing it is to go in for each one of the genes, pull out what they consider to be the best representative sequence for that gene, or set of sequences for a transcript that has multiple variations, bring it in here, put in the coding sequence, put in the protein sequence, and link it with all of the other databases that they have over here. Plus, lots of outside databases, so if you go here, you'll see what I looked for was the aminotransferase, which is an enzyme that deals with amino acids and it puts amino acids on to transfer RNA, and I forgot to talk about how translation actually works but we'll get to that when you see one of the videos later.

So, if you look here, you'll see that the name that came up is actually "GOT1", so we got one. And it brings in, so it talks about here is the organism, you can link it out to the taxonomic database from here to see the related sequences, the related genes in different databases. It tells you the entire lineage for any organism that's in the database. It gives you a summary of the gene function right here. Any other name that it's known as in the database is here, so if you search this by an old name, you should come up with the new name and all of the information about it. It tells you the genomic structure of this gene, it gives you the chromosomal context of the gene, it tells you which chromosome it's on, which band it's in, and what the relationship is to other genes surrounding it, so here you see the direction of transcription, relative to other genes surrounding it, and one of the cool things that I like is GeneRIFs, which is Gene References Into Function. Here they allow authors to put in, if they feel like they've got a paper they want to submit here to talk about the function of this protein, then they can put it in here and NCBI looks at these papers and decides if that's appropriate and they pick them up for people to see. So each paper that's here is supposed to actually talk about the function of this particular protein.
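A search like the one shown on screen can also be expressed as a URL against NCBI's E-utilities `esearch` service. The sketch below only builds the URL and does not contact NCBI. The helper name is made up, and the choice of the `[Gene Name]` and `[Organism]` field tags and the `retmode=json` parameter is an assumption about how one might phrase this particular query, not something taken from the class.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(symbol, organism):
    """Build an esearch URL that looks up a gene symbol in the Gene database."""
    term = f"{symbol}[Gene Name] AND {organism}[Organism]"
    query = urlencode({"db": "gene", "term": term, "retmode": "json"})
    return f"{EUTILS_ESEARCH}?{query}"


url = gene_search_url("GOT1", "Homo sapiens")
print(url)

# Decode the query string again to show what was sent.
print(parse_qs(urlsplit(url).query))
```

Opening the printed URL in a browser should return the matching Gene record identifiers, which can then be looked up on the Gene pages described above.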

They have links to markers that are on the chromosome surrounding this gene. They have links to pathways and the KEGG pathway. KEGG stands for Kyoto Encyclopedia of Genes and Genomes, and --

[Class Participant Speaking].

This is a pathway, so you can see that you can get into the pathways.

[Class Participant Speaking].


Actually, they're over here. Do I have to go back to that?


Okay, I see it. Okay, so I think, I'm using Firefox and the default on this computer is Internet Explorer.


Okay. We'll go through more of this like Chris said later on, but these are tools that your faculty, first of all, aren't going to know are here for the most part. They will not know this database, because it's fairly new, and not a lot of people use it. They go to their regular old, look for the nucleotide sequence, and there's the nucleotide database and that's what they bring up, okay? But those sequences can be brought up, switch back and forth, is that going to cause a problem? Can't you just open this up in Internet Explorer then? {LAUGHTER}. Doesn't matter, I lost it anyway.

[Class Participant Speaking].

Thank you.

Have them call.

Okay, we're back. So, we're going to go through some of these other databases, and what we want you to learn from these is what kind of information is in them and how they have been treated as we go through, so that when you come to a resource, you can choose the correct database, because that's part of knowing how to use a tool. It's part of the searching. Again, most of the faculty have no idea that there are different databases behind each of these tools that they can choose, and that they will get different results from.

Okay, so we have the protein databases. Entrez Protein. Swiss-Prot is a database from EMBL, originally put together based upon proteins that had actually been sequenced as the amino acid sequence rather than as a nucleotide sequence and translated. So, it was the prime database up until it couldn't keep up anymore because people aren't sequencing amino acid sequences anymore. So now, it's called Swiss-Prot/TrEMBL, which stands for Swiss-Prot, the original database, plus translated EMBL database, so they put the translation back into this database, but it's still a very high quality database of protein information.

The Protein Information Resource is also quite good and UniProt is also quite good, so the protein databases tend to be very well characterized and fairly good, but I think again, we're going to show you some databases that you're going to want to let your faculty know about for them to go in and pull out the sequences they're going to use.

There are specialized databases for protein structure. MMDB includes everything that's in the Protein Data Bank, plus data from some other places that's pulled in, and it's the Molecular Modeling Database. This is what's used by NCBI to pull up its structures when you look for protein structures. And Entrez Genome is the only one talked about here, but there are lots of databases out there as well, and I think we'll probably show you a couple of other ones. There are particular aspects of NCBI's database that I like and there are particular aspects that make it very hard to use, and so we'll switch back and forth between the different tools that are better for particular purposes.

[Class Participant Speaking].

If you look at, I'm going to actually go back there, if you look at, if you go to NCBI and you pull-down the list here, it's about quarter or half the way down on that screen. So you can see that you can pull that up. Any of these databases if you want to see them just as the entrance point, if you just hit the return in the search box with that up there, you'll see the entrance point to the database.

There are also organism specific databases. You'll see lots of those out there. And I'm sure you've seen some of them. We were just talking about FlyBase at the break. That's one of them. And there are categories or functions; these are TRANSFAC and the Vector database. TRANSFAC is a tool where you can look up where a transcription factor would act and which factor may be regulating your gene of interest; it will tell you sequence information about where a particular factor binds. And here are some of the terms that we missed from the video. EST stands for Expressed Sequence Tags. GSS: Genome Survey Sequences. STS: Sequence Tagged Sites. And High Throughput Sequences. We're going to look at what those are and where they come from and why they're special.

[Class Participant Speaking].

The other issue is there's a public version and a commercial version, and some are licensed for a more commercial version, but you can use it with other groups, so if you do selection in the library and you're wondering about collection development, [INAUDIBLE], which package is the decision, and until you start using the public ones you may not know what the difference between the value of the public and private is, so it could be that you need to find a register and explore that, because eventually the question will come, well, what am I missing by not having the other version, and if you're doing collection development, you want it there.

I think that's fine. That's one of the things also we'll talk about on Wednesday, there's a lot of, I mean on Friday, {LAUGHTER}.


There's a lot of interest in what tools, what should a library be supporting, and so if you are a library that plans to do bioinformatics support then you might think about licensing some of these databases because they are among the best of the tools that are out there. It is, every January, there's a list of databases, no, is it every January?

[Class Participant Speaking].

January is the database list, and then the web resources is in July I believe, first issue in July. I can't remember.

[Class Participant Speaking].

What the?

It's the web resources that are out there, so tools that are out there for free that people can use. Bioinformatics. January is the databases, July is the web resource tools.

So some of these things, like ESTs, are not talked about quite so much anymore as they were originally. Expressed Sequence Tags comes from the fact that you can take a messenger RNA from a cell and, using an enzyme called reverse transcriptase, turn it into DNA, and once you've got that piece of DNA, you can clone it, and you can sequence from the ends of that message, okay? So that means that we've got tags now for that particular cell type that we can use to go in and do a number of things. One is you can look at all of the things that are expressed in that particular cell type from the sequence. You can also use those as markers within a cell so that you've got a molecular handle on a particular place within a chromosome, and these were used a lot in the beginning of the human genome sequencing project.
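As a rough illustration of the idea just described (message copied into DNA, then read from the ends), here is a minimal Python sketch. The function names, the toy sequence, and the tag length are all hypothetical, and real EST reads come from cloned, double-stranded cDNA rather than from a simple string operation.

```python
# Toy model: an mRNA string is "reverse transcribed" into its complementary
# DNA strand, and short reads are taken from each end as stand-ins for ESTs.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(mrna):
    """Return the complementary DNA strand for an mRNA, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


def end_tags(cdna, tag_length):
    """Return short reads from both ends of a cDNA (a toy stand-in for ESTs)."""
    return cdna[:tag_length], cdna[-tag_length:]


if __name__ == "__main__":
    cdna = reverse_transcribe("AUGGCC")
    print(cdna, end_tags(cdna, 2))
```

The point of the sketch is only that a tag is a partial read: it identifies a message without covering all of it.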

The same with Sequence Tagged Sites. These were short pieces of sequence that you could go in and use as a way of getting into a particular region of a chromosome. Okay? So, originally, the ESTs were included in GenBank as part of GenBank, but because so many ESTs were put out there, it used to be that when you did a search against GenBank, all you would get back was ESTs, which aren't necessarily that useful for a general purpose thing, so they put the ESTs into a separate database now. So if you want to look for ESTs to use as markers with a chromosome, you can use the EST database or the Sequence Tagged Sites that are also in that same EST database.

The most important part of this whole thing is the level of curation. So you have two different types of curation that are here. You have archival data, which is simply a repository. It's the place where everybody puts things, and that's GenBank, an archive of everybody's sequence that they stuck in there along the way. It's redundant. Even though when you look at, I can't show you right now, we'll show you tomorrow, if you run a search in the nucleotide database or in the protein database of NCBI, it says nucleotide NR, which stands for non-redundant, but that non-redundant is a historical term that comes from the fact that they only let each lab put in one copy of that particular gene sequence. So it was unique to each lab, but lots of labs were doing lots of genes and there's lots of redundancy in there, and all of the different projects are in there, so you've got genomic DNA, messenger RNA, you've got EST sequences, you've got lots of different redundancy in there. So GenBank is very redundant and that's one of the problems with it, and as I've said before, there's also no controlled vocabulary. There was no one controlling that metadata that people were putting into there, so if you search in GenBank by keyword, you almost don't find things, because people don't put tags into those sequences. They weren't interested in letting people find things that way. They were interested in putting the sequence in so they could publish their paper.

And there's a lot of variation in what is included in terms of annotation of that gene, because one lab may think it has one particular function because that's what that lab thinks about, and another lab may find the same sequence has a different function because that's what that lab is concerned with, and so there was a lot of variation in what was said about those different genes. But NCBI is putting together this non-redundant database, one record for each gene or splice variant, called Entrez Gene, and the database of those sequences itself is called Reference Sequences, but those are the best sequences that NCBI has pulled out of this whole group of sequences from GenBank.

Each record is an encapsulation, so it brings all of that information together, and the records contain value-added information, so the GeneRIFs are one part of that, and Entrez Gene links out to other resources that NCBI doesn't have. So, Entrez Gene is an excellent resource to start almost everything with.

This site says there's hundreds of databases out there. So here you can see that these are all databases. And this is just one of the metasites for databases. Deciding what to use is part of, I think, what we should be doing for our faculty. Okay, I've said most of the stuff here. You can get more information about what GenBank is. Not a lot of people are doing this anymore, but you can still submit things to GenBank, and there's two different programs that you can use to do that: BankIt and Sequin. BankIt is used by labs that are doing small projects and Sequin is what is used by genome project labs, and Sequin is probably the one being used most nowadays. Very few people are doing actual sequencing projects for small, individual genes anymore.

So in terms of individual labs doing it, it's mostly large scale labs doing genome projects that are submitting.

[Class Participant Speaking].

Depending on which one of those commercial databases it is, so, the protein library actually has PhD scientists reading papers, reading the primary literature, and adding the annotations to each one of them. That is a model that is essentially the Chem Abstracts model of putting a database together. That's used by lots of different companies now. Ingenuity is using it, GeneGo is using it. Well, they're taking the sequence information, that kind of thing, from NCBI, but they're taking the primary data right from the primary literature, so they have PhD scientists reading each of these papers and annotating those things.

RefSeq is the database to use if you want to get what NCBI considers probably the best sequence information, and they've used the RefSeq record to pull together a lot of that, so if you don't go to Entrez Gene, go to the RefSeq record for a gene, protein, or mRNA. You can recognize RefSeq accession numbers by their having these two letters, so this one stands for NCBI protein, and then an underscore and five or six numbers, okay? So the RefSeq record is distinguishable by its accession number.
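The "two letters, an underscore, then numbers" pattern described here can be checked mechanically. The following is a minimal sketch, not NCBI's own validation: the function name is hypothetical, the pattern accepts any number of digits and an optional ".version" suffix (both assumptions beyond what was said in the session), and the RefSeq-style accession strings in the example are illustrative values that were not looked up.

```python
import re

# Two uppercase letters, an underscore, digits, and an optional ".version".
# The digit count is left open on purpose; this is an assumption.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")


def looks_like_refseq(accession):
    """Return True if the string has the RefSeq-style shape described above."""
    return bool(REFSEQ_PATTERN.match(accession.strip()))


if __name__ == "__main__":
    for candidate in ["NP_002070", "NM_002079.2", "U49845"]:
        print(candidate, looks_like_refseq(candidate))
```

A check like this only says a string has the right shape; it does not confirm that the record exists.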

The RefSeq records are accessible, like most other things at NCBI, through BLAST or in a variety of other ways.

Entrez Gene is a place where NCBI has put together the information about a particular gene and the protein that it codes for, or whatever that gene does. RefSeq is the database of those genes and proteins. Okay? So if you look at an Entrez Gene record it's got things all about that particular gene, and if you look at a RefSeq record, it's got, let me show you the difference. I think it's easier to do this.

So here is the Entrez Gene record. It's got all of this type of information, so it tells you about the structure of the gene itself, it tells you about the map context and the chromosome, it gives you all of this other information, but down here, you've got the reference sequences in this record. So here is the protein record for that particular sequence, and now what you see is a regular Entrez Protein page, but the difference between RefSeq and any other protein that's in this database will be that there's a lot more annotation here, so you'll see you'll have a long list of papers about this, you'll have a summary of the function of the gene, you'll have lots of features that are listed here, and the sequence. Okay?

So RefSeq is the database of the genes and proteins themselves, the physical sequence.

[Class Participant Speaking].

It's more basic. Entrez Gene brings all of the information about that protein or that gene into one place. Its function, all of those things.


Yes. Yes.

[ Class Participant Speaking ].

I lost my -- okay. RefSeq contains, um, all of the different types of molecules we've looked at. It has the genomic sequences, the protein sequences. And it's got, um, [ Indiscernible ] models. I showed you that model of what the gene structure is for a couple of those genes. We'll go into more of that later on tomorrow. There are status codes. Because NCBI's still in the process of building these databases, there are different status codes for the curation. RefSeq is a curated database. And you have statuses, just like in PubMed, when a record comes into PubMed new it says in process. Later on it might say indexed. This is the same. You've got the provisional. You've got the ones that have been reviewed. So that is the same as being indexed. You may have a predicted, actually, for the structure of the particular gene. So sometimes the gene structure is not known because of the variance in splicing, there can be question about which ones are real and which aren't, things like that. Genome annotation, once you get to genome annotation you've picked out that region. You've got to the point where that's a very well curated record. [ Indiscernible ] is when they're first starting to put things together. It's not complete. A predicted may be that the only thing that is missing is the gene structure within the molecular map. Okay? So provisional is a beginning, they're bringing in the information, still reviewing what is going to be put in there. [ Indiscernible ] is more that they may not know exactly what the answer is.

[ Class Participant Speaking ].

Oh, yes.

[ Class Participant Speaking ].

It still is, actually. Um, we're very close to having it complete. But there's the 1000 Genomes Project going on right now. So they're trying to now go back and sequence a thousand different humans. Because what we had when it was declared complete, we had five-fold coverage. Because of the way that sequencing is done, and because of the repetitive DNA in the genome, it was difficult to determine order within the genome, and to figure out, because of that redundancy, where particular things may be in relation to other things. So five-fold sequencing is not actually enough to get a complete map of the genome.

I think what we'll do is go to lunch. We'll try to get the flash on to these machines to show you the videos when we get back, okay? What do you think, Chris? Let's make it 1:30 so we don't get too far behind. Everybody back here by 1:30, okay?

Session on lunch break until 1:30 eastern time zone. [ Laughter ]

As you can see we have sound now. We're going to start, there's two videos we'll have you watch. Part of what I covered this morning [ Speaker Faint/Audio Unclear ]. You will see a lot of that in the video. If you have questions after the videos, feel free to ask them. One part that I skipped is the adaptation part of translating, um, messenger RNA into proteins. There's an adapter that comes into play there, it's called transfer RNA. You will see how transfer RNA works, what it does, in one of these videos. Can't get too close [ Speaker Faint/Audio Unclear ].

[ Interference ] [ no audio for the captioner to hear, video closed captions are running at the bottom of the video ]

How did they address the fact that some of these samples of people they had might have had mutations or defects? That's one of the reasons they wanted to get a combination of different people because --

[ Class Participant Speaking ].

Different people.

[ Class Participant Speaking ].

I'm sure that --

Yeah. I mean --

It is a small pool. First of all, the purpose of the human genome project was to get the genome sequenced. It didn't matter about the variation. We wanted to get a baseline of information. The reason that it doesn't really matter if there are mutations is that every generation there's new mutations that arise. You all have mutations because of the frequency of, yes, you, because of the mutation that occurs due to the errors of DNA polymerase, everyone has at least one mutation, if not more, every time you replicate a new cell. There are going to be lots of mutations in these genomes. If they sequence at least five people they'll not all have the same mutation in the same gene in the same place. You look at the variation there. You try to make sure that what you've got, one person may have a defect at that place, the other four might be the same. You look at the consensus, that's what you draw as the normal. Okay? Any other questions?
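The "look at the consensus" step can be sketched as a per-position majority vote. This is a simplified illustration, not the method the genome project used: it assumes the five sequences are already aligned and the same length, the function name and toy sequences are made up, and ties are resolved by whichever base was seen first.

```python
from collections import Counter


def consensus(sequences):
    """Return the most common base at each position of pre-aligned sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*sequences) walks the alignment column by column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


if __name__ == "__main__":
    # Five toy "individuals"; two carry a private difference at one position.
    people = ["ACGTA", "ACGTA", "ACCTA", "ACGTA", "TCGTA"]
    print(consensus(people))
```

Because each individual's difference sits at a different position, the other four outvote it and the private changes drop out of the consensus.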

[ Class Participant Speaking ].

What did we just do?

[ Class Participant Speaking ].

How did we get back to that?

[ Class Participant Speaking ].

I'm going to go back to the modules home page to start from fresh. You guys will do a lot of interactive work this afternoon. You might want to do that yourselves, as well. Real quick.

Yeah, I don't think any of your computers have the module's address set up on it. If you go to the NCBI home page, on the left there's a link to the education. It's on the first education page. You can also Google the title of it if you need to bring it up. Get that off of screen. All right, is everybody kind of vaguely getting to where we want to be? You're not connected at all, either? Okay. I'll get started up here while this works. You can feel free more of this. We'll go over the format of one of these GenBank records. We've been showing you little pieces of various records. I think seeing a whole one and going through it step by step, being familiar with the fields and the data that goes into them, will be a good orientation for us. The other thing is the NCBI site map. If you come away with nothing else, at some point take a look at the NCBI site map. That's how they try to structure all the information that they have available. This GenBank sample record that we will go through is an entry in their site map. If you wonder where we got this, you can find it from NCBI's site map. I will

So, this is NCBI's homepage, and on the left-hand side, do you see that it says " Site map"?

So there's all kinds of different resource tools here, let me just maximize this for a minute. And what I am actually looking for is where they hid the sample record now. Let's go to the alphabetical list, and there's the GenBank sample record in that list. So for some reason, today --

[Class Participant Speaking].

Yeah, I think the alphabetical list is easier, but you have to remember it's a GenBank record as opposed to a sample record or something else.

This module is very straightforward, and because showing the module to the audience means I'll be jumping back and forth between the module window and the sample record window, I think it's better for all of you, if you have a connection, to go ahead and open the live record here and just work with that window. The module's windows will be up here, so if you want to check up and see if I'm saying what's actually on the slide, you could certainly do that, but mostly just have the record in front of you, because that's really what we're going to use as a tool, and I think if I keep moving back and forth between the two things I'll make you a little bit crazy.

So basically, there's really three sections to the record, and these groupings, I think, are going to be a little bit deceptive, because even though they're calling them the Header and the Descriptors and the Sequence Data, the record you're looking at on your machine doesn't really call it that. So there's no "search the header" kind of section. There's search the title, in terms of all fields for example, so the title in all fields is really this definition line that we're going to focus on here, and let me go ahead and open the record. So we're going to go through all of this detail, and I think David had mentioned earlier that the naming of these records, we're going to show you, is based on what the scientists put in when they made their submission. We'll show you an archived record, which is just literally reflecting what the scientists that posted to GenBank put into the record, and so in terms of all of the value adding things he was talking about that they do in Entrez Gene or in RefSeq, this first record we're focusing on isn't that kind of record. We'll get to showing one of those kinds of records a little bit later. So we'll go through the specifics of where it came from, what organism it came from, which scientific features the researchers did bother to annotate because that's what they knew about at the time. But as people who work with controlled vocabulary and adding value and all this, you may be looking at some of these records and thinking they could have added more keywords that would have helped, or the keyword field might be completely blank, so you'll probably tap into what some of the information gaps in these records are, and then we'll have the sequence itself and we'll have a look at that.

You notice here that it says in the archival database, all of the sequences should be experimental sequences, not the kind of hypothetical or predicted sequences that you all had a little bit of a discussion of earlier, when you were talking about RefSeq and you were talking about provisional vs. predicted. Does everybody remember back to this morning a little bit? All of these are archival pieces where the scientist submitted the experimental sequence. These are not the predicted things that we were talking about.

So, let's just go ahead and look at this record, and we're going to talk our way through this record, but this also happens to be the record that is that GenBank sample record, so if you forget what anything is that we've talked about, if you actually go to view that sample record, what you'll see is that all of these things will have little highlighted things that you click on and it will explain what it is that you're looking at in the sample record; okay?

So don't worry about all of this up here. What we're going to do later this afternoon, when we go over how you search Entrez Gene and how you move back and forth between the various databases, we'll talk about all of these different search features and pull-downs and what these things do, so for the moment, leave that aside and just focus on the piece of data that's in front of you in terms of the record.

So, let's talk about this first line. Can anybody identify anything up here that makes sense to them, based on what you've learned so far this morning?


[Class Participant Speaking].

The one you're looking at has how many?

1028 base pairs, okay, but what kind of molecule is it that we're looking at here?



Is there anything else that we can can tell from up here?

Yeah, it has a date, but we're going to talk a lot about dates. Dates are something that we have to be very careful about. A lot of people are submitting to this either in relationship to, as David was talking about, publication, or in some cases depositing before or coinciding with their patent related work, so NCBI is very sensitive about how it talks about dates, so we'll come back and talk about dates much more.

Okay, so you've got basic information there. This definition line up here, this is what they're talking about when they talk about the title field, so if they talk about limiting your search to the title field, most of the time they're talking about searching this definition line, okay? And a lot of what you'll see in the definition line, and actually all of what you see in the definition line, is completely unstandardized. I do think there's an effort to try to be more standardized. Let me not say that. Let me say that there's no absolute requirement that it be presented in a certain way. I do think there's certain conventions of behavior that people definitely try to follow. This is not abnormal to what you would see in the rest of the database, but there's nothing that says, for example, that you have to have the short name of the gene in parentheses after a different name of the gene, as you're seeing here. So what you have here is the organism and some information about the gene, and different genes within it, and as I said they give you the longer name and the short name, and they said for some of these you have partial coding regions and for some of these you have complete coding regions; okay?

So some of this is a repeat, so you see this is the organism, but then you also have an organism field, and in this case, you see that happens to be the same name. This confuses people quite a bit, and I will say something about searching real quick but we'll come back to this. People keyword search, you know, they say whatever gene, and human, all the time, and they will think that means they're searching and limiting to the human organism, but many times they will have a gene that actually, you know, it's a mouse gene, and the gene name and the organism that the actual sample was from are completely different. So we'll show you some examples where you can search on colon cancer, human, but you come up with mouse things, and it's only because in this definition line part of the naming is that it was a human gene, but it happens to be in the mouse, and so the mouse gene name has the word human in it and vice versa. So there's definitely a reason to limit your searches to organism if what you really want is the rat gene or the human gene or whatever your focus is.
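The difference between typing "human" as a keyword and limiting to the organism field can be sketched as two query strings. The helper names below are hypothetical; the `[Organism]` field tag and the ESearch endpoint are part of NCBI's public Entrez search syntax and E-utilities, but the exact query is only an illustration, and this sketch builds the URL without sending any request.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def keyword_query(keywords, organism):
    """Plain keyword search: 'human' can match anywhere, including gene names."""
    return f"{keywords} {organism}"


def organism_limited_query(keywords, organism):
    """Restrict the organism term to the Organism field with a field tag."""
    return f"({keywords}) AND {organism}[Organism]"


def esearch_url(database, term):
    """Build (but do not fetch) an ESearch URL for the given database and term."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": term})


if __name__ == "__main__":
    print(keyword_query("colon cancer", "human"))
    term = organism_limited_query("colon cancer", "human")
    print(term)
    print(esearch_url("nucleotide", term))
```

The first query is the one that can return mouse records whose definition line happens to contain the word "human"; the second ties the word to the organism field.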

Organism links to NCBI's taxonomy. Have any of you used NCBI's taxonomy before? Okay, a few of you. So it represents only the taxonomy for organisms that are actually in GenBank. So there are many, many organisms in GenBank but certainly not everything, so it's not a complete taxonomy, and those of you who do a lot of biology reference probably know of other, more complete tools, but for the ones that are in here, it's very connected with all of the other resources. In fact, since most of you haven't seen it, we'll just pop over for one second.

So this is the taxonomy browser and we're not going to go through it now but I just wanted you to be aware that that's what it looked like and it's not a complete taxonomy. I think we'll see a little bit more of the taxonomy when we do the BLAST links tomorrow. I think there's a little bit of the taxonomy connection there.

So this is all of the background information about what you have. You notice the keyword field is completely blank, so they didn't use all of the opportunities they could have used to tell you more about what techniques they used or if they had any other retrieval help ideas for you. We do have multiple versions of this record and we'll talk about version control in a little while. Once a scientist submits to GenBank, they can always come back in and make changes and additions and corrections to their record, so if they later on publish another paper they want to link up, or if they sequence a little further along than they did before, that's going to be a different kind of addition, so there's lots of changes they could make to the record. So somebody could come back and look at this and go, well, I feel really guilty I didn't put keywords in the first time. Let me put keywords in this time, and they would be able to do it. Everybody else can't do that for them. If somebody else did it other than the initial person submitting it, that would be one of the third party annotations that I mentioned this morning that we aren't going to talk about.

Okay, so this submission has several references in the literature, and you see each of the references relates to the whole length of the sequence. Sometimes what you'll have is a reference where the reference was only talking about the first part of the sequence, and then later on they had one that dealt with more of it, but in this case, it was the whole set that Jennifer, right? The whole set of base pairs that Jennifer mentioned was the subject of the papers involved. And these are all hard links, so these links of course will take you out to PubMed, to the references for the article, but if you were in PubMed looking at the reference for the article, all of them would have hard links back to the nucleotide database so that you could see this record, so that's that one-to-one linking.

In terms of how this got into the database to begin with, you can see that it was a direct submission from Terry Romer's lab at Yale, and you notice that they've forced that direct submission into the title, journal, author kind of framework even though it wasn't published at that time. And this makes sense, because most people have to submit it before they publish it, so they don't have originally the paper citation, and then once it's published they go back in and say that it was published in Science and I'm really proud and here is the citation and it's all good. Okay?

So, this is kind of all of that background information and then you get to the more [INAUDIBLE], and this feature area could be really rich, depending on how much work the scientist has done on this sequence, or it can be fairly plain. Now, here, remember we said this was a record that had multiple genes in it, because we knew from the definition line there were several genes involved? Everybody with me on remembering that? So now what you see is the information about the different genes, so here is an example: out of that whole range from 1 to 5028, between 687 and 3158 is this gene. Does everybody see that? So this morning, when your resources kept offering you those to and from boxes, does everybody remember that, where you see to and from? So if I was working with that gene and I knew it was 687 to 3158, but I didn't want to try and copy and paste out just the part from 687 to 3158, I could have pasted in the whole thing and said only look at the part from this place to this place. In this case you already know what this is so you wouldn't be doing this, but I'm just using that as a to and from example. And later on, when we start working with Entrez, you'll see that you have a range box up here from begin to end, so if you were saying, look, I only want you to look for things that look like that piece of it, you could type in the range of it so that you wouldn't have to deal with the part that you don't want. Okay? And there's many, many strategies to get at that. That's just one of the ways you could do it, okay?
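The "from" and "to" numbers in a record count bases starting at 1 and include both ends, which is easy to get off by one when cutting a piece out by hand. Here is a minimal sketch of that convention in Python; the function name and the toy sequence are made up, and only the coordinates 687 and 3158 come from the record being discussed.

```python
def subsequence(sequence, start, end):
    """Return bases start..end, counting from 1 and including both ends."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("range must lie within the sequence")
    # Python slices are 0-based and exclude the end, hence start - 1.
    return sequence[start - 1:end]


def range_length(start, end):
    """Number of bases covered by an inclusive from/to range."""
    return end - start + 1


if __name__ == "__main__":
    print(subsequence("ACGTACGTAC", 3, 6))
    print(range_length(687, 3158))
```

Typing the two numbers into a range box does the same job as this slice: the whole sequence is supplied, and only the stretch between the two positions is used.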

The other thing that these do, you see all of these little parts that start with the slash? These are all NCBI's way of labeling these features, and this will become important to your searching later because things like mol_type and all of that, when you look in the preview index, how many of you have used preview index in PubMed? When you start to look at the names of things and how they label them, it will be helpful to kind of cross reference between what they're talking about, so you just kind of start looking at how they talk about things. What else do we know about this from the information and features? I'll give you a minute or two to look at it and see what else jumps out at you as what we know about this from this record. Great. And what else do we have, what are we looking at here? We're looking at a translation, so we have, when we jump down you're going to see that we have the original, and there's quite a lot of it, here is our 5,000 base pairs of A's and C's and T's and G's we were expecting because we're in the nucleotide database and we know that this is DNA because the definition line said DNA, everybody remembers that? So, you would expect to see all of this data, but what you also have is information about the protein translations, you have links for the protein record, do you see where it says protein ID, and then you have the actual translation.
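
As a rough illustration of how those slash-prefixed labels can be read by a program, here is a small Python sketch. It assumes each label sits on its own line in the form `/name="value"` (or just `/name` with no value); the function name and the sample lines are hypothetical, and values that wrap over several lines, such as a long translation, are deliberately not handled.

```python
import re

# One feature label per line: a slash, a name, and optionally ="value".
QUALIFIER = re.compile(r'^\s*/(\w+)(?:=(.*))?$')


def parse_qualifiers(lines):
    """Collect /name="value" labels into a dict; labels without a value map to True."""
    qualifiers = {}
    for line in lines:
        match = QUALIFIER.match(line)
        if match:
            name, value = match.group(1), match.group(2)
            qualifiers[name] = value.strip('"') if value is not None else True
    return qualifiers


# Hypothetical lines in the style of a features table, not copied from a real record.
sample = [
    '                     /mol_type="genomic DNA"',
    '                     /gene="EXAMPLE1"',
    '                     /protein_id="XXX00000.1"',
]
print(parse_qualifiers(sample))
```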

Here is the other gene. Remember we said there was also the REV7 gene, the second gene that was in parentheses, so we have that here as well and you can see that that is actually a smaller section towards the end of this whole string of sequence. Okay, is everybody comfortable so far with what you're seeing? Yeah?

[Class Participant Speaking].

Okay, that's a great point, so Cara was asking what does it mean when it says "complement", and this goes back to David talking about or showing earlier the complementary DNA, and so I think this has to do with it being on the reverse strand, so I think the previous one was on, yeah, the first one was not on the complementary strand, and this one is. And right now, it's not such a big deal as we go through the record, but you see up here at the top, you can reverse the complemented strand and you can actually, if you carry this into some of the other displays, you can see which gene is on which strand and which gene is on the other strand. Do you want to add anything to that?

It can be both. Either they don't know what the rest of it is or the rest of it could be, you know how he's talking about introns and exons? Like these genes are exons but what's on either side of it, it could be a lot of different things.

[Class Participant Speaking].

That's true too, right. So I'm just going to repeat what he was saying, that since we're looking at DNA and not at messenger RNA, we have extra sequence which could be introns or it could be untranslated, but that's a good question. And so okay, so one of the other things that I wanted to point out that came up this morning, we were talking about pasting in lines of sequences to things, and about whether people should just be copying and pasting and how your sequence was wrapping, and you see that in the default GenBank display, they try to number these lines of sequence to give you a ballpark if you're trying to eyeball where did that 68 start, you can kind of start with 61 and kind of look over, but if you were actually copying and pasting out chunks of sequence and you copied and pasted this and left the numbers in there, you would be causing problems with the tools, so what you want to do is you want to actually take it in the FASTA format, and you'll learn more about FASTA tomorrow when we do BLAST, but basically what this does is it takes out all of the numbers, and don't worry about capitalizing versus not, it's going to make no difference for this, but you get two things. Does everybody see that little caret at the beginning? That is basically, this is a definition line and it just ignores it, because the caret is there, it knows not to treat it as part of a sequence, it's like HTML, when you put something in the brackets it knows that it's a tag and it reads it a certain way, so that thing we did this morning that said "Anonymous", it said that in terms of your finder results because there was no definition line, you just gave it a piece of unnamed sequence and said "Here you go".
Had you had a definition line that started with the little caret, that said sample fly sequence and you started it with the little caret, then this would come up and say "Sample fly sequence", so this happened to me a lot doing reference, as people would come to meet with me and bring their sequences and we would start putting them in and comparing them or searching them and they wouldn't have definition lines on them and I'd be trying to keep track of was that file, you know, sequence 1.txt or sequence 2.txt, so encourage people, if they bring you something down that doesn't have a definition line in it, have them at least throw in a caret with something so that when you get your results back you see, you know, whatever they want you to call it, it doesn't matter what they call it as long as they give it some kind of name so you don't get yourself too mixed up, does that make sense to everybody? It may not be so bad when you're comparing two files against each other, but when you start to do like multiple file comparison, if they don't have definition lines you can get very confused very quickly. And you can see, it could be anything in any format. You could type in there, you know, my dog's gene sequence and it would accept it. It doesn't care what you call it.
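
To make the point about numbers and definition lines concrete, here is a small Python sketch that turns lines pasted from a numbered display into FASTA text. It assumes the definition line is marked by the `>` character (the "little caret" above) and that everything other than letters in the pasted lines is position numbering or spacing. The function name, the 70-character line width, and the sample lines are illustrative choices, not anything prescribed by NCBI.

```python
import re


def numbered_lines_to_fasta(name, numbered_lines, width=70):
    """Build FASTA text: a '>' definition line, then the sequence with numbers and spaces removed."""
    # Keep letters only, which drops the position numbers and the spacing between groups.
    residues = "".join(re.sub(r"[^A-Za-z]", "", line) for line in numbered_lines)
    body = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([">" + name] + body)


# Hypothetical pasted lines in the numbered layout (shortened to two groups per line).
pasted = [
    "        1 acgtacgtac ggttaaccgg",
    "       21 ttgacctgaa tgcatgcaag",
]
print(numbered_lines_to_fasta("sample fly sequence", pasted))
```

Any label at all can go after the `>`; the only goal is that results come back with a name you can recognize.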

And so now, you could copy and paste all of this out or you could use the send to text file or clipboard functionality that you're used to using with PubMed. Also again if you didn't want to take all of it but you only wanted to take the complementary part from 3300-4037 that Marie was talking about, you'd say take this part to this part and take it from the complementary side, not from that side. Does that make sense to everybody? So you can tell it to take only what you want or you can take the whole thing and deal with it that way. Is everybody with me on that? Okay. Anything else about this record that we want to look at?
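
For anyone who wants to see what "take it from the complementary side" means at the level of letters, here is a short Python sketch. It assumes plain DNA letters (A, C, G, T) and 1-based inclusive positions; the function names and the toy sequence are hypothetical. A piece on the other strand is read as the reverse complement: swap A with T and C with G, then read it backwards.

```python
# Translation table for swapping each base with its partner on the other strand.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Swap each base for its partner, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_range(seq, start, end, on_complement=False):
    """Take positions start..end (1-based, inclusive), optionally from the other strand."""
    piece = seq[start - 1:end]
    return reverse_complement(piece) if on_complement else piece


# Hypothetical 8-base sequence for illustration only.
toy = "AAATGCTT"
print(extract_range(toy, 3, 6))                      # same strand
print(extract_range(toy, 3, 6, on_complement=True))  # other strand
```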

Let me show you one more thing. Like all things in Entrez, the links menu is going to be extremely rich, and one of the things that I think has been kind of interesting, this record actually doesn't have a very rich bed of things by comparison, but you know, it used to be that they would list out all of the things they had links to, but as there became more and more different databases, suddenly they all collapsed and there's that tiny little thing that says "Links", but I've run into a lot of people that have never really noticed that link that says "Links" or never pull it down, so to be fair everything that's in here under links is also in the body of the work, so here, let's just go through that real quick. Because this will maybe orient you. So where it says in links gene, where do you think that's taking you? To Entrez Gene, and within the body of this record where you have genes, there should be built in hard links to gene. Full text in PubMed Central. Well, the question would be that you had multiple papers in here, right? So you have PubMed and PubMed Central, but remember you have multiple PubMed papers here. You have two papers from PubMed, does everybody see that? So you could go one by one and click on these individual IDs and go to one paper, to one paper, or hopefully, and let's just see if this works, if you were over here in links and chose PubMed, how many should it give you? Let's see, does it work? Yes! Okay, something is right with my world here! And you see that one of these is free in PubMed Central and that's why you have that.

So that's the kind of relationship you should be having. These should all be, where it's a direct link to a database like protein or like taxonomy, it should be the equivalent of however many matches there are, so there's only one organism so taxonomy should take you to that one organism link. So beyond the hard links, then you get into the related sequences and that's no longer a direct hard link one-to-one. That's an algorithm that takes this sequence and uses the BLAST algorithm that we're going to talk about tomorrow to show sequences that are related up to a certain cutoff score, and then we'll talk about that a little more this afternoon when we do all of the different Entrez searching, because some people like to do that kind of discovery where they use related articles or related sequences and some people will not touch that with a 10 foot pole because they don't understand enough about the algorithm or what the cutoff values are so they don't trust it, or why did it pick up this and not this, so I think there's two camps there and there will definitely be the camp that is like I prefer to go out and do my BLAST search and set my parameters. This is designed to be kind of a pre-calculated quick kind of thing. Think of it like PubMed's related articles but working on the sequence level, not on the word in the record level. Okay?

[Class Participant Speaking].

That's a good question. Let me actually go to the GenBank display. Let me go back. I think that the difference will be up here in the reporting line but let me see. No, it's the same. When we go on break, I'll have to explore that. I don't, on the face of it, see anything that looks very different. Does anybody else know?

[Class Participant Speaking].

It would make sense. A lot of these, you see it has the features jump to and the sequence jump to? These are to allow you to navigate through the record so you can jump down to the features table or the features key as they call it, or somebody who wants to jump right to sequence, and remember this is only based on two or three papers, right? Or two papers and the direct submission. If you have a record where they published a lot of references, the gene reference and functions that David was talking about this morning, that middle section could be really really really long, or that top section could be really long, and so somebody who just wants sequence probably does want to be able to jump all the way down to that, so that's a good point, yeah, so Kevin we'll make sure we do that in the full record.

The other thing that is kind of hard to see is, does everybody see up here between the hyperlink number and the kind of brief version of the display, that there's this little reports link? That's relatively new and if you click on that, you get two things. One is that you get the option to do a lot of the same display features. In fact I think 99% of that is duplicate to the display options. The thing that's not is the revision history, and so when you guys go to do your sample exercises for this module, there's a question about the revision history of something, and so keep this revision history function under reports in mind, but also when you go to do that, don't get overwhelmed by it because as I said NCBI has very particular ways of dealing with dates and they don't actually expect you to ever kind of come up with what the initial date of submission to NCBI was. Like they like people to e-mail them and they will check it for them because it's usually related to some kind of legal inquiry, so the database is not reliable in terms of that first date. Let me just show you for this one. I think it will make more sense if I show you. You don't mind skipping around a little bit do you? Okay.

So you see here that it says this was first seen at NCBI on May 2, 1996, at 12:28. If somebody comes to you and says, "When was this deposited with NCBI?" Don't give them this date. Say you need to e-mail the information at NCBI and ask them and they will give them the legally approved absolute accurate date and time. This is a database generated thing, and so when it actually went out in the database world, it may not actually be the time that they received the initial e-mail submission, you know, so this is not a legal answer. If you want the legal answer, you should go to NCBI for it and they will turn it around very quickly. They just say, for those of you doing commercial implications of the work, don't rely on this.

[Class Participant Speaking].

I think that's why they, I think they're also very concerned about that, but the way that the system used to work with people, like e-mails coming and people updating various pieces of it and how it got in and when it got checked and all of that, I think they're very sensitive to the fact that they can't present that data that way and so they don't want people to rely on it, but it's a very good point to say that this is not your typical way to expect databases to behave.

[Class Participant Speaking].

Okay, so, the current version of the record that had been most recently updated is the version that's live and displaying in the database. I mean, dead is probably not the right word. It's just when you do a search of this database, it should only retrieve the live records. If you want to retrieve the dead records, you have to get them from the revision history, so it's like saying it doesn't want to show somebody who is naive the superseded record. It's only going to show a naive user who comes to do a search whatever version of the record is the most current record, that is the live record. Maybe obsolete would be a better term than that, or superseded or something, I agree, but it means what we suspect it means.

And usually what happens if you go to one of these, it should tell you that, what I'm looking for is where does it tell you that it's actually the superseded record? It should have told us?

[Class Participant Speaking].

That's what I'm saying! You can also tell it to show you the difference between them, so it shows you what changed between the two revisions. So this can be helpful if you're trying to figure out if what has changed is a major change versus just some small thing, so like here, they used to have MEDLINE IDs but now they have PubMed IDs, makes sense to everybody, right? Things like that. Any questions about this particular record, this very simple record? Good!
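
The same kind of comparison can be tried on any two saved copies of a record. The sketch below uses Python's standard `difflib` module; the record lines are hypothetical stand-ins, not copied from the revision history, and the helper name `changed_lines` is made up for illustration.

```python
import difflib

# Hypothetical lines standing in for two revisions of one record.
older = [
    "REFERENCE   1",
    "  JOURNAL   Example J. 1 (1), 1-10 (1996)",
    "  MEDLINE   00000000",
]
newer = [
    "REFERENCE   1",
    "  JOURNAL   Example J. 1 (1), 1-10 (1996)",
    "   PUBMED   0000000",
]


def changed_lines(old, new):
    """Return only the removed (-) and added (+) lines between two revisions."""
    diff = difflib.unified_diff(old, new, fromfile="older", tofile="newer", lineterm="")
    # Skip the '---' / '+++' file headers; keep lines marked as removed or added.
    return [line for line in diff
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]


for line in changed_lines(older, newer):
    print(line)
```

A short list of changed lines like this makes it easier to judge whether a revision was a major change or just an identifier swap.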

So I'm just going to open the GenBank sample record that we talked about for a minute, so you can see how it's set up so that if you have any questions about what we talked about before, you can come back to it. Okay, so this is the sample record that we got from the site map, the alphabetical site map. Does everybody kind of vaguely remember how we got to it? Everything that's in blue here has been hyperlinked with explanations, so if you're like well what is this SCU49845, it will jump you down to an explanation of what that meant, okay? Or the sequence length, you guys were really good. You already know that it was 5,000 base pairs, but if you wanted to know more about the sequence length, you could jump down to that part of the record.

The main part of the record that I find myself frequently needing help with, and this record just doesn't have a lot of them, is in these features, so sometimes they will have a feature in here where I really won't understand biologically what this is or what they're talking about, and this is just me. You know, people like David reading these records may go, Oh, I know exactly what this feature is they're talking about. And some of them probably seem more familiar after today's lecture in terms of where the start codon is for example, or what function a particular gene has, but there's been other things in this where I've been like, okay, what is that? And there is a guide that explains, so like for example, this little reverse caret, earlier I was showing that they meant this definition line is a definition, don't read that. Well, these mean something else in that part of the record, and you can see the explanation for them. So sometimes the same symbol might have different meanings depending on where you are in the database, and so going back to Marie's question about the complement, so if you didn't understand what it was saying about complement, you could come here and read it and it will explain, Oh, it's on the reverse side, it's going the opposite direction, here is how that's working, so I highly encourage you, before you go to start teaching, when you're going to do a workshop on this, you know, pick a sample record that you're going to teach with and step through and make sure you really understand everything that it says, because the question that somebody is going to ask you, and don't get me wrong, I like teaching where you run with what the audience asks you to search on, in PubMed, I'm always game for that, like tell me what you're looking for and we'll search on it live? Don't do that with this your first time!


Pick a record where you know the answer to everything so that you feel really confident, like pick something that's clinically relevant or scientifically relevant to the research done at your place and know what it is, like know it cold, so that not that you can't not know something and still have good vibes of confidence with your scientists but it really helped in terms of getting up for my first class to be like, I know this record inside and out. I know what every one of these features is. I know what that organism is. I can pronounce what that organism is. You know, pick something that you can feel really comfortable about and then let the other stuff kind of fall from that.

The sample record is a really good way to do that. I just don't teach with this sample record because this sample record is so unrelated to the work that happens with my people, that they would be like why are you showing me that? That's where you pick something where everybody is like yeah, we're really into learning about whatever this is so it's more exciting.

So we're going to finish going through the other records and then take a break when they're setting up all of your cookies and things like that, so we'll probably go for another 10 minutes and have you have time to do the sample exercises and then we'll take a short break.

Okay, so we had looked at the nucleotide record for primary or archival sequence that was deposited by a scientist and so what we're going to do now is look at a protein record that falls into that same archival situation.

Now, this is the static record that was captured at the time that this was done, and I'm going to just walk you through a few things here but frankly you could go live, you could just follow the link to go live in Entrez, or however you want to do that. Okay, so what's different, you know, given that we just familiarized ourselves with the DNA record, what's different here that we're looking at in this protein record?

[Class Participant Speaking].

Yeah, so your length is in amino acids. A lot of these do begin with A, but the lettering system, it's all off now, when they started out, I don't think they anticipated how big these databases were going to get so they started out with like lettering groups that made sense biologically and everything else but now they're just out past that and like whatever new combination of numbers and letters that has not yet been used is all good.

Okay, so do we have multiple versions of this record? Yeah, okay, and we don't know yet like what the changes were but we do have something.

[Class Participant Speaking].

Wait, you have two things. One is a version number with a .1 which means the first new version, so that's the main way you could tell, I mean, you'll definitely see records where you'll have like version with a point, there's a .1, .3, .4, depending on how often the people like to touch their records, you might see quite a bit more of this. Did we get any helpful keywords from this scientist? No. Okay, what organism is this?

[Class Participant Speaking].

This is one of the ones where they're kind to us and you could type in human in the organism field and it will map to Homo sapiens.

Okay, so here, when we talk about this, you notice that they're talking about residue as opposed to amino acids in the body of the paper. I don't remember if we talked about residue in the protein section this morning but you'll see that, sometimes you'll see residue instead of amino acids.


[Class Participant Speaking].

It's probably good she has the cell phone, {LAUGHTER}.

There we go. Okay, so, where were we? All right, so it's then talking about the whole length of the protein sequence, because it's talking about the whole set of residues. We have a published article in what prestigious journal?


Yes, these people, and they also have the direct submission data from the initial lab, and this is a little interesting. I actually am not sure what they mean in the comment. The comment field is another free form field that they can use, and this has a comment that says, "Method: conceptual translation". And so, you know, we were talking earlier about what was experimental and what was not, and all of that, and that would be one of those things that I'd probably go back and say let me make sure I understand exactly what he means when he calls this record a conceptual translation, just to be sure.

So, here is an example though where you see a couple of things. So we have the product of this. This is the human MLH1 gene and remember we said these are archival, right? Does anybody remember when you did gene with David this morning? I don't think you actually looked at this gene yet, but do you think the name of the gene was H for human MLH1? Some say yes, some say no. It depends on how far ahead you read in all of the modules before you got here, {LAUGHTER}.

So the actual name of this gene now doesn't have the H in front of it. It just has the MLH1 and a lot of the materials you have in your thing, but they aren't going to go back and make that change, so just because the gene might have been called this, that or the other, this is archival so whatever the person submitted is what they're going to call it. This happens with a lot of other, you know, a lot of other things. They didn't know what it was so they called it this, and once they figured out what function it had they wanted to give the gene a name that more related to the function that it had, so there's a gene I always use that relates to Alzheimer's disease and that makes sense to people, but originally it was called the S182 protein, and so some of the records say S182 protein, and so that happens, because this is archival, nobody is tying together all of the name changes. If you wanted something that tied together all of the name changes where would you go look?


All right! So we're getting like that teaching point across very well here. So here is an example of what I was talking about before. This particular gene was in the human organism, right?


But this note says that it's a homolog of the yeast gene, so somebody who was keyword searching on yeast, not in the organism field, but just keywording yeast would get this record, because it does say yeast in the record, even though it's not yeast the organism, it's just a homolog to the yeast gene, so you see the difference between just the regular, and when I say keyword searching I don't mean searching in that keyword field which everybody is leaving blank. I'm talking about text word searching, kind of non-field specific searching, so I'll try to say text word instead of keyword so I don't get us confused, but if I do you'll know what I mean. Yeah?

[Class Participant Speaking].

Okay, in PubMed it would be different because they don't actually have an organism field. PubMed would have the MeSH term for that organism, so you would search that organism in MeSH, but you wouldn't know whether it was about that organism or papers of findings within that organism. I don't think you can search that way in PubMed. What you would do is I think you'd do the PubMed search and then follow out the link and then apply the organism field that you wanted to that subset of results that you linked over from PubMed, and we can do that when we go into Entrez Gene searching later this afternoon. If a lot of you think your approach to it is going to be from PubMed to the sequence databases, we can do that. In fact the first class I used to teach at Cornell, and Kevin I don't know, but the first class we taught was all about if you started in PubMed, where would you go to all of these databases from PubMed, like what were the hard links, related links, but things like you're saying, what would you find in PubMed and how would you transfer that into searching other databases, and there's reasons you would want to do that, and the two reasons we always gave about why you would want to do that.
One is that if you needed technique information, that is you needed to find sequences that had been identified by particular techniques, a lot of the scientists are not putting their technique information in these records, but in the PubMed indexing of the article, that says how they got this information, they probably describe in the PubMed article level the techniques and methods and materials that they use to do it, so the indexers would probably cover, they had done this work by, you know, PCR like they keep talking about that today or by restriction fragment length, the technique information would be in the article so the indexers might pick it up, so you would search in PubMed related to the technique and then follow the links out to the sequence data that actually came as a result of that technique, so that's one reason you might want to use PubMed. The other is that when you get to things like human colon cancer that we're working on, you don't have MeSH here. There is no MeSH, so if I type in colon cancer it's literally searching for the word colon cancer, so if everybody is writing about colonic neoplasm, they aren't, but if they were, or if you were using colon adenoma or something, you're not going to get those records unless they actually use those text words, so that's another reason maybe to go out and search with MeSH, you know, use a very broad thing, use your synonyms, use your MeSH and then say show me the records that came up with that search that have nucleotide records or Exw records or whatever kind of sequence you're actually looking for. So it's a good conversation to have and it's a good strategy. If you're an expert PubMed searcher and feeling a little bit like well, how do I go from PubMed to here, I think it's a very safe strategy.
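
A toy Python sketch of the difference between a literal text word search and a search that expands to synonyms first. This is only an illustration of the idea: the records, the synonym list, and both function names are made up, and nothing here reflects how Entrez or MeSH actually work internally.

```python
def text_word_search(records, phrase):
    """Return records that literally contain the phrase (case-insensitive)."""
    phrase = phrase.lower()
    return [r for r in records if phrase in r.lower()]


def expanded_search(records, phrases):
    """Return records that contain any one of several synonymous phrases."""
    lowered = [p.lower() for p in phrases]
    return [r for r in records if any(p in r.lower() for p in lowered)]


# Hypothetical record titles, written only for this example.
records = [
    "Gene A expression in human colon cancer",
    "Gene B mutation in colonic neoplasm samples",
    "Gene C variant in colon adenoma tissue",
    "Gene D in yeast",
]

# Hand-picked synonyms standing in for what a controlled vocabulary would supply.
synonyms = ["colon cancer", "colonic neoplasm", "colon adenoma"]

print(len(text_word_search(records, "colon cancer")))
print(len(expanded_search(records, synonyms)))
```

The literal search finds only the record that uses the exact words, while the expanded search also picks up the records written with the other terms.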

Any other questions about that? Okay. All right, well let's just finish this up. Remember that wealth of protein databases that you were introduced to this morning, all of the Swiss-Prot and everything else? We're talking about protein records now and many of them, as we've mentioned earlier, were brought in from these other databases and so you can see that they give you the numbers in those other databases.

And then the same thing with the sequence again. And this time, it's not A's, C's, T's and G's, so much, it's your various amino acid groups. These are all single letter by the way. You have the three letter groups but there's also single letter groups and these are the single letter groups. Okay?
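
For reference, the single letter and three letter codes for the 20 standard amino acids can be lined up in a small Python lookup table. The function name and the sample string are illustrative; letters outside the standard 20 are reported as `Xaa`, the usual placeholder for an unknown residue.

```python
# Single letter code -> three letter code for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(protein):
    """Spell out a single letter protein sequence using three letter codes."""
    return "-".join(THREE_LETTER.get(letter.upper(), "Xaa") for letter in protein)


# Hypothetical three-residue example.
print(to_three_letter("MSF"))
```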

So if we take this record live, and see if there was anything different --

[Class Participant Speaking].

You mean about these chunks?


That's a good question. The question was -- yeah. That's a good question. I don't know why, the question was why does the default GenBank display, whether it's for nucleotides or for proteins, why does it chunk them the way that it chunks them, and I don't actually know the answer to that because you saw when we go to get the FASTA sequence we'll use for other purposes it doesn't chunk them that way at all. It's just a pure thing. Does anybody know?

[Class Participant Speaking].

Oh. So it's for a very practical, so it's about counting off in groups of 10, so it's not like there's any biological reason that it's 10. It's a human factors issue, okay. Excellent. Did you have a comment about that or you knew that's what it was?


[Class Participant Speaking].

That's just the first numbering, that's so that you can count them all together, so we said this is from one to what was it? 761? Yeah, so it's just to count it off, so the one is just saying they started from this number. That's completely arbitrary. Where this actually lies on the whole chromosome, you know, it could be the 60 thousandth thing down on the list. It's only within the group that they're counting that it's a one that is set at the beginning. That's a good question. Okay? Yeah?
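
The counting-off layout is easy to reproduce, which may help when explaining it. This Python sketch assumes groups of 10 with 60 letters per line and a right-aligned starting position at the front of each line; the function name and the column width are illustrative choices. The numbering always begins at 1 for whatever sequence is passed in, regardless of where that piece sits on a chromosome.

```python
def numbered_display(seq, group=10, per_line=60):
    """Lay a sequence out in space-separated groups, each line led by its starting position."""
    lines = []
    for start in range(0, len(seq), per_line):
        row = seq[start:start + per_line]
        groups = [row[i:i + group] for i in range(0, len(row), group)]
        # Positions are 1-based and relative to this sequence only.
        lines.append(f"{start + 1:>9} " + " ".join(groups))
    return "\n".join(lines)


# Hypothetical 65-letter sequence, just long enough to wrap onto a second line.
print(numbered_display("acgt" * 16 + "a"))
```

With this layout, position 68 is found by going to the line that starts at 61 and counting eight letters across.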

[Class Participant Speaking].

That's a good question. Generally, they seem to try to link everything as much as they possibly can. I think you're right about the Swiss-Prot thing and generally this is mostly linking within NCBI, although we'll start to look when we get into more links, for example, like in Entrez Gene, they link a ton of things and that's one of the things I love about Entrez Gene is they try to take advantage of the wealth of other resources that David alluded to this morning, but I think this is mostly just a function of these things are all kind of defaults they're making cross references for. The note field could have nothing in it or it could have a lot of things and I think it's just a case they don't actively try to make hard links in the note fields. You might, I might be wrong and we might find some other note fields where they've done it. I just don't think they really bother.

[Class Participant Speaking].

Yeah, I think for the others the process of handling all of those other parts is much more structured. I think you're right about that. This was 756 amino acids so it should match. Why they cut it off there, what, you know, whether that's the stop codon for something or for something else, you would hope to get at from this feature. Like so I'm not sure if this was an entire, this is a good example. Remember we were looking at that frozen record where she had taken an image of it and it didn't have all of this stuff in it, because at the time they hadn't I guess finished, they hadn't done all of this extra work about what regions perform what ways, so this is, you know, an example, so now you have all of these different binding sites and conserved domains they found within it, so that initial record we showed you was early on in the development of this, now it appears that they know a lot more about these different parts of it. So going back to your question about, you know, why did it stop there? I'm not really sure whether it stopped there because they thought that was a significant stopping point and so what I would look in here for is what stops at that point, and you see it says here that it thinks that that whole length is this protein, okay? So maybe that is why they stop there, because the length of the protein in terms of that reading frame or whatever was that, but you see there's all of these other things that are kind of different lengths and different regions within that, so I think I'd look at this and see what they were thinking when they put it in there and whether any of them give me any indication. Sometimes I think it's just a convenience thing honestly. Like whatever sequence length they end up with, with that, they put it in there even if they haven't explained what every single piece of it is yet.

[Class Participant Speaking].

Thanks for clarifying that. Okay, yes?

[Class Participant Speaking].

So a conserved domain within a protein is a piece of, it's a piece of it that has carried through evolutionarily, and they either know what it relates to or they just know that it carries through, so the conserved part of the thing is just that it's been carried through or it's been conserved over generations or across organisms either from an evolutionary point of view or from, I guess I'm trying to think of a word I would use for that, so that's part of it, and then the domain part of it just means, I'm trying to think of the unit term that you guys used. The exact definition that David used this morning I'd have to go back and look.

[Class Participant Speaking].


So this is the link out from that particular conserved domain, so this particular function of being a binding site. The other one, I'm just going to go back and open the other one. This is the mismatch one that you were talking about, right? The mutL. And so this should look very familiar from what you guys saw this morning, the multiple sequence alignment, so that you can see kind of how things have carried through. This Conserved Domain Database is one of the few places where you'll actually see multiple sequence alignments in NCBI tools. I think one of you asked on your introduction sheet, how do we help people with multiple sequence alignments? We probably won't show you how to use it today, but when we introduce later on the tools from EMBL that David was talking about this morning, they have a tool which is frequently used for multiple alignments, and that's probably the most common one unless people are licensing other tools that they use. In terms of free tools, that's the one that I use the most.

Because I think other than the conserved domains and the PopSet, if any of you are familiar with the PopSet, which is the population data set of these resources. Like if you had samples, I'm looking at our AIDS vaccine group, if you look at some of the population data sets that have been submitted to GenBank, it's like 25 people all with virus, and these are the viruses from each of the people, but they tie them together as a population so you can see the variation in the virus. There's a lot of these in here for herpes virus, for HIV and a lot of different groups. That PopSet also has a multiple alignment group, because the whole point of the population set is to look at variation or similarity across the group of different viruses, so those are really the only tools that have that kind of thing built in, I think. They did add a new one.

[Class Participant Speaking].

Oh, the BLAST link does that as well. You'll show that tomorrow, right?


But you can only view that for what's already here. If you had your own sequences that weren't already in here that your colleagues wanted to compare against each other, you would need one of these external tools.

Okay. Let's go ahead, I think, and look at the RefSeq record and then we'll take our break. I'm sorry we're getting a little bit off schedule here. Is everybody doing okay?

Okay, trust me, once you get past this part the rest of it will just kind of flow from there. So what we're going to do now: we've looked at the scientist-submitted archival records, and now we'll look at the curated records that NCBI has pulled together, and I'm just going to go, I think, live with the record anyway. Did we get it? Okay, so, if it's a curated record that's bringing together a lot of things from a lot of different people's archival records, what would you expect to see more of in one of these reference sequence records?

[Class Participant Speaking].

Yeah, you might expect to see more version information, although the RefSeq in and of itself is a new record bringing together all of the other records, so it may not have that many versions itself because it's bringing everything else together. But in this case you're right. We do have multiple versions. What would you expect the characteristics of a reference sequence record to be? Yeah, Cara?

[Class Participant Speaking].

Yeah, you'd expect to see a lot of references, and you just can scroll down and scroll down and scroll down, and this one actually doesn't have, you know, a huge number. I think we only have maybe 15 references. There will be some of them that have hundreds and hundreds of records. But you're right. The average archival record is probably just going to have a couple of papers, and that's if the people are really on top of their publication work. What else would you expect about the record? Yeah, you'd expect to have a little bit more controlled vocabulary, agreement between some of the things. So you see here, you have mRNA in the line and in the place where you choose what kind of biomolecule it is, and you get the short name of the gene as well as the long name. It's in the human organism, and generally you'll have a longer base pair length because they will have put together different sections, although that's not always the case. Sometimes the best record that they're building off of to begin with will already have the full length of it. But yeah, you just get a lot more information, and then because it's a reference sequence record, the comments field has a lot of additional information that you wouldn't typically get for an archival record. So you get the status. Remember we talked about the various reference sequence record statuses, so this one is what?

[Class Participant Speaking].


And it should tell you which archival records they built it off of, so in this case they built it off of these three records, going back to what Annie was saying. So they built it off these three, and then they actually updated this from GIs, or GenBank identifiers. Each record that gets put in has a GenBank identifier, so there was already a GenBank identifier for this, and then when they made this record, this one is more advanced. You get a summary written by the people that curated it about what's going on here, and they mention that this is not all of the available publications for this gene. But we're in the comments field, so going back to what was said earlier, it's not linking you to Entrez Gene for this in the comment field. Where should we be getting a link to Entrez Gene for this? Where would the link probably be? Yeah, there should be one here under the links. Remember we showed you that list of links for the first archival record? This list of links is much broader, so you have Gene as the first choice because everybody's all excited about Gene now, right? But you have a bunch of other things, and we'll explain a lot of these over the coming three days, so I'm not going to go over them one by one now. But at the end of the three days, if there are things on here you still don't know what they are, let's talk about it, because one of the things that you might run into is that somebody is looking at a record and they're like, well, what's that, or what's that? And when these lists start to get really long, you might know what 99% of them are, but that last thing that you didn't know will be the thing they're asking about, so it's good to know what those are. I think you will after these days. And then if you wanted to order complementary DNA of this because you were going to create your own RNAs, you could order the cDNA clone and then do work on this on your own. Okay?

And I don't know why they decided that that link was important enough to pull out, but since they're trying to facilitate fresh research off of previous other stuff I guess that's why they promote that. The next time you come here, they may not be highlighting that piece of it anymore. What they actually show and don't show here kind of seems to fluctuate at somebody's will that's at a different level than mine. This record was also very recently touched.

Okay. And so remember we talked about exons and introns before? You can see there's lots of different exons, and Splign is a tool we won't teach you about, but it is a tool that they offer, and so you'll start to see things in here that are other NCBI tools but they won't necessarily be connected to those.

And then you can see that the sequence is pretty extensive, and lots of cross, all of these are database cross-references to the other tools. Okay, so let's look quickly at the protein record for this and then we will be ready for a break.

So we'll just go to the live protein record. Sometimes I have to orient myself because I forget which of the many links I followed out of, so sometimes I kind of eyeball myself back up here. Yeah, I'm really in protein, and again, you can quickly see that if you're seeing amino acids instead of base pairs you know you're in protein. You have new options over here: the BLAST link, BLink, that stands for BLAST link, and you'll be becoming more aware of that tomorrow, and then the conserved domains we already talked about a bit, and then lots of other links.

Different links because you're in the protein database now. Although you still see Gene up top, you now have a lot of other protein-related things that you didn't necessarily have before, and the protein record also links back to the RefSeq record from the nucleotides that we were just looking at. That seem pretty straightforward to everybody?

Okay, so let's go ahead and stop and take a 10 minute break, and at 3:15, --

Okay, so welcome back from break. I realize there was one last slide I hadn't shown you, it was this sample slide of a Swiss-Prot record, but I think at this point you can just see that it's structured differently and we're not going to worry about that right now. I'm just going to scroll down to the end of it. A lot of it is very similar, and what we're going to do now is give you time to do the exercises. One of the questions that was raised over the break, that I really hope to put in more context for everybody after you do these exercises, is just what databases we were looking at when we looked at all of these various records. One of the most common questions I get is from people who are from other countries and used to using DDBJ or the other groups besides GenBank that are in the collaboration. They're used to using those tools, and they come to a U.S. lab and they say, I want to search GenBank, and they're looking for how to search GenBank and they aren't getting it. It's because GenBank is an overarching thing that's comprised of nucleotides and proteins, so they really can't just search GenBank. They have to pick one or the other. They have to choose to search in proteins or they have to choose to search in nucleotides, from that perspective.
So, GenBank overall, you know, is split into these two searching entities, and after you do these exercises, when we come into the section this afternoon about searching Entrez, we'll show you not just Entrez protein and nucleotide but that there's like another 20-some databases that fall under the Entrez search system structure. So we talked about Entrez Cancer Chromosomes and Entrez Structure and all of these other databases that are all interconnected. They all use the same search backbone, but the kinds of data that are in each one of these little data pockets is different. So going back to the reference work that you were talking about and the kind of data domains that David presented this morning, the first question for the person is, well, what kind of data are they actually looking for? What kind of data they're trying to look for will drive which of these various Entrez pockets you go into, basically.

And then this issue we were talking about, archival vs. curated: both records are together within Entrez nucleotide, so both kinds of records live there. When we do the searching I'll show you that you can either limit your search, just saying I want the RefSeq part, or it will present you with a little tab to say, you know, I did my search and I just want to see this part of it. So they're both parts of the same whole, and you can choose to focus in on one side or the other, or search the whole combination that makes up nucleotides or proteins. Does that put it more in a context a little bit?
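The two points above (you pick nucleotide or protein rather than "GenBank", and you can limit a search to the RefSeq part) can be sketched in a few lines. This is only an illustration that builds a query URL for NCBI's E-utilities ESearch service without sending it. The gene term, the `srcdb_refseq[PROP]` filter string, and the choice to accept only two database names are assumptions made for the example, not something shown in class.

```python
from urllib.parse import urlencode

# Base address of the ESearch service (part of NCBI E-utilities).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Simplification for this sketch: only the two sequence databases discussed here.
SEQUENCE_DATABASES = ("nucleotide", "protein")


def build_search_url(term, db="nucleotide", refseq_only=False):
    """Build (but do not send) a search URL for one Entrez sequence database."""
    if db not in SEQUENCE_DATABASES:
        # "genbank" is not a searchable database name; you have to pick one side.
        raise ValueError(f"choose one of {SEQUENCE_DATABASES}, not {db!r}")
    if refseq_only:
        # Assumed filter: restrict results to curated reference sequences.
        term = f"({term}) AND srcdb_refseq[PROP]"
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    # Hypothetical search term, for illustration only.
    print(build_search_url("colon cancer AND human[Organism]", refseq_only=True))
```

Leaving `refseq_only` off searches the whole combination of archival and curated records in the chosen database.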

Okay, so for these exercises, there's this kind of exercise for each of the four. Everything has, these are numbers that appear up there, so just go through each exercise and you can do them individually or you can do them together. I really don't care how you do them. We can also, if you really don't want to do them yourself, we could do them together as a group if you want to do them in the shout out way that's fine too.

[Class Participant Speaking].

You guys want to do them together? Okay, and you can go ahead and do them on your machine as well. Number 5 is that question I was talking about before, that I didn't want you to get all crazy about, because we will look at what the system tells us, but the right answer to that is we must ask NCBI's information at NCBI@nih.gov, okay?

So this first record, do you know just from looking at this what this might be in terms of what kind of sequence it is? Hearing a lot of different answers, and so --

[Class Participant Speaking].

Okay, I'm going to go to the NCBI homepage in terms of our searching for a second, so everybody is looking at this record. Do we know without searching for it what kind of record it is? Not unless we remembered all of those rules about numbers that don't exist, so let's go ahead and just go into it. You have several options. This is NCBI's homepage. One thing you could do is search all databases for the number and see what happens. You could choose to search in the one that you think is the most likely guess and see what happens. So, if you really are just confronted with a number and have no idea, you can approach it that way. I did click on "Go", right?

I thought I did. Okay, so, when you use that all-databases function, it does carry you into the Entrez global query. Has everybody seen that before? If you don't know at all which of the Entrez databases you want, you can start here and see what happens. The fact that we get results from only a few of these tells me that there's a record here for core nucleotide, so maybe it's from that. And in fact, so I followed that link. Okay, so now we've got the short version of the record. Can we tell from the short version what kind of molecule it is? mRNA, which I think somebody said to begin with, so somebody is really advanced or really far ahead or made a very good guess or all of the above. So you've got it in two places. You've got it up here and you've got it in the line. Okay, what was the next question that was asked on the slide?

What else are we being asked?

Okay, is there an older version of this record? What do you think? Yes. There is, because we have multiple versions. I think my light pen is dying on me, as we speak. Someone put something in the keywords field for us. Yeah! At least we got something. We have multiple version numbers. If we wanted to see the deal with these version changes, where would we go? Right. Up to reports and look at the revision history. Am I sharing? I am sharing, yeah? Okay. How do I get back to where I was with the sharing? We'll learn this software by the end of the three days. Reports, revision history. We have lots of revisions to this record. But you see that some of the revisions gave it a new version number, some of them didn't. There's a lot of detailed information about what constitutes being worth giving it a new version number or not, that we can read. Basically, it's only when the sequence changes that you get a new identifier. These two, they were just changes. At this level they gave it a new GenBank identifier. It's not like every time you fix a typo you get a new number. That would be -- [ Speaker Faint/Audio Unclear ]. Okay. In terms of the answer to the other question, what question do we have? Just the question about when it was released? Which lab? Okay. All right. You said you had a question, is it related?
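To make the versioning idea concrete, here is a minimal sketch assuming the usual "accession.version" form, where the number after the dot goes up only when the sequence itself changes. The accession used in the example is made up for illustration and is not the record from the exercise.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version as an int)."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not version.isdigit():
        raise ValueError(f"no version suffix in {identifier!r}")
    return accession, int(version)


def sequence_changed(older, newer):
    """True when two identifiers for the same record carry different versions."""
    acc_old, ver_old = split_accession_version(older)
    acc_new, ver_new = split_accession_version(newer)
    if acc_old != acc_new:
        raise ValueError("identifiers belong to different records")
    return ver_new != ver_old


if __name__ == "__main__":
    # Hypothetical accession, for illustration only.
    print(split_accession_version("AB000000.2"))
    print(sequence_changed("AB000000.1", "AB000000.2"))
```

A revision that only fixes an annotation would leave both identifiers the same, so `sequence_changed` would return False for it.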

[ Class Participant Speaking ].

Okay. Let's see, what does it say? The comment, let's jump down to the comment section. This is based on what the humans typed in. What is in the comment field versus the history, versus what is on file at NCBI, might not all agree. We had other updates, I think, since this date that they didn't bother to type into the comments field. They stopped using the comments field in this way, that happens a lot. When they find themselves repeatedly doing something manually they will make a tool. They're really good about that. If you find yourself doing something all of the time with NCBI that is manual or tedious, or you wish there was an easier way, if you email them and tell them, they will probably find a way to make it easier for you. If you are doing it all of the time, the likelihood that others are doing it is probably high too. They're responsive to email. They've said, send us questions that you run into. They do a really good job with that. You said the other question was about the lab. Whose lab is this? Yeah, [ Indiscernible ] research institute, right. Yeah. Okay. Did we answer all of the questions? Okay. Yeah, we're finished with something! Okay. We just need to do the same thing for a couple more records. We'll graduate you as being up to date on how to decipher these records. Okay, so let's follow this record out to Entrez. What kind of sequence is it? You will probably get to it before I do. Yeah, it's a protein. What else did it want to know? How long is it? Amino acids, okay. What is next?

[ Class Participant Speaking ].

What does the DB source field tell you? Here is the DB source field. Right. What often happens with protein records, David alluded to it this morning. Remember he was talking about how many proteins are done experimentally as proteins, versus how many are done as translations of the nucleotide records. Does everybody remember that, vaguely? This is what we're talking about. The database source of this, this came from a translation of the mRNA record. It took me from protein to being in nucleotide. Do you see that jump? This was a record where the source of the protein information was from the mRNA record to begin with. You might have other records in the protein database that were brought in from Swiss-Prot or other protein tools, so the database source might be one of those named protein databases. A lot of what you will see now is carryover from those. Did you say something? You keep nodding. Claudia, what else? You seem to be like -- what lab, okay. Is this the same lab? It's the same lab. Okay. All right. Yeah.

[ Class Participant Speaking ].

Yes. Okay. The submitter, that's an interesting question. The submitter and the paper authors might be different people. You know how sometimes the last author is the head of a lab? The likelihood is that the submission account for these also belongs to the head of the lab; that might well be the case, even though this first person might be the postdoc that did the work. The submitter account might not be the postdoc. I have no idea what the actual situation is here. I'm just saying it may not be that each individual person in the lab is doing person-based submissions. A couple of people in the lab have Sequin set up, and they're submitting this, so this person is the submitter even though somebody else did the technical work. Could be who has the grant money. I don't know any of this for this either, okay.

Okay. Moving right along. Okay. This is the same record or a different record? I'm starting to lose my sense of what is -- is it the one we just did? Okay.

[ Class Participant Speaking ].

Let me close this. We did this one. Okay. All right. Now we've got a new record. The fact that you get this two-letter code with the underscore, what have you been seeing those numbers with? What kind of records have you been seeing? Yeah, with the curated reference sequence records. Okay. Let's see, what is the question set here? The same question set that we've been seeing? Okay. What kind of sequence? mRNA, great. Several ways to find the answer. Where did we find the answer? In the top line for the [ Indiscernible ] type, okay. Where else? In the definition line, sure.

[ Class Participant Speaking ].

That's the code they used for those, yeah. Sometimes you can tell by the prefix of the number what it was. What else do we have? I've scrolled down to help you. In the features, in the features key, in the source you have [ Indiscernible ] type. That's enough ways to identify. We've exhausted ourselves here. Okay. All right.
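As a rough illustration of reading the record type off the prefix, here is a small lookup for a handful of common RefSeq prefixes. The table is deliberately partial, the function name is made up for this sketch, and the accession in the example is a placeholder rather than a real record.

```python
# Partial table of RefSeq accession prefixes (two letters plus an underscore).
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "mRNA (predicted model)",
    "XR_": "non-coding RNA (predicted model)",
    "XP_": "protein (predicted model)",
    "NC_": "genomic (complete molecule)",
    "NG_": "genomic region",
}


def describe_accession(accession):
    """Return the molecule type implied by a RefSeq-style prefix, or None."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)


if __name__ == "__main__":
    # Placeholder accession, for illustration only.
    print(describe_accession("NM_000000"))
```

When the function returns None the identifier does not follow this pattern, and the dependable fallback is the one used in class: open the record and read the top line, the definition line, and the source feature.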

This gene is involved in what? Maybe we should do the second part first, what organism was this? I'm hearing two things. The Homo sapiens and the [ Indiscernible ]. This is the example that I was talking about before. It is from the samples from a human. You see that in a couple of places. In the [ Indiscernible ] one you see this gene is a [ Indiscernible ] first identified. They've got that in the naming of it. That's where you might find yourself a little confused about what you were seeing. This is from human, definitely. The other question was what processes is this related to, is that right? Circadian rhythms, anything else? Where is the circadian rhythm? Sometimes it's in the title of the article. To be fair, this is why we have these gene references into function. You know that how people title their articles is often very constrained by the space. The reference into function, which describes what they think that the gene does in that article, might be more accurate than the title, depending on what you have available. Okay? It might be more specific too. In the title you have a certain amount of space to make your point. Yes?

[ Class Participant Speaking ].

In the definition line. It says here Homo sapiens period [ Indiscernible ].

[ Class Participant Speaking ].

You do it by what is in the organism field. You have human here. Somebody had said [ Indiscernible ]. You could see why they thought that. When you are not sure from the definition line you jump down to the organism field and get your answer from there.

[ Class Participant Speaking ].

Okay. Sorry, let me go back. The gene was sequenced from what organism? The gene represented in this record was sequenced from human. It does so happen when you are talking about homology this gene in the human is the same as the gene in [ Indiscernible ]. Right. But there will be a record in this database also for the gene in [ Indiscernible ]. If you looked at the sequences they're probably the same. This is the human record. There's a [ Indiscernible ] record, if you searched and limited your search you would find the [ Indiscernible ] record. I think that does confuse people all of the time. They do plain text word searches, they see the mouse this, and the human that. They're like which way is it? Okay. What lab submitted this record? Please say it's a different lab. Is it a different lab? No. Is this not the same record, again? It keeps opening that record, I don't know what -- yeah, I think I have a lot of windows open. This is the record? That's the record, okay. Did we have a submitter? That's a good question. They're bringing together, remember we talked about the comments field. They were bringing together a lot of records. Whether some of the records, yeah, you might be popping out. This record was not submitted by a lab. It was brought together by NCBI staff by one of the organism specific groups that does curating for NCBI. You are right, unless they had as a reference here something about the submitting, which normally they wouldn't because they would be literature references. When you say what reference was it derived from, if you follow these out maybe one would be a submitted record. It's kind of a trick question.

[ Class Participant Speaking ].

Really? That doesn't surprise me. At this point I'm beginning to think that everything returns to [ Indiscernible ]. Okay. Let me open this again. What was the gene symbol for this gene? PER2. Where did we find that in the record? Right. So the best place is under features. Remember these little jump-downs? I'm going to use the features jump-down here. The concrete way is to see the link here. That's the most, kind of, I'm absolutely sure, that's what they're labeling this section as. Just based on the way that people normally write things, the fact that they have it up here like this as the parenthetical, that's the gene they're talking about. You are right, the best place is to see if they had annotated the feature of the gene and find it there. You will see stuff up in here where you might see three different gene names and different things and not be sure which they were talking about. Okay? You can also see in the gene references into function that they name the genes they're talking about. Okay.

What do we have left? What is the base span of the coding sequence? And what is the accession number of the protein record? Let's do the accession number. Where would we get that? This accession number, that's a good question. This number should be for the proteins, excuse me, for the mRNA. I'm not sure if this was meant to be. Let's see what this is. Yeah, normally what I do is go down to where we have the protein and go to the protein record down here. I don't know what that XP thing is. I'll search on it in a minute. I jump down to where they have the protein translation and look for the accession number. Or sometimes, if I want the protein, I will use the link and pop over to the direct protein link from there. Let me see what this number is, I don't know what this was. Okay. There we go. Remember we talked about how, if your record was replaced, they would have a big red thing. This was a previous record that had been replaced by the record that we had been looking at. It was just to help with cross-referencing.

[ Class Participant Speaking ].

Absolutely. You should also have the protein ID number here, which is what we were talking about before. Right before you have the translation, you have the protein ID link. What was the other question? What is the length of the coding sequence? David, you want to add something?

[ Class Participant Speaking ].

I'm going to repeat what he was saying. The other record that we found, that we said, well, this is no longer an active record, it's a record from a lab that was interested in a different function of this particular gene. That's what this submission is related to. Going back to finishing up our exercises. The other question, what is the span of the coding sequence? Coding sequence is abbreviated how? CDS. Right, what is the span for this? Yeah. 238 to 4005. Um, you do see they have some other pieces of it that they're talking about, exons. The question we had was about the coding region. We're jumping down to this part of it. Okay. Does everybody feel good about the fact that you could complete all of the exercises together? I think everybody is like, it's only an hour until we get to leave. [ Laughter ] Okay, that was the last exercise, right? No, there's one more. Oh, man, you guys let me say that even though you knew? Let's bear through.
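The 238 to 4005 span can be turned into a length with a little arithmetic. The sketch below assumes the two positions are 1-based and inclusive, and that the CDS is complete and ends with a stop codon, which is the usual case for an annotated coding sequence but is not stated in the exercise. The function name is made up for this illustration.

```python
def cds_summary(start, end):
    """Summarize a CDS span given 1-based inclusive start and end positions."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    length_nt = end - start + 1  # inclusive span, so add one
    if length_nt % 3 != 0:
        raise ValueError("span is not a whole number of codons")
    codons = length_nt // 3
    return {
        "nucleotides": length_nt,
        "codons": codons,
        # Assumption: the final codon is a stop codon and is not translated.
        "amino_acids": codons - 1,
    }


if __name__ == "__main__":
    print(cds_summary(238, 4005))
    # {'nucleotides': 3768, 'codons': 1256, 'amino_acids': 1255}
```

Under those assumptions the span covers 3768 bases, which is 1256 codons, or 1255 amino acids once the stop codon is set aside.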

What kind of record is this? You should know from the accession number. It's a protein. What kind of protein? Curated. At this point I will not force you to tell me how you came up with that answer. We've got that nailed, I think. Now the question is, what was the question? What chromosome? Okay. Where do we get chromosome information? I've jumped down into the features. What chromosome is it? 2, okay. And let's be more specific. Where on chromosome 2 is it in terms of the chromosome map location? Right. q37.3. On Friday morning when we do maps you will be able to localize this in a very beautiful way. What else do we need to answer? Are there any conserved domains? So, what is another term they would use here for conserved domains, potentially, in the features? [ Indiscernible ] or coding regions. Every domain should be in a coding region, that's true. It's more specific than the coding regions. You have CDSs. You have several sites. You also have these areas called regions that they have [ Indiscernible ] cross-referenced to. This might be a case where you would have these regions. The regions will be particular sections that have their own subrecords. Then you will have these sites. Okay? So because these RefSeq records are so long and complicated, if you just want to look at small sections of them it will let you jump off to look at the smaller sections. Let me follow this out to show you. This is the region record. This is the PAS domain. This is the region record. If you go back below it you can see there's also a site record tied to the same cross-reference. There's a region record and a site record for the same thing. There's an active site within that particular region. We'll go over conserved domains a little bit more. For now don't worry about it. We'll come back to that again at a later time.

Anything else? What can you deduce about the related mRNA? At this point I'm going to say to you if you are going to revisit these modules there's also this lovely section here called answers. If you're sitting here like I am looking at this and going there's many things I do deduce about this, you can always feel free to go over there. Let me go back here for a second. Let's see if there's something to come up with off of the top of our heads about the mRNAs that we could get from this.

I'm looking to see. I can't remember what it was she was trying to get at.

[ Class Participant Speaking ].

Yeah. Yeah. You went, okay.

[ Class Participant Speaking ].

You went over to that to see what was going on? Right. Okay. That's a strategy that you can do. Claudia?

[ Class Participant Speaking ].

You think that from the comments that they were talking about here.

[ Class Participant Speaking ].

It's hard, this is relating to what the sites within the domain do. This is where I might kick it back to the discussion we had about mRNA this morning. And which parts, actually, I think part of what she's getting at, I'm going to look at her answer with you in a minute. She's trying to get at the fact that the mRNA codes for lots of things, and there are sites within it. Whether there's something going on with that, that's the part I'm not sure about. Let me just see.

[ Class Participant Speaking ].

Really, it was that simple? You mean we were working all that hard for something that simple? I thought they really meant biologically speaking. Sometimes the questions, if they seem really hard, they're probably not. David, you should save me from those moments, occasionally. When we get to day three, there's some where I think the way they're worded makes it sound like they're trying to get you to do science, and they're really not. If you find yourself doing that, remember that normally when you do this you'll be sitting with a scientist, and they will tell you what they meant. That really was that easy, huh? Okay. All right. Yeah, now that really was it, okay.

We've already been through all of this other stuff. What we will do now is go to the Entrez searching module. Unless you are really, I can't use this laptop and I don't want to, I think you should do a lot of these things on your own computers. Stop and take a minute. We're going to go over a lot of things. Depending on how familiar you are with certain features of Entrez, some things will be new, some things you have seen a thousand times before. Let me just apologize in advance to anybody whose skills I make fun of when I go over this. We had fun with this. If anybody has ever searched the literature for anything, this is really, really basic. Let me just apologize in advance. We'll move through those quickly. Let me close out some of this other stuff so I don't get completely confused.

So we've already addressed a little bit about Entrez's background. It cuts across a lot of kinds of data. It's a search system that underlies the PubMed search and the PubMed database. It's one collection of data that lives under the Entrez umbrella. They divide it up by the kind of data it is. You normally don't just search globally. You have to pick the data that you have and focus. We did end up using this global Entrez. For those that have been around, you know this global Entrez search is a pretty new development. It's only been a year and a half or maybe two since they rolled this out. Prior to this there was no way to search across all of the Entrez databases. You had to go into one and link to the others. Or go into one and change the database to pull the different ones. In a lot of ways this is a great search. I'll say a few things about the positives to it, and some of the negatives to it. They have the text-based grouping. This is the area where librarians have lived and feel comfortable with it. [ Indiscernible ], [ Indiscernible ], in animals, things like the site search. This is text land. Now it's text land with links to all of the nontext [ Indiscernible ]. You know what I mean? These will link you to everything else, it's just that's the world of text. This middle cluster is the kind of group of domains, I don't want to use the word domain. Domain now has a specific meaning that we know. Let me find a new word to use. These are all of the databases that represent particular kinds of data. Some of them we've spent a lot of time with. You notice these are not in any order. People get to know where the one is that they use the most often. As they add new resources to this they don't always add them in the same order. dbGaP is new, it's one that I'm excited about. The Framingham studies, those studies are deposited into this.
This is a new database. It's been publicly available for less than a year, probably. It's all the way up here on top, above something like the Conserved Domain Database, that's been around for several years. There's no rhyme or reason. Just when you get used to that mental map of where something is, it's not in the same place that you remembered it. It's exciting, you know, when they release a new thing. Sometimes you've not heard anything about the new thing, and all of a sudden the new thing is there. It's a fun discovery tool. You can see some of it is very specific. The gene expression atlas of mouse central nervous system. Very specific, right? Then you have things that are very, very, very general. Like every expressed sequence tag that was part of the human genome mapping process.

[ Class Participant Speaking ].

Yeah. Mouse, and a lot of other things. You go from the totally specific content to something that covers almost all organisms, almost all of this or that. Some of the other ones, like the cancer chromosomes, those of you that do a lot with cancer research, you should have a look. Depending on what area your scientists are in, it may behoove you to take a look. As you scroll down there's the reference set, which is just MeSH, the catalog and the journals database. Because they fall under the Entrez system, they're here.

Okay. Basically, because it's trying to do a search across all of these data types, the most effective approaches for these tend to be text words. The name of a gene, that could be the short symbol or the full name. General text words, you know. You won't get good results plugging in sequences and things like that. This is designed to cut across all of the things. If you had a particular sequence and you knew the sequence, you should pick the tool that fits that, generally.

We'll do a couple of searches to see how it processes them, in just a minute. I will leave this open for now. And go back and see what we have. Um, the Entrez data model. I think most of you have seen this. Let me open it just in case. This shows how all of the databases are interrelated. You mouse over one and it highlights everything it links to and how many links there are. If you're trying to get your head around the big picture, this data model may or may not be useful to you. Are there questions about the data model? It's not up to date. If you saw numbers about the links, and you were like, that's not true, you are probably right. It's not live and updated each time they add a record, you know, to change the records. The biggest thing about Entrez is hard links and neighbors. People really need to use the hard links as ways to make connections between the various Entrez databases. People will be confused about the difference between the hard links and the neighbors and want to not use one versus the other. Although there are four ways to search here, you could push those into about one and a half ways to search, when you are talking about experienced researchers like all of you. This is the global query slide that we're talking about. One of the benefits of the global query is you could have no idea which database is the right one for you. If you search on something, I'm going to search on human colon cancer. That's an example we keep having. You can quickly see which of these, you know, which of these databases has records about human colon cancer. You can either jump to the database or say I'm going to focus my energy on some of these particular databases that have more information. You see they have this thing that says if the results are in gray, that indicates that one or more of the terms was not found. Just like PubMed will break up your search if it doesn't find the whole phrase.
This too will break it up, and break it up further. If it doesn't find anything it will say I found something about humans and something about cancer, but nothing about colon. How it's processing it depends on each of the databases.
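A rough way to picture the global query is as the same term sent to each database in turn, asking only for a count. The sketch below builds one count-only request URL per database with the public E-utilities ESearch service. The database names in the list are assumptions chosen for illustration, not the full set shown on the global query page, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed database names for a handful of Entrez databases (illustrative only).
DATABASES = ["pubmed", "nuccore", "protein", "structure", "taxonomy", "popset"]


def count_url(db, term):
    """Build an ESearch URL that asks only for the number of matching records."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "rettype": "count"})


def global_query_urls(term, databases=DATABASES):
    """Return one count-only URL per database for the same search term."""
    return {db: count_url(db, term) for db in databases}


urls = global_query_urls("human colon cancer")
```

Each database still processes the term in its own way, so the counts that come back are not comparable in any strict sense. They only tell you where it may be worth looking first.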

Some of the numbers may surprise you. Some of the numbers may not surprise you. Um, are any of these numbers surprising to anybody?

[ Class Participant Speaking ].

I'm just looking to see if we had any grayed-out numbers. We don't really have any. What are some of the things that seem in line with what you would expect? What databases would you expect to have information about human colon cancer? PubMed, for sure. Clearly there's lots of papers. Probably none of this up here surprises you. What about down here? What databases would you expect?

[ Class Participant Speaking ].

Right. Human colon cancer is not in and of itself an organism. It doesn't use the fact that the human colon cancer was in human as organism, yeah. If I had done a search on human in the organism field and colon cancer it would fail in taxonomy. That's a good question.

[ Class Participant Speaking ].

No, it's not. That's a good question. Does that mean there's no organisms involved? No, it means the records for the organisms in the taxonomy piece of it don't say the words colon cancer. It doesn't mean that within this database the records about colon cancer aren't in humans, it just means that database as a stand-alone doesn't have the three words in a record. Does that make sense, Carol? Let's look at these two. You would think there would be more than two population sets. I think there's two pieces. One, for it to be in the pop set database the submitters have to submit it as a population. So somebody who is doing research on colon cancer, and is just working with one colon cancer cell line in their lab, that's not a population study submission. Even though colon cancer affects a large part of the population and there's lots of studies of large groups of people, um, it's not going to pull it that way. These are the two. It's based on the set. Each of these sets might have lots and lots of different samples in them. You might have a thousand samples in one set. This is saying there's two sets. That's a fair question. You would expect to see something about colon cancer in the cancer chromosomes, right? Which you do see. You would expect DNA, mRNA records, right? Would it surprise you there's only two three-dimensional structures? No. Yeah. You have 700 proteins, but only two 3D structures. That doesn't surprise me at all. Part of it might be a retrieval issue. The reality of it is there's not that many structures in here, for sure. You just move along. Part of this is just in terms of understanding whether it functioned well for you or not. Part of it is just your sense of how well the retrieval matches with what you expected, your presearch probability. The other piece is, sometimes weird things happen. Or you don't know anything about the database. You don't know whether to be surprised or not. I'm not surprised by the one in dbGaP.
dbGaP is new and small. I'm not surprised by that. Somebody who has no idea what dbGaP is may be surprised that there's only one result in there. The GEO Profiles is interesting. I wouldn't have thought 96,000 expression profiles, but -- what happens sometimes, I think you have to look at this. Remember I said that it breaks your search up. It's supposed to tell you by marking with gray. Remember? If we look at the results that we got you see the search here, right? If you look at details you do see, you can use details for all of these Entrez databases. It did recognize human was the organism. It did it as an "or." Some of these things, you would sit here and say this is about an H.I.V. binding protein, yes. But the experiment said something about the distal colon. Where you see the words in these various records you may be more or less surprised. If you look at how many, there's a lot of samples in each of these experiments. I was surprised by this 96,000 records. Because cancer is a big area of work, maybe it's not. David, were you surprised? Yeah, yeah. It depends on how familiar you are with the data. There's been plenty of other topics I've searched on where there's nothing. What I'm trying to look at, people are not doing experiments on them. Or they don't express well. At any rate, becoming a little bit familiar with these will be helpful. This is the overarching cross searching. For each one of these you can go directly into the particular database. Each will have its own limits. Each will have its own set of cross linking. Not all of the NCBI groups, let me see, everybody at NCBI works on a different database. They make improvements in structures and visualization and help tools all around their group. It all fits into the Entrez framework. The way that you are used to visualizing something in Entrez Gene, the other people don't have to set up their records in the same way. It might look really different to you. Claudia?

[ Class Participant Speaking ].

Okay. A substance name for each of -- actually none of the MeSH here is [ Indiscernible ]. They're all protein names. Which makes sense. It's not doing the cross referencing, you know. It's a preference issue. Did you have a question?

[ Class Participant Speaking ].

Okay. The question, when looking at this list of databases, does it actually indicate in any way which of these are active or not in terms of continuing to accept records? My gut response is that everything that is on this list is active in accepting records. I actually can't think off the top of my head of anything that they've closed that they haven't transferred to something else. There used to be a database called LocusLink. It was a precursor to Entrez Gene. For a while they were running both. When they swapped it over, LocusLink went away. Some databases get less action than others. One is all about visualizing cancer chromosome break points and things like that. I'm always on researchers who are doing that kind of work to submit their data. Part of it is finding out what kind of data your people generate. There's tons of people doing [ Indiscernible ] experiments. Most of that data is not getting submitted to GEO or any other MIAME-compliant -- MIAME is a standard for presenting that data. They're not getting put out. They're stored in local databases in the lab. It's not deposited in here. GEO is probably not a great kind of cross cutting representation of the research being done out there in terms of expression. Whereas GenBank is a great example of the nucleotide work that is going on. There's other examples, that's the one that comes to mind. Kara?

[ Class Participant Speaking ].

That's a good question. Whether in GEO, whether any of these relate to clinical trials. I think there are probably some small sets in there that are. But I think that, you know, part of the, what is the word? I do think there's a big disconnect between the clinical research enterprise and the data. Basic scientists are used to depositing in these tools. Clinical research, when funded by large groups that have their own constraints, that's why the dbGaP database is so interesting. A lot of large studies are being deposited into that for whole genome studies. Those are some of the first examples of more clinical studies that will be deposited. All of these big federal studies, you know, will go in. You will start to see other federal nonclinical studies that go in, once fully published out. That's a good point, we'll try to make that more apparent.

[ Class Participant Speaking ].

Basically the reason you might see more cancer in the expression, I'm betting, it's easy to see the difference in expression between cancer cells and noncancer cells. Most of the data in here is from more experiments being done in science as opposed to clinical research. Really good questions, everybody.

Okay, so here are some other examples. We did the human colon cancer without quotes. Instead of doing the search again I want to take a minute to read that expression. It relates to picking up organism records even though you know they're related to colon cancer. So in Entrez genome it's saying that even though clearly the human genome has colon cancer, it's indexed at that high level [ Speaker Faint/Audio Unclear ]. A lot of it has to do with the indexing level of the databases. Which is something that you will have no problem looking at and understanding, the depth of indexing. The gene examples.

Let's do my favorite example. Let's see. People do this to do the quick and dirty search. You may find different answers. If I search on PSEN1. This is the gene name. If I want a more comprehensive search then I can put in different, I can do more ors, this is, you should know from looking at this what is going on. Are you surprised there's 4400 records? Yes, you should be horrified, right? It's gray. What do you think happened here with dbGaP? What do you think it dropped off? Do you think it dropped off [ Indiscernible ] or the number 1? Yeah, you see it didn't find [ Indiscernible ], or [ Indiscernible ], but that number 1 was a good search term. That's not what I intended. Okay. That's kind of how it deals with the fact that it can't manage everything that you are offering it. That's just to give you a sense. You should consider your normal good librarian search strategy of multiple names for things. That holds true for this, as well.
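The "multiple names with ors" strategy can be sketched as a small helper that joins alternative names with OR and puts quotes around any name containing a space, so that a trailing "1" is not offered as a search term on its own. The transcript does not record which alternative names were typed, so the names below are an assumed example, and whether a given database honors the phrase quotes is not guaranteed.

```python
def or_query(names):
    """Join alternative names with OR, quoting any name that contains a space."""
    parts = []
    for name in names:
        name = name.strip()
        if not name:
            continue  # skip empty entries
        parts.append(f'"{name}"' if " " in name else name)
    return " OR ".join(parts)


# Assumed example: the gene symbol plus its full name.
query = or_query(["PSEN1", "presenilin 1"])
print(query)
```

If a database still shows a gray count for a query like this, the details view remains the place to check which piece was dropped.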

Let's move through this quickly so you have a chance to do the exercises. The exercises are by far the most fun. Okay. We're going to move right along to do some searching. Let's do first this Entrez core nucleotide search. Let me close the other ones. This is the Entrez nucleotide home page. Overall, you know, every database that NCBI has pretty much has this black bar. Have you seen that? What you see in PubMed might be different. One of the advocacy points, after David's impassioned talk, we're going to say please put Gene on the black bar. What you don't find on the black bar you can search for on the side. You can search in whichever one you want. I caution against doing that with users. I have found if I'm in nucleotide and they're seeing nucleotide flashing in front of them and I decide I'm going to be efficient and go over here and pick whichever one I'm going to search, I have confused people by using this to search one of these other ones. Myself, I do it all of the time. I tend to want to show them how to get to that database first. If they're with me I will do that. At first, if I'm working with somebody new, I show them this list. I try to go to the right database first. I go to [ Indiscernible ] and go. It's an extra step. It helps the user to make the transition. Which tool am I in now? Does that make sense? That's a personal preference issue. I'm just telling you my experience. We were supposed to search for human colon cancer in nucleotides, right? What do we think will happen if we do this? Just from our own experience so far with Entrez and every other [ Indiscernible ] of the world? Without even looking at the details. By the way, you have to choose one of these before you can see the details. These are component databases of Entrez nucleotide. Core nucleotide, [ Indiscernible ] are three component databases. You have to go forward with this step. I will choose the core nucleotide. You selected core nucleotide to begin with, maybe.

[ Class Participant Speaking ].

Yeah, it might. From the slide, I think it had linked you to the core nucleotide. Most will come to it from the nucleotide link on the black bar or from the NCBI home page. That's the generic nucleotide. They will do the search. Do you want core nucleotide? Do you want EST? Do you want GSS? This is actually new, this top organisms display. That's new. And this also, trying to get you over to Gene. They tell you, yeah, this says this same search in Gene, instead of looking at 14,000 you can look at 351. Most would be happy about that. You can limit to the curated records. They offer you a tab to do that, up here. Also the mRNAs. Also this little tool function here. If there are other limits that you want predone, you can use that little tool to do it here. If I decide I'm only going to do research on humans, I can use this tab to set up a human limit. Then I don't have to go into the limits field and pull it down. You can customize those.

[ Class Participant Speaking ].

You have? Yeah. In PubMed you guys do that? Maria is saying if you have a shared My NCBI account you can customize it for your institution. I have just customized it for myself at this point. This is the easy way to go. 331 RefSeq, 351 genes. If you choose your organism, you get the idea about some of these. Okay. We went over the displays a little bit this morning in terms of, we were looking at individual records. Do you remember what they were pulling off of the individual record display, earlier today? The FASTA. If you wanted to get the block of sequence out in a way to use it in another tool, you pull it out in FASTA. Now at this level you have all of the different links. And so every day they add new acronym links. It's something that I'm not interested in, and my users are not. Trying to think which of these I use most. Probably, you know, now that they have the little [ Indiscernible ] search I may not need to use the gene links as much as I used to. These will take you to the hard links. If I want to say, out of these 331, how many of them have links to [ Indiscernible ], [ Indiscernible ], and [ Indiscernible ]? Out of those records there were 594 direct links. Some of the records must have had more than one link. You could have a sequence with multiple genes, right? Or the gene could be associated with multiple conditions. These don't have to be 331 of those. Remember we saw that record that is one RefSeq record but it had multiple PubMed articles linked in it. For every one RefSeq record we might have 15 PubMed records, or one. These are the hard links. You can do it for the group as a whole. We said show me for all 331 records all of the links. You can also select the ones that you are interested in and just take those. Let me just choose this. Let me just pick a couple of these. I'm not picking these for any particular reason.
Only show me the links for those. I'm trying to get it to do it, it's not really wanting to do it. Sometimes when you have gone somewhere before it doesn't want to go back easily. Now I'm saying I want the records that were related to these three that I had checked off. So for those three, there's 18 different records that were linked to them. Sorry, that was multiple steps, because I had already been out to [ Indiscernible ]. It didn't want to push back to [ Indiscernible ]. All right? Okay. The linking from the display menu and from the individual records. The linking from the records is a mix. In this case it's all direct linking. If it were related it would say related sequence, or related structures or something like that. Okay?
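The reason 331 records can yield 594 links, and three checked records can yield 18, is that hard links are one-to-many. A toy illustration follows. The record IDs and the link table are made up for the example and do not come from any Entrez database.

```python
# Hypothetical link table: source record ID -> IDs of directly linked records.
links = {
    "NM_0001": ["G1"],
    "NM_0002": ["G1", "G2"],
    "NM_0003": [],
}


def link_summary(links, selected=None):
    """Return (records considered, total links, distinct linked records)."""
    if selected is None:
        ids = list(links)  # the whole result set
    else:
        ids = [i for i in selected if i in links]  # only the checked records
    total = sum(len(links[i]) for i in ids)
    distinct = {target for i in ids for target in links[i]}
    return len(ids), total, len(distinct)


print(link_summary(links))               # whole set
print(link_summary(links, ["NM_0002"]))  # only one checked record
```

The total can be larger than the number of source records, and the number of distinct linked records can be smaller than the total, because several sources may point at the same target.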

Everybody okay with that? Let's look at the searching a little bit more. The details button functions the same way. Let me pop back for a second. Let me go back to the history and get the previous query. The big giant human colon cancer query. The details and the history work the same way. So one of the things that you need to remember, and I know it was hard for me to make this transition, you know in PubMed you don't want to always use just the MeSH term, you leave these all-fields searches in so you don't miss current stuff. It will not help you here. Everything in here has an organism record already. What you really want is the human organism. There's no value in having the human all fields. If the record is in here, it has an organism on it. You're not waiting for it to be indexed. You can kill that line without worrying that you are cutting off anything. Does that make sense? It's controlled always. There's no period of time in which it's not controlled. There's no value. If what you are looking for is things done in a human organism, there's no value to have that in there. That should cut you down a little bit. You know. Depending on what you are doing, it should cut you down. More than that. At any rate. Don't feel constrained. There's no extra indexing that gets added, that's it. Okay.
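Dropping the all-fields copy of the organism and keeping only the controlled organism field can be written out as a query string. This is a minimal sketch that assumes the `[Organism]` and `[All Fields]` field tags behave as described above; the helper name is hypothetical.

```python
def organism_query(organism, *terms):
    """AND a controlled organism restriction with free-text terms."""
    clauses = [f'"{organism}"[Organism]']            # controlled field only
    clauses += [f"{t}[All Fields]" for t in terms]   # remaining text words
    return " AND ".join(clauses)


query = organism_query("Homo sapiens", "colon", "cancer")
print(query)
```

Compared with a details view that ORs the organism with a `human[All Fields]` clause, this version no longer picks up records that merely mention the word human somewhere in the text.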

Here's a list of some of the fields that you have. The two things that we will look at here in Entrez as we wrap up is that each database has different limits. You should take the databases that you think you will use on a regular basis and familiarize yourself with the limits. This is a list of the searchable fields. Some of these are ones that we're familiar with now, like organism. Other things we'll come back to. The feature key we'll look at. The EC and RN number, things like that you probably already know from PubMed. We'll focus on properties and features, which are things that are more unique to these Entrez databases, when we go into preview index. You can exclude the third party annotations. That's what that TPA is. You can also exclude patents and the sequence tagged sites.

The thing you need to remember, this is for PubMed too, the options you see here are not all of the options that are available. We were working with mRNAs today. There's tons and tons and tons of kinds of RNAs and others that are not listed here. These are the most common ones. If somebody comes in and says I want to limit my search to [ Indiscernible ], you need to use the preview index to pick that. Then gene location, we talked about mitochondrial DNA. You can limit to that. If you are working with plant people, they want things in the chloroplast, you mentioned that earlier. You can say I only want RefSeq. This is the alternative to using the tabs, to use the limits. You can say only show me recent records. If somebody wants to set up their search and say notify me when there's something new related to this, they can get an alert set up for that, too. Um, okay.

Because I've been talking about preview index, let's have a look. This is probably the most confusing thing. I don't know how to improve this right now. Let's look at the properties field in the preview index. I'm going to change it and hit index. You can see what is going on here.

Remember that small [ Indiscernible ] RNA person? You might think the right thing is go into properties and hit S for small [ Indiscernible ]. That won't be how it works. Everything is started by a code that indicates what kind of thing it is. See these SRCs, that means source. Bless you. Our small [ Indiscernible ] is a type of biomolecule. What I'm looking at here, you see all of the different types of biomolecules, that's where I have to go. You can see the numbers of what there are. If I'm looking for small [ Indiscernible ] RNAs, and I want to go by the person indicating that's what it was, I need to come here, go un[ Indiscernible ] and look. That's 700 of them. The tricky part is figuring out the right part that comes to the front of it. That's where the feature key comes in. The slides have a link to where you read up on the feature key. I would click on it, click "and" to add it to the search. Now the reality is I'm not sure there are any. Yeah, there's none. That's how you would do that. It's not that you can't limit your search. It's just that you have to limit your search to it using the preview index. To use the preview index you have to remember that within the properties field there's that group of things called the biomolecules. You look confused. You are okay, okay.
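The idea that every entry in the properties index starts with a code for the kind of thing it is can be sketched with a few sample entries. The entries listed here are assumed examples of what the index may contain, and the `prefix_value` pattern is an assumption drawn from the description above, so check the actual preview index before relying on any particular term.

```python
# Assumed sample of properties-index entries (illustrative, not a full list).
INDEX_ENTRIES = ["biomol_genomic", "biomol_mrna", "srcdb_genbank", "srcdb_refseq"]


def entries_with_prefix(entries, prefix):
    """List the index entries that belong to one group, such as biomol or srcdb."""
    return [e for e in entries if e.startswith(prefix + "_")]


def property_term(prefix, value):
    """Build a properties-field search term from a group code and a value."""
    return f"{prefix}_{value}[Properties]"


print(entries_with_prefix(INDEX_ENTRIES, "biomol"))
print(property_term("biomol", "mrna"))
```

Browsing by prefix first, then picking the value, mirrors what the preview index makes you do by hand.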

[ Class Participant Speaking ].

I think we're pretty much, I think that's a good point. That's probably the most complicated thing. Read the feature key. Look at the feature key and the properties. Get familiar with what things might fall there. That's what I think we need to do in terms of the questions.

We've already done this. We've looked at the limits. Is everyone comfortable with the history? We're going to skip over that. You know how the preview index works. You need to be able to find what you are looking for. What we should do now is go ahead and do -- we've done all of this. We've done the related records. Okay.

This is the slide that talks about what algorithms are used. You will learn BLAST tomorrow and VAST in the afternoon tomorrow a little bit. I think you are already familiar with PubMed. We will talk about how they come up more in the next day or two. We talked about the linking. Here is the mastery exercise. You do this and then you get to go home. Why don't you take, we have a couple of minutes. Take five minutes to read this question and make sense of it. Figure out what strategies you would use to approach this. And we'll kind of do that together in a couple of minutes.

All right. You are doing a request for proposals on circadian rhythms, and you want to study the regulation of genes. You don't know which gene you want. According to this you went to Entrez nucleotide and got 668 references. What might you have done instead of doing that? You might have gone to Entrez Gene and started with circadian. You will find the most studies [ Speaker Faint/Audio Unclear ]. If you are doing a proposal, if you want grant money. [ Speaker Faint/Audio Unclear ]. But absolutely, knowing what is already studied and what has avenues for development is great. If you went to nucleotide and found 668 and there were a lot of redundant ones, what would you do? Did any of you go into nucleotide and type it? You didn't get 668? You might have gotten more. This is a --

[ Class Participant Speaking ].

You probably got more. What would you do if you had a lot of redundant stuff? Focus in on something where people have put in effort to pull it together for you. One strategy is to go to the core nucleotides. What would you get rid of? What wouldn't you get? Anybody remember? Expressed sequence tags, the ESTs. You wouldn't get those. And/or the GSSs, you wouldn't get the things from the big sequencing projects that haven't been looked at a little more. Assuming you want more detailed attention being paid to what is going on than even these core nucleotide references. They could be archival ones or curated. What would you do? Right. Limit to RefSeq. Either by using the limits or the tab. How many does that limit it to? 207. What might you do next? Where did you search? Did you type it in as a text word? If you wanted it to be so important that it showed up in a very prominent place, what might you limit your search to? You might limit it to the title field. Did you try that strategy? What did you get? Ballpark? 7. Going from 200 something to 7. What does that tell you about the title field? What could be going on? It might be too restrictive. What else? Are there other words that people might use? When we talk about falling asleep after lunch do we all talk about our circadian rhythms? What other words do we use? I'm not saying that word on the recording, just kidding. There's lots of other words that we use. People talk about biological clocks. People may not talk about circadian, they might talk about other biorhythm words.
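The balance between a tight field restriction and a broader set of synonyms can be written down as a single query. The sketch below restricts each synonym to the title field and optionally adds a RefSeq restriction. The synonym list is an assumed example, and the `srcdb_refseq[Properties]` term is an assumption about how the RefSeq limit may be expressed as a search term, so verify both against the database before use.

```python
def title_query(words, refseq_only=False):
    """OR together title-field terms, optionally restricted to RefSeq records."""
    titled = [
        f'"{w}"[Title]' if " " in w else f"{w}[Title]"  # quote multi-word phrases
        for w in words
    ]
    query = "(" + " OR ".join(titled) + ")"
    if refseq_only:
        query += " AND srcdb_refseq[Properties]"  # assumed RefSeq filter term
    return query


narrow = title_query(["circadian"], refseq_only=True)
broader = title_query(["circadian", "biological clock"], refseq_only=True)
print(narrow)
print(broader)
```

Comparing the counts for the narrow and broader versions gives a quick feel for how much the title restriction is costing and how much the synonyms buy back.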

[ Class Participant Speaking ].

Yeah. There's some kind of happy balance between how tightly you restrict things with the field and how much you broaden things back out with synonyms. That's the part of this that is an art. You will all know the mechanics by the end of three days of this class. The part that is your judgment, and the judgment of the colleague that you are trying to help, is what is too narrow? What is too broad? There's too much research done on that, or not enough. That's the beautiful fun collaboration. I hope after three days you will feel comfortable enough to sit down with somebody and explore how to make a study broader or narrower. That's it for [ Audio Cutting In and Out ]. [ Relay Event Concluded ]