Event ID: 967291
Event Started: 4/2/2008 8:19:37 AM ET
Please standby for realtime captioned text.
She is right, but the drawbridge is coming up and you guys are on the other side. We want you to take advantage of that fact and get to know each other really well. This is your support network for molecular biology resources. We are here also, and there is a large group of librarians now forming out there. If you go to meetings, you see more and more of this occurring in the libraries. This was a big push from the National Library of Medicine at one point to get medical libraries to be the centers for bioinformatics support on medical school campuses. This is where this whole thing came about from. Unfortunately, due to the budget cuts, it might not go on for much longer. I think we have enough people out there who have a connection and the knowledge to keep it going within that growing cadre of people. If you keep those connections alive and get in touch with each other and with us, we should be able to do it. We are going to have you introduce yourselves so you get to know each other from the beginning. We will introduce ourselves as well. My name is David Osterbur. I am the head of access and public services, which means circulation, interlibrary loan and reference. I do a lot of things at the library. The [ indiscernible ]-There are a lot of exciting things going on and we will talk more about that, probably on Wednesday when we talk about librarian involvement in this whole thing. Countway has its own--We are really pushing in that direction. I have a great job. It is very exciting, the things that are happening there. The Countway is a great place to work. I have a Ph.D. in genetics from the University of California, Berkeley. I was a practicing geneticist for some time and then decided to switch because of frustration with grant funding, another funding problem with the government, and became a librarian because I always liked libraries and always spent a lot of time in libraries as a researcher.
I have been able to put those two things together and it has really worked for me. It keeps me involved in the science. It is not a 24 hour a day job. That is good for me.
[ AUDIO/SPEAKER NOT CLEAR].
You may have to remind us that it is time for a break. I tend to forget and just keep on going.
[ AUDIO/SPEAKER NOT CLEAR].
If we get your questions wrong, please feel free to correct us. The other thing is that I will be sitting back there in the corner. If you have a problem and do not want to bring it to the front of the class' attention,--See if I can figure it out. If it is something that everyone is having a problem with--Tomorrow he is doing [ indiscernible ] modules-[ AUDIO/SPEAKER NOT CLEAR] it is going to be informative. I am thinking it is pretty warm. I will talk to them about that. If there is anything like that--for three days, you will want to be comfortable. Are there any questions? I teach, and I am sure that [ indiscernible ] also, we teach Ph.D.-level classes on how to do this. I tend to go fast because I am used to them understanding what I am saying. If you do not understand or have a question, or I click on something too fast or go somewhere and do something that you did not see me do, raise your hand and let me know that you did not see that. Either the one in the back will help you or I will go back through it and go slower.
[ AUDIO/SPEAKER NOT CLEAR].
I am terrible with names. As long as you sit in the same place every day.
[ AUDIO/SPEAKER NOT CLEAR].
[ LAUGHING ].
I would like to know a little bit about what your biology backgrounds are, too.
[Class participant speaking]..
We don't actually offer a structured set of objectives for this 3-day course. For some people it's way too overwhelming, for others it's not enough. Check in with yourself how you're doing. If there's something we can help you with, or you can help each other with [ Speaker Faint/Audio Unclear ] I will say it's actually easier in my opinion to develop a basic teaching program after these three days than it is to become a [ Speaker Faint/Audio Unclear ] in three days. Because the diversity of what you could be asked to interpret in terms of [ Indiscernible ] is a much broader scheme of things. On day three we'll show you some of the workshops that your colleagues have already developed. [ Speaker Faint/Audio Unclear ]. You will find it easy to see what you feel comfortable with after three days. [ Speaker Faint/Audio Unclear ]. You need to have a basic [ Indiscernible ]. I think after the three days you will be able to pick pieces out of what you have learned to do that. Um, so the teaching, people, I feel we can help you with that. The reference, that's hard, even now. Even after I've been in this program for seven years, people will still bring me a question, and I will have no clue what they just tried to do in this experiment and how I'm to help them. [ Speaker Faint/Audio Unclear ]. Remember that you are experts, you have databases to search. [ Speaker Faint/Audio Unclear ]. All of that kind of stuff that we actually tend to do. The tool is probably capable of doing what they wanted it to do, but they only used the basic [ Indiscernible ] of the tool. Feel good about your expertise, you will invest the time in those things that your scientists might not.
Exactly. Librarians like to search and everybody else likes to find. Most scientists use a tool like BLAST as if it's Google. And they do. They go in and put their thing in and push the I'm Feeling Lucky button. They don't pursue it. What I will show you, probably 90% of the people at your facility, if they're research faculty using these tools, 90% will not know what you will find out about these tools. You can feel confident that you will be able to show them new things, I can guarantee that.
[ Class Participant Speaking ].
I think we are recording, I could see the mic responding on the screen. Okay. Um, this class is set up online as web pages. If you have done the prerequisite stuff, you've seen the course. You can go to PubMed and then go to NCBI, you can scroll down to education, here. And we have the Medical Library Association course. That's where we will start. It is built in modules. What is that?
[ Class Participant Speaking ].
Go to PubMed. pubmed.gov. Then go up in the left-hand corner and click on the NCBI logo. Okay? Scroll down on the left, you will see education and click on that link there. Okay? And then, um, on the left side is the course. And we're now in the modules. Okay? Everybody here? We're going to start with the molecular biology review. Just like with any other field, a lot of what you learn when you are in a new field is the vocabulary. It's a new language that you learn to speak so you don't have to spend so much time trying to explain what you are talking about. There is a video, let's see. Here's your reading, which you all of course did, right? It's been a while since I've looked at this. [ Speaker Faint/Audio Unclear ]. I don't know how to get back. Sorry about this. I think we won't --
[ Class Participant Speaking ].
Yeah. Try to keep your sound down, there's going to be a lot of the same thing being broadcast here. It's, um, go to the next. And then just click on one or the other. I guess I could have clicked on windows media player, that might have worked.
We'll keep sharing. All sharing means is that the people in the room can see what you are doing, so.
Right. It still says that you need Adobe flash to play it.
Should you be downloading it?
It should be opening up, actually. Yeah. Let's just skip the video for now. It's just a general introduction to what I'll talk about, anyway. Okay. These are the terms we will talk about. The central dogma of biology is -- does everybody know how to navigate the slides? The central dogma states that DNA makes RNA, makes protein. Um, Francis Crick coined this back in 1958. In 1953 they discovered the structure of DNA. In 1958 they were beginning to look at how DNA is responsible for what is happening in the cell. And so, lost my [ Indiscernible ], um, one of the things that Francis Crick has said is that he regrets naming it the central dogma. At the time he didn't understand what a dogma was when he said that. Um, this is all basically true. But that's not all there is to it. Each one of these steps, I have another slide, when I teach from my own PowerPoint slides I have another slide that shows [ Speaker Faint/Audio Unclear ], thanks. I have another slide that shows all of the things that have been added in here since Francis Crick made this basic statement. In 1970 there was an article published in Nature that said that the central dogma may be a slight oversimplification of the actual state of things in molecular biology. Basically, what we will learn about in this set of things here are these three steps. Okay?
We're going to learn about the structure of DNA, the structure of RNA and how it works. And the structure of proteins and how they work. We will talk about transcription and translation, all right?
Good to know where all of these pieces fit together. Here's a typical eukaryotic cell. I will explain the difference between prokaryotic and eukaryotic in just a second. This cell has a nucleus. Outside in the cell are other organelles. There's the Golgi, the mitochondrion, the ribosome. Some of these components that are outside of the nucleus also contain DNA. They code for their own DNA. There are interesting evolutionary reasons that we'll talk about later on.
Basically we have two different types of cells, eukaryotic and prokaryotic. Eukaryotic cells are cells that have a nucleus. Prokaryotic cells do not have a nucleus. They're bacteria. They generally have a single chromosome, which is a large circular chromosome within them. Eukaryotic cells may have hundreds of chromosomes. All within this nucleus. A cell was originally described as a sack of liquid. Essentially it is, 95% of you is essentially water. But a lot of the materials within the cell [ Indiscernible ] and each one of those things that you saw in the last slide, the endoplasmic reticulum, a lot of this machinery that you see in the cell is devoted to that central dogma function, DNA makes RNA, makes protein. One of the differences between a eukaryotic and prokaryotic cell that is somewhat important is that within the nucleus in a eukaryotic cell you make the RNA. The RNA is produced from the DNA in the nucleus and transported out into this area to be translated into protein. You don't have to do that within a prokaryotic cell. So when you are producing your proteins it happens much quicker. You don't have to do this transport. There's not a lot of processing of the RNA. It's a fast process. That's why an average prokaryotic cell under optimal conditions will reproduce every 20 minutes. Whereas a eukaryotic cell most of the time, the average human cell will take 24 hours to replicate. Okay?
You hear a lot of talk about genomes. What is our genome? The human genome is composed of all of the DNA within a cell. All of the chromosomes and the DNA in the mitochondrion. That is our genome. Plant cells have another organelle that also has DNA. It's composed of the chromosomes plus the mitochondrion, plus the chloroplast's DNA. Humans have 22 pairs of autosomes and 2 sex chromosomes. Altogether we have 46 chromosomes when they're in the diploid state, one pair of each chromosome. Okay? That means two of each, is what diploid means.
Getting into finer detail here. Within the nucleus most of the time you don't see chromosomes looking like this. They tend to not be so compact until they're ready for cell division. You see a chromosome kind of uncoiled to a certain degree within the nucleus. That's because the function of that chromosome is such that it has to be somewhat uncoiled in order for transcription to take place. Okay?
Here's a chromosome. At one end, at each end, actually, you have a telomere. They're important because of the way that DNA replicates. One of the things that I left out, or that was left out of the central dogma, was DNA replication. If you've got DNA makes RNA, makes protein, you also have DNA makes DNA. So when a cell replicates it has to reproduce its own DNA. It has to reproduce that double strand, okay? Maybe I should go on a little bit before I get into details like this.
The function of the telomere is very important. Especially in states like cancer. Where you have a cancer cell because there is a protein that is necessary for a chromosome to replicate. If you don't have a protein called telomerase the replication of the telomere won't occur properly. You keep cutting shorter and shorter on that chromosome. You don't replicate all the way to the end. One of the things that has to happen in a cancer cell is that this protein called telomerase comes back on in a cancer state. It allows the telomere to be replicated properly. Otherwise you would chop down until you got into necessary coding regions and the cell would die. There's a structure in the center, or it can be anywhere along the chromosome, called a centromere. This is highly repetitive DNA. This is where, when a cell is undergoing replication, there are microtubules that attach to the centromere area. The two [ Indiscernible ] are pulled apart when the cell divides. Okay? The reason that cells can hold -- if you were to string out all of the DNA in a human cell you would have over six feet of DNA. At the molecular level that's a lot of stuff. The reason that cells can do that is because they have proteins called histones that are responsible for compaction of the DNA into these, um, structures. So DNA winds around histones to compact and be able to fit into the cell nucleus. Okay?
This is what everybody thinks of, this is the double helix. When you think of DNA you think of the double helix. You have two complementary strands. They're each composed of a sugar phosphate backbone and bases that pair together in a certain way. That rule makes it very easy now for us to do matching. It makes it easy for the cell to do matching as well, in order to replicate that DNA. A always pairs with T. C always pairs with G. Let me move on before I tell you about different strands. This is more detail on what DNA looks like. Here you have the phosphate. Here you have the sugar. This is ribose. Here you have the base, okay? If you look at the structure here, you can see that there's a difference in orientation. This point on the sugar is pointing down on this strand, it's pointing up on this strand. That gives the directionality to each of these strands. Enzymes that are responsible for transcription and for translation and for DNA replication all pay close attention to this orientation. You can code for proteins with either one of these strands. That enzyme that, um, transcribes the DNA will be going in this direction on this strand for one that codes on this side. Okay? That's a general rule. Five prime to three prime. You also notice here that the AT connection has two hydrogen bonds. The GC has three. That means that DNA that has a high proportion of GC is going to be harder to melt apart than DNA that has a higher proportion of AT. That plays a role in a lot of the molecular biology that is done because, for instance, everybody knows what PCR is. It's a way of replicating large amounts of DNA in a test tube. You have to know some of the conditions for melting the DNA apart to get PCR to work. You need to know about the composition of the DNA to know how to set the conditions for melting.
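The pairing rule and the two-versus-three hydrogen bond count described here can be sketched in a few lines of Python. This is only an illustration: the sequence and the function names are made up for the example, and the bond count is a rough proxy for how hard a stretch of DNA is to melt, not a melting-temperature formula.

```python
# Watson-Crick pairing rule from the lecture: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(PAIR[base] for base in reversed(seq))


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def hydrogen_bonds(seq):
    """Count hydrogen bonds in the duplex: 2 per A-T pair, 3 per G-C pair."""
    return sum(3 if base in "GC" else 2 for base in seq)


example = "ATGCGC"  # hypothetical sequence, chosen only for illustration
print(reverse_complement(example))
print(gc_fraction(example))
print(hydrogen_bonds(example))
```

A sequence with a higher GC fraction has more hydrogen bonds per base pair, which is the reason given above for GC-rich DNA needing harsher melting conditions.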
Here's an even more detailed view. Now you can see why it's called deoxyribo. Here's an oxygen attached to this phosphate group. This is how the [ Indiscernible ] always go. In RNA there's another oxygen that comes off here. Okay? One of the differences between RNA and DNA is this. That oxygen is there in RNA, it's missing in DNA. That oxygen plays a big role in how stable DNA is versus RNA. Because that oxygen is sticking up here in RNA the RNA can't compact as readily. It leaves itself open. RNA is much, much less stable. RNA is hard to work with in the lab, there's an enzyme that is called RNase that you can boil and it will still function. That is prevalent in many cells. There's a reason for that. RNA is responsible for a lot of what happens in the cell. As a cell is going through differentiation it produces different RNAs as it goes through. If the RNAs were persistent the cell wouldn't be able to move on to the next step, you would still be producing the same protein from that RNA. It's easily gotten rid of in the cell.
Here's where the numbering, five prime to three prime comes from. The numbering is based on the carbons in this sugar molecule. The fifth carbon is the one that the phosphate attaches to up here. The five prime to this three prime direction here. Okay? Any questions?
Okay, let's move on. I have no idea, either. [ Speaker Faint/Audio Unclear ]. [ Laughter ] There is a standard for nucleotide base codes, that's important. You are all librarians here. You know how important it is to have metadata that is standardized. This is for nucleotides. A, C, G, T, add in [ Indiscernible ], the cell uses [ Indiscernible ]. Everybody knows these four. There are others, this comes about because of techniques that have been used to look at sequence. Sometimes it's hard to distinguish between different nucleotides when you are running a sequencing [ Indiscernible ]. You have to make a call, you can call it, if you can distinguish between A or C [ Speaker Faint/Audio Unclear ]. That video that we missed shows you what a sequencing gel looks like. We'll talk about why these things happen.
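The standard being referred to is presumably the IUPAC nucleotide code, which assigns one letter to each set of bases a position might hold. A minimal sketch of how such a lookup works is shown below; the function names are hypothetical and the slide may have shown a different subset of the codes.

```python
# IUPAC nucleotide codes: each letter stands for a set of possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Reverse lookup: from a set of candidate bases to the single-letter code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def ambiguity_code(candidates):
    """Return the one-letter code for a position that could be any of `candidates`."""
    return CODE_FOR[frozenset(candidates)]


def matches(code, base):
    """True if `base` is one of the bases that `code` allows."""
    return base in IUPAC[code]


# A position where the sequencing read could be either A or C is called "M".
print(ambiguity_code("AC"))
```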
Okay. We've talked about chromosomes and genomes, what is a gene? Functional and physical unit of heredity passed from parent to offspring. In eukaryotes it's composed of a stretch of DNA that includes things called introns and exons. We think of a promoter region which is five prime to that gene, we have the start of transcription, which is a signal within the DNA, and then we have what are called exons that are coding regions within the DNA. You also have, within eukaryotic cells and certain types of bacteria called [ Indiscernible ] bacteria, introns. Introns are sections of the messenger RNA, the RNA that is produced, that are spliced out before that [ Indiscernible ] can be turned into a protein. Okay? It's an interesting evolutionary, um, question, why there are introns in genes. If you look at the archaebacteria it's interesting, they are the only group of prokaryotes that have introns. It's thought they've been put into a separate kingdom completely because of this fact.
Go back one. It says here at the bottom the function of introns is not yet known. That's true. Um, but there's some interesting things about introns. You can actually have a gene that transcribes in this direction that contains this entire piece of DNA. Within this you can have another gene that transcribes in that direction. Structurally it's very complicated where things can be in relation to one another.
Okay. Here's the regulatory regions. This is the promoter region. There has to be some way of turning it on and off for a gene to be transcribed. It happens in this five prime area of the gene, called the promoter region. These play several different roles. They can open up the structure of the DNA to allow an enzyme called RNA polymerase to come in and start to transcribe. It's downstream from the promoter region. Then you have a downstream region, the three prime end. Sometimes it's complicated here too because you can sometimes have, um, transcriptional activity. You can have transcription factors that bind at the five prime end and the three prime end. This is what is thought of as the structure of the gene, typically. Okay? Once this is produced, once you transcribe that RNA it's called messenger RNA. So transcription is the process of making RNA from DNA. Many scientists have devoted their entire careers to looking at that one piece. How is this particular gene regulated? What are the signals that cause it to be transcribed? We're getting to the point now where we can start to physically understand what is occurring for a lot of the gene transcription processes and control of the transcription starts and why proteins are expressed differentially in cells. Now we can start to look at how we can drive a particular cell down a particular differentiation pathway. Because we can understand what is happening with these controls of gene regulation. How to turn a gene on and off is important for how a particular cell differentiates. If we have undifferentiated cells that we can direct down any path that we want we'll be able to do a lot of important medical work. Okay?
The messenger RNA has a structure. Within the genomic DNA you have this area where you started transcribing until you hit that first exon. You have this area that is not translated. Then you hit the exon where it is translated. Then you have a variety of different structures for different genes. And the three prime area. The primary transcript of the messenger RNA looks exactly the same as the genomic DNA, in terms of sequence structure. But that structure gets processed, as I said before, these introns get spliced out, they get cut out. The two ends are joined back together to make this one mature transcript. Okay? In the, um, records that you look at at NCBI this mature mRNA is called CDS.
Cells can be very efficient at how they put things together and how they use their energy. One of the ways that they've been very efficient is what is called alternative splicing. So a single gene may produce more than one product. Because you can splice out pieces differentially. If I was to splice this piece out completely I would have this transcript down here. If I was to splice this piece out between 2 and 4 I would have this transcript here. Which has left out part of the coding region and part of the functional piece of that protein. This protein that is produced from this messenger RNA may have a completely different function than the protein that is produced from the other messenger RNA. Okay?
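Alternative splicing can be pictured as choosing which exons to join. The sketch below assumes a made-up gene with four short exons (the sequences and the `splice` helper are hypothetical, not taken from the slide) and builds two mature transcripts from the same gene, one of which skips the exon between 2 and 4.

```python
# Hypothetical exons of one gene, numbered in 5' to 3' order.
# Introns are never stored here: splicing removes them from the primary transcript.
EXONS = {1: "AUGGCU", 2: "GAAUUC", 3: "CCGGGA", 4: "UUUUAA"}


def splice(exons, keep):
    """Join the chosen exons in their original order to form a mature mRNA."""
    return "".join(exons[number] for number in sorted(keep))


full_transcript = splice(EXONS, [1, 2, 3, 4])   # every exon retained
skipped_transcript = splice(EXONS, [1, 2, 4])   # exon 3 spliced out

print(full_transcript)
print(skipped_transcript)
```

The two transcripts share their beginning and end but differ in the middle, which is why the resulting proteins can differ in function.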
This is normal function. Many different, I always like to think, if you look at proteins you've got lots of what are called conserved domains. There are functional domains that do certain things within proteins, they do it within lots of different proteins. It may be a DNA binding domain. There's a particular DNA binding domain, the homeobox, that is used in lots and lots of proteins. If you put that homeobox into a protein with an enzyme here or a particular charge on a protein over here it will do different things than it does without that charge. So sometimes a DNA binding protein can be a transcription factor that starts transcription. Sometimes it can be something that stops transcription because it's blocking it. Okay? Those functions can be traded back and forth. If you look at, especially in a bacterial cell. They're under a lot of pressure to keep their genomes small so they can replicate quickly. In bacteria you will often find four or five different functions that every cell needs to be alive, several of the functions that we have a single gene for each, you will find all five in one protein in a bacterial cell.
It's inside of the nucleus. All of the processing for the RNA takes place inside of the nucleus. It does get touched. You have the mature RNA, but there are things that are done to the five prime and the three prime ends of that transcript. There are caps put on the end of a transcript. You add, um, a poly-A tail on the end. There are things done to help to make it more or less stable as it's released out into the cytoplasm.
[ Class Participant Speaking ].
Yes. Yes. It doesn't have to be. That processing could start as it's being transcribed. There are some, um, [ Indiscernible ] genome is relatively small. There are some genes that have introns that are 70,000 bases long. So, because there is a physical amount of time that has to take place for that transcription to occur, it's thought that these long introns may play a part in the timing. If you start production of an enzyme at a particular point, the processing it takes, it's about 60 nucleotides a second that it can transcribe. It can take up to an hour to do 70,000, just to cut out. It seems like a big waste. But it actually is timing. Setting a timing pattern for production of a particular protein that is necessary for one of the steps that has to occur later on.
[ Class Participant Speaking ].
Yes. Yes. It's called processive. It means base by base. Okay, once we've produced our message. Here you see we have our RNA, which is not as tightly coiled because of the extra oxygen in there. Think that these things have physical structure. They really do need to fit into spaces. They need to have relationships to each other. And they do physically interact with each other. That single oxygen atom plays a big role in the structure of this RNA. That's very important for when we start to think about protein structure. These things really do occupy physical space. That's important for what they do. Once we've produced our mature transcript and put it out into the cytoplasm we can translate it. Change it from RNA into protein. That's done according to a code. Every three nucleotides make one codon, each codon codes for one amino acid. Okay? So your messenger RNA will have three times as many nucleotides as the protein it produces will have amino acids. Translation is outside of the nucleus in the cytoplasm of the cell. A bacterial cell can transcribe and produce protein on that same [ Indiscernible ] as it's being produced.
[ Class Participant Speaking ].
The codon is three nucleotides that will be translated into one amino acid in the protein. I think we may have a codon table. Might make it a little easier to understand. Oops, sorry. [ Laughter ] Here's the genetic code, okay? You can see it's a fairly small and simple table. It took a lot of work to put this together. Here we have our nucleotides. Here's the uracil instead of thymine. If we were to read a messenger RNA starting here. If I said I had the UUG. That particular combination would produce tryptophan. Okay? UUG, leucine, I'm sorry. Here is UUG. There's three, um, nucleotides that go into producing that one amino acid. Okay?
We have what are called start and stop codons. Every protein starts with methionine. At the start of every translation start site there's an AUG. Methionine is always first. There's stop codons. They don't code for any amino acid. Because they don't code for any, no more protein is produced, that's the end of the protein.
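Reading a message codon by codon, starting at AUG and ending at a stop codon, can be sketched as follows. The table is the standard genetic code in single-letter amino acid notation; the example message and the `translate` helper are made up for illustration, and the sketch ignores everything a real ribosome does beyond the lookup.

```python
from itertools import product

# Standard genetic code, listed in the usual U, C, A, G table order.
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the message."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: the protein ends here
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical message: two untranslated bases, then AUG UUG UGG UAA.
print(translate("GGAUGUUGUGGUAAGC"))
```

In this notation M is methionine, L is leucine (the UUG from the table above) and W is tryptophan.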
[ Class Participant Speaking ].
No. Actually, um, there have been, there's been lots of stuff done looking at transcription in electron microscope. You can see, it's difficult to capture because those are delicate processes. The binding of the proteins and things like that, in order to get them down on to a thing that you can see you disrupt a lot of it. You can capture the images of a messenger RNA being produced from a piece of DNA.
[ Class Participant Speaking ].
Yeah. Well, there was the hypothesis first. When people knew about the structure of DNA they said, okay, how does this play a role in proteins? There had to be a way to do it. Everybody knew there was lots of RNA in the cell. There were small pieces of RNA always present in every cell. I will show you, um, let's go back. We'll talk more about the physical process in just a few minutes. So every one of these transcripts has a start codon. And a stop codon. So you are producing this string of amino acids from this codon set. I always use the single letter amino acid code, it's easier to think about three to one. Each one of these three letters stands for one amino acid. If you look at this you will see that there are some combinations like UCA or UCG, there's redundancy in this coding table. You will also see that oftentimes, if I have, um, CG and any of these, I will produce an arginine. Oftentimes you will see what is called wobble at the three prime nucleotide in the codon. Because they can code for so many different things. So, the specification of that third base in a codon is not as important as the first two, usually. You can have a lot of mutations within a string of DNA in that third base pair position that don't change what the protein that it produces is. A lot of, go ahead.
[ Class Participant Speaking ].
Um, it's thought it might have been the other way. RNA is thought to maybe have been the first nucleic acid that was used for life. Thymine simply, I think, matches with the double helix better than the uracil does. It fits in the combinations better.
Okay, so this redundancy means that we have, every cell normally codes for 20 amino acids. Okay? There's a standard set of amino acids that every cell uses. You have 64 codons. That's its redundancy. Here's the wobble hypothesis. I didn't know this slide was here. So, it says here if any of the codons are read backwards most will code for different amino acids. You have to know which strand you are reading from to know what you are going to produce as far as a protein is concerned. When you sequence a piece of DNA you have no idea, first, whether it codes for a protein; second, which of the strands is coding for the protein; and third, which of the reading frames is used. You have lots of different possibilities. You have six different reading frames that you can read from on any given messenger RNA. We could start our reading here. At ACT and produce [ Indiscernible ] or leucine, depending on which frame we were in. Okay? If we're coming from the other direction, the same thing, you're going to produce different amino acids in that chain depending on what reading frame you use and which direction you are going. Okay? So there are programs that will do translation for you, you can put in a sequence of DNA and ask it to translate for you. Generally it will start with all six frames. You have to look for what the most likely possibility from all of the six frames is, based on what you know about the function of the protein, and what you know about how many different codons you may see. Generally with any given piece of DNA the longest region of coding is the correct one. Because there are 3 out of 64 stop codons, you get one about every, on average, you get one about every 18 base pairs. So if it's not selected against you will see a stop codon. If you are out of frame you will get a stop codon more often than you get an extension of a long open reading frame. It's just kind of statistics.
Out of that group of coding, you have three different stop codons. If you are looking for an open reading frame it's easy to first discard reading frames because they have stop codons that won't allow you to get a long enough reading frame to produce a protein, because they're not long enough. On average one every 21 amino acids.
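The six-frame idea and the "one stop every 21 amino acids" figure can both be sketched briefly. The code below is an illustration under stated assumptions: the DNA sequence is invented, the function names are hypothetical, and the stop-codon spacing assumes random sequence in which all 64 codons are equally likely.

```python
from itertools import product

# Standard genetic code for DNA codons (T in place of U); "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frames(dna):
    """Translate three offsets on each strand; keys are (strand, offset)."""
    frames = {}
    reverse = dna.translate(COMPLEMENT)[::-1]  # opposite strand, read 5' to 3'
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
            frames[(strand, offset)] = "".join(CODON_TABLE[c] for c in codons)
    return frames


def longest_open_stretch(protein):
    """Longest run of amino acids not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)


# With 3 stop codons out of 64, random sequence hits a stop about every 64/3 codons.
expected_codons_between_stops = 64 / 3

frames = six_frames("ATGGCTGAATTCCCGGGATTTTAA")  # hypothetical sequence
for key, protein in frames.items():
    print(key, protein, longest_open_stretch(protein))
print(round(expected_codons_between_stops, 1))
```

On a sequence this short several frames stay open by chance; the longest-open-frame rule of thumb only becomes informative on stretches much longer than the expected spacing between stops.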
[ Class Participant Speaking ].
If you are just taking a new piece of DNA and trying to figure out if it codes for something, first look for the first open reading frame, okay?
[ Class Participant Speaking ].
Right.
[ Class Participant Speaking ].
There are mutations called frame shift mutations. One nucleotide is deleted. You have shifted the frame, you have shifted that reading frame by one nucleotide. Therefore what that protein will be after that point is different than what it would have been.
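A frame shift can be shown on a toy sequence. The sketch below is an illustration only: the sequence and the `delete_base` helper are invented for the example, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons starting at the first base; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def delete_base(dna, index):
    """Remove the single nucleotide at position index (0-based)."""
    return dna[:index] + dna[index + 1:]


original = "ATGGCTCATTGGAGCTAA"      # made-up coding sequence
shifted = delete_base(original, 3)   # lose the first base of the second codon

print(translate(original))  # MAHWS*
print(translate(shifted))   # MLIGA
```

Everything after the deleted nucleotide is read in a different frame, so every amino acid downstream of that point changes and the original stop codon is no longer read as a stop.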
[ Class Participant Speaking ].
It can, definitely.
[ Class Participant Speaking ].
Generally, um, generally not that often. Simply because of the fact, if you've got good sequence you will not see frame shift because you have the entire sequence. If you pick out the longest open reading frame from that you are generally safe. That's all bench work. This is all computer work. So --
Okay. Are there any questions about transcription and translation?
[ Class Participant Speaking ].
There are 20 amino acids in every cell that are used, yeah.
[ Class Participant Speaking ].
Yes, every cell needs all of those 20 amino acids.
[ Class Participant Speaking ].
Right.
[ Class Participant Speaking ].
It's not that you have to remember the 20 codes, or that you have to remember how the shift would work. Just when somebody comes to talk to you, you can kind of go -- oh, that's some of the vocabulary [ Speaker Faint/Audio Unclear ]. Not that you have to memorize any of this. [ Speaker Faint/Audio Unclear ].
[ Class Participant Speaking ].
Messenger RNA. Chris mentioned there's different coding tables. Generally those are for mitochondrial DNA. The chromosome of almost every organism uses almost the same code for producing its protein. Mitochondria sometimes use different codes; it's not a major difference. They sometimes use a different code for producing protein. Generally, one of the stop codons has been turned into a coding codon.
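As a sketch of what a different coding table looks like, the example below compares the standard code with the vertebrate mitochondrial code. The four reassigned codons are taken from general knowledge of that table rather than from the lecture, which only mentions a stop codon turning into a coding codon; the toy sequence is made up.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumption: vertebrate mitochondrial code. TGA (a stop in the standard
# code) reads as tryptophan, ATA reads as methionine, AGA and AGG are stops.
VERTEBRATE_MITO = dict(STANDARD)
VERTEBRATE_MITO.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(dna, table):
    """Translate complete codons with the given table; '*' marks a stop."""
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


toy = "ATGTGAGCC"  # made-up sequence containing a TGA codon
print(translate(toy, STANDARD))         # M*A
print(translate(toy, VERTEBRATE_MITO))  # MWA
```

The same nine bases stop after one amino acid under the standard table and read straight through under the mitochondrial one, which is why translation tools ask which table to use.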
Okay. We've gone through that work to get to this point. DNA makes RNA, makes protein. The same with chloroplasts. It's only mitochondria that seem to be different.
[ Class Participant Speaking ].
Yeah. Here's our goal, to produce a protein. Proteins are the functional units of the cell. They do a lot of the work. They make the structure. They're the enzymes that do the physical labor in the cell. So what are they? How do they get put together? A protein is a polymer of amino acids. It's a string of grouped together amino acids. What is an amino acid? It is a chemical structure that looks like this. Each amino acid has an amino group, an acidic group and an R group. It's the R group that determines the properties of the amino acid. It's the R groups, as they're aligned, that determine the properties of any protein. Those R groups can vary a great deal, in terms of size and properties.
Here's our standard nomenclature for amino acids. You don't have to memorize this. Amino acids have different properties. They are usually based on several things. One is whether they're basic or acidic. Another is whether they're nonpolar or uncharged polar. Um, another is simply size. Some R groups are quite large.
Here you can see these are actually not to scale. Here you can see some of the differences: tryptophan has this large R group sticking out here, versus something like alanine which is here. It's a CH3, simply. Okay? [ Indiscernible ] and [ Indiscernible ] are the acidic. Again, you don't have to learn this. Just wanted you to see the variety of R groups here. Some of these things have the properties of actually affecting the structure of the entire protein. Something like proline is an amino acid that causes the protein to be shifted in structure. Those kind of properties are good to know a little bit about, because, if you are looking at certain functions proline can disrupt some structures. If you've got a substitution of a leucine for a proline, for instance, there's a leucine zipper, it fits into DNA as a binding protein. If you stick a P in the place of an L it makes a kink, it can no longer fit into the DNA groove, it doesn't bind anymore. You can predict when you look at a sequence of amino acids if you've got a SNP, you know that it codes for something. You know that coding has changed the amino acid that would be there. You can predict some of the differences and predict why you might see a [ Indiscernible ] mutation within.
I told you a little bit about conserved domains. A domain is a functional subunit of a protein. We're going to talk some during this whole course about a protein called MutL. This is a protein that is implicated in colon cancer. It is a DNA mismatch repair protein. If you get a mutation in this protein you tend to accumulate mutations in that cell a lot more than you would. One of the conserved domains is this mismatch here.
We'll go into some of these. The Cn3D program is a way of looking at these. Here we have the DNA mismatch repair domain from a variety of organisms. So human, then [ Indiscernible ], mouse, fruit fly, [ Indiscernible ], the mustard plant, [ Indiscernible ] the bacteria that causes Lyme disease, you can see the alignment between those. This top one I believe, I can't tell, it's alignment. You can see the conserved regions within these proteins. That conservation is not based on identity, it's based on amino acid properties. Okay? Sometimes conservation is identity. Sometimes it's properties.
[ Class Participant Speaking ].
Okay. Um, let me see if I can pick one out that's -- here you have, so, um, at this position you've got --
[ Class Participant Speaking ].
This is arginine here. Except for this [ Indiscernible ] here. Those are, this is a conserved domain. You're going all the way from human to bacteria here. This alignment is covering a huge amount of evolutionary distance. Each line there is a different species. Yeah. Anytime you see an [ Indiscernible ] number or a GI number there's a new sequence. We'll talk a little bit about how to look at the structure of a record and the results that you see; we'll discuss that later.
Proteins have structure. That structure is very important. They have four different types of structure. The primary structure, which is simply the sequence in that polymer. You have secondary structure. Secondary structure is the difference between two different types of structure, beta pleated sheets or [ Indiscernible ], simple structures within the protein. Tertiary structure is how that protein is folded in upon itself. Quaternary is how two proteins that are folded interact with each other. Oftentimes a functional unit may be composed of two or more different strings of amino acids, different proteins. Some enzymes will have four or five different strings of amino acids put together to make a functional unit. Hemoglobin is composed of four different hemoglobin proteins based around an iron center.
[ Class Participant Speaking ].
Some would call it a single protein. Basically, some will call this a peptide and this or this the protein. It can vary a little bit.
[ Class Participant Speaking ].
Um, both of those, actually. And another one. Because of the physical properties of the amino acids, some are charged, some are not. They will fold in relation to each of the properties and the environment that they're in at the time they're translated. Some that get translated into the endoplasmic reticulum are going out to be in the membrane of a cell. That environment for making the protein has more polar charges in it. It will cause the protein to fold differently than if it were just in the cytoplasm. You can't just take a string of amino acids in a peptide and predict how it will fold. We're getting close. There's a contest every year: somebody solves a structure of a protein. They don't publish it. They give that amino acid string out to groups who want to compete for this prize. They make a prediction of what that structure should look like. Then they bring out the structure and compare the results. Most of the time they're up to about 60% right. But I think as time has gone on we're getting closer and closer to being able to solve the structures from the amino acid sequence itself, but not yet.
You can .
Okay, so the human genome includes about 30,000 genes, and as we've seen, it can produce more proteins than 30,000, because of alternative splicing. Gene expression, turning on and off of those genes, is a very specific process that occurs in each cell, and that's what drives cell differentiation. Here you can see in a brain cell, we've got three different proteins, epinephrine, myosin and amylase. Epinephrine is expressed in the brain cell, myosin and amylase are not. How many of you know epinephrine is a hormone? Myosin is a protein that's found in your muscles generally, and amylase is an enzyme that breaks down sugar.
It makes sense that you're going to produce this, epinephrine is a hormone that causes, actually is epinephrine the same as what? Anybody know? What is epinephrine? I've forgotten! It's a stimulant of some kind. I've forgotten what it is! But it makes sense that it goes to the brain; if you knew what it was, I'm sure that it would make sense. Myosin is produced in the muscle cell, and amylase, so your salivary glands produce the protein amylase; it's secreted in your saliva and that's what breaks down the sugars, so sugars are broken down already in your saliva, where you swallow. And NCBI has some very nice tools for looking at gene expression; we'll show you those a little bit later on.
So genes can be turned on or off. Genes can be expressed highly or a little, and each of those things can be controlled in certain ways to cause a cell to differentiate into a particular cell type or to cause it to function in a particular way. Here you see Gene A may be expressed in both muscle cells and nerve cells, higher in muscle. Gene B may not be expressed in muscle or nerve but be expressed in skin, and Gene C may be expressed in all three. You've got different combinations of these 30,000 genes that are going to determine the unique properties of each cell.
You can also have different gene expressions at different times. Now, as I said, oftentimes, cancer cells express genes much differently than a normal cell does, and this particular gene may be a polymerase here. You may have turned it on as one of the properties of that particular type of cancer to allow that cell to go through cell division.
It may be one of two things. It may be that because it's expressing in a cancer cell, it may be that there's another gene expressing that it interacts with, that it normally wouldn't interact with in a normal cell, or it may be that you spliced that particular protein so it's got a different [INAUDIBLE] in it.
Are we going to take a break? You guys ready for a break? Okay, what else do we have to do this morning? I've forgotten. It's time for a break!
{LAUGHTER}.
Okay, let's take, um, 10 minutes? Is that a good break?
Okay, I think we're going to resume here.
Okay, let's get started. Chris pointed out to me during the break that because we didn't watch the video, there are some terms that were used on some of the other slides that you may not have known anything about. If you see things on the slide that you don't understand what they're referring to, ask, okay? Don't hesitate to ask. If I don't explain it, make me explain it.
I think we should, hopefully we'll be able to watch the video after lunch, so you'll at least see what you should have seen earlier.
Okay, we're going to talk a little bit more about sequence, but from a different way. We're going to talk about variation. Variation in sequence. Genetic variations are differences in the DNA sequence due to any number of things. Mutation is what causes genetic variation. Genetic variations that occur in more than 1% of a population would be considered useful polymorphisms for genetic linkage analysis. Linkage means what chromosome a particular trait maps to.
There are different types of genetic variation. There are single nucleotide polymorphisms, generally referred to as SNPs, and there are full databases of SNPs out there. There are small-scale insertions and deletions, and there are a great number of those. The work on looking at the relationship between chimpanzees and humans showed we were about 98% identical, but that was based upon DNA-to-DNA hybridization, so taking human DNA and chimpanzee DNA and putting it together and looking at what percent was not hybrid. However, those small-scale insertions and deletions, there are a lot of those in both of the genomes, and the hybridization did not show those small-scale changes, so we're actually not quite as related to chimpanzees as we thought. We're about 97%. {LAUGHTER}. So it's a little bit lower but it's not too bad.
There's polymorphic repetitive elements, and that's what's used in many of the court cases that you hear about. Where they're looking for, this is used for forensic evidence a great deal. And there is what's called microsatellite variation. We'll talk a little bit more about that later.
Mutation is a permanent transmissible change in the DNA sequence. It can be an insertion or a deletion, and it can be an insertion or a deletion of a single nucleotide or a large chunk, or it can be an alteration, and there are different types of alterations that can occur, and I'll show you some of those.
Here are some of the different types of mutations. You can get deletions of whole pieces of DNA, so it simply shortens that chromosome. That's a pretty major mutation. You can get duplications of entire pieces of chromosome, and you can get what are called inversions, where this piece of chromosome comes out and gets twisted around and reinserts in the opposite orientation, so it simply flips over. That changes the relationship of the genes on either side of those insertion points.
You have insertions where you take a piece from one chromosome and insert it into another chromosome, and you can get translocation where you take an end piece from one chromosome and insert it into the end piece of another, and oftentimes you'll get, this is called a reciprocal translocation. You've exchanged the two ends of these pieces of the chromosome.
So, alleles refers to genes, not whole chromosomes.
[Class Participant Speaking]
Yeah, that term is often used loosely, because sometimes scientists will refer to a translocation as an allele. It is not really that, it is simply a mutation. Allele refers to variation within a gene.
[Class Participant Speaking]
Yes, large pieces of the chromosome can be taken out, lost, reinserted in different orientations. This happens because of a process called recombination. These two strands of DNA, the two pairs of chromosomes that you have, actually exchange pieces on a somewhat regular basis, so every time a new generation of people is born, there are new combinations even within a single chromosome because of what's called recombination between those chromosomes. But in order for recombination to occur, you have to make breaks in the chromosome to exchange those pieces, and sometimes, if those breaks are not repaired properly or they occur at a time when the cell is undergoing a process like pulling apart, you lose pieces of chromosome, or you can exchange, because you have broken one here and broken one up here, and then it comes together in this way instead of reinserting as an exchange. So recombination is generally one of the big reasons that you get those kind of big chromosome mutations. Single nucleotide polymorphisms and the smaller mutations can be due to any number of things: you could be exposed to a chemical that could cause mutations; if you smoke it can cause mutations. Sunlight can cause mutations, and you can see that with basal cell carcinoma.
So, let's take a look at sequence mutations. If we look at what can happen in a sequence, so think of these as your codons, as they line up, "The fat cat saw the red dog". If we insert a letter in front but we keep our codon reading, you see that you're going to throw off that reading frame a great deal. It becomes a nonsense sentence. And those are actually called nonsense mutations.
If we take the f out, you see that you also change the reading frame here and another nonsense mutation. Or if you translocate the E and the T, it only affects that one codon, but it does change that sentence.
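The sentence on the slide can be played with directly. The sketch below is only an illustration: the helper name, the inserted letter X and the single-letter substitution are made up for the example. In standard usage, shifts like the first two are called frameshift mutations, as named earlier in the session, and "nonsense mutation" is usually kept for a change that creates a stop codon.

```python
def read_in_triplets(letters):
    """Split a run of letters into three-letter 'codons' from the first letter."""
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


sentence = "THEFATCATSAWTHEREDDOG"

inserted = "X" + sentence                         # one extra letter in front
deleted = sentence.replace("F", "", 1)            # the F taken out
substituted = sentence[:6] + "H" + sentence[7:]   # one letter swapped for another

print(read_in_triplets(sentence))     # THE FAT CAT SAW THE RED DOG
print(read_in_triplets(inserted))     # XTH EFA TCA TSA WTH ERE DDO G
print(read_in_triplets(deleted))      # THE ATC ATS AWT HER EDD OG
print(read_in_triplets(substituted))  # THE FAT HAT SAW THE RED DOG
```

An insertion or deletion scrambles every word after the change, while swapping a single letter changes only the one word it falls in.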
Down here, it talks about the most common mutation in cystic fibrosis, which is the loss of a single codon, so it's a loss of a single amino acid out of a gene that contains 250 thousand nucleotides.
This is the expanded list of terms. We aren't going to go through these today. But you can use these to go out and explore on your own, if you want to learn more. And the P and Q arms of a chromosome, how many of you know what that is? Most chromosomes are divided unevenly; the centromere is not exactly in the middle, so you have a short and a long arm of a chromosome. One arm is called P and one arm is called Q, for short and long, and the banding on a chromosome is numbered, so P21 would be the 21st band on the P arm of whatever chromosome they're looking at. I don't remember exactly, I don't use it anymore. It's not used that often. If you look at the map viewer, you'll be able to see which one is which. P is short?
[Class Participant Speaking].
Oh, right, right. I think they're counting from the centromere out.
[Class Participant Speaking].
Those are the bands, so if you look at human chromosomes, generally they're stained with a stain, and that stain is actually staining the DNA where it's compacted differentially on those chromosomes, and so you're picking up, it's different proteins that are aligned on that DNA and you're picking up the differential staining because of the proteins that are aligned there, and it's characteristic for each one, so it's characteristic for each species, so you've got banding patterns that you can identify. I don't know if I'll be able to show this. I guess not.
Okay, so here is your first hands-on experiment. Try using NCBI's ORF Finder to translate the following sequence into protein. So open up your computers, get to this slide.
[Class Participant Speaking].
Can you translate what that means?
Yeah.
Something between a person?
So this is the human version of a Drosophila gene. Okay?
Why would you call it that?
Because when you do a sequence alignment, it was originally found in the fruit fly, and then they found it in humans based upon the sequence that was cloned in Drosophila. So the name stands for, it's a circadian rhythm gene, so this is a gene that's responsible for a 24 hour cycle in the fruit fly. It's also responsible for the frequency of their wing beat in their courtship mating. We have the same thing in humans, and mutations in the human genes do cause circadian rhythm problems, so people who don't sleep well can have these mutations.
I'm going to PubMed.gov first.
[Class Participant Speaking].
That's all that it's supposed to do. Go to NCBI up here. Click on the logo. Let's turn this over, I think. And then click on NCBI. Okay, so now go down to education on this left-hand side.
It's under education. And what we want is the Medical Library Association. And then we want the modules and then we'll go over here to the slides and we'll do --
[Class Participant Speaking]. What was that?
Oh!
There you go. So now what you want to do is open this up, it will open in a different tab, or a different window, and you want to cut and paste that sequence into that tool.
Okay, everybody got a result?
Which one do you think is likely to be the right frame for this coding sequence? You can see how this is the longest open reading frame that was found, but there's stop codons that come up in the reading frames here and here and probably quite early. You have to have at least a hundred amino acids to draw something. This is actually, especially when I'm being recorded, {LAUGHTER}, I hate to criticize NCBI for their tools like this, but this is actually not the best one of these out there. There's lots of these; the European Molecular Biology Laboratory has a better translation program that would actually give you the amino acid sequence so you could see, you get to pick which is the starting translation and it will give you the sequence right out of that for the message.
[Class Participant Speaking].
Right. That's because you're simply picking out that open reading frame here, okay? This is the -- we haven't put anything in here to identify this particular sequence, so that's why you're not seeing anything up here. It simply gives you back the sequence that you put in, but it puts it back in the standard NCBI format. So it doesn't identify it because we haven't done a sequence alignment with it.
Now --
[Class Participant Speaking].
You'll see why they don't do that. That's tomorrow, actually.
So, these are the strands that you're reading on, so you're reading from both strands, so it's plus and minus strands, or the Watson and Crick strand, whatever you want to call it, plus or minus, coding and non-coding.
[Class Participant Speaking].
It's looking for codons, so all it's doing is translating those codons in every frame into amino acids and seeing which one gives you the longest open reading frame.
[Class Participant Speaking].
Well, the database is what I showed you, that little table of codons.
[Class Participant Speaking].
That's right.
I don't understand --
Those are blank because up here, it specifies that you have to have at least 100 amino acids in a contiguous sequence in order to get it to list.
And so the first letter, this link represents the whole link --
What you put in and so the first reading frame here would start with the codon with 1, next one would start with 2 and the next one would start with 3 and in those first two, it didn't find open reading frames that were at least 100 amino acids long. Maybe it's not 100 amino acids, maybe it's a hundred nucleotides long, but that's not a very long sequence.
Okay, any other questions about the open reading frame finder? So we're looking to answer, first you want to answer, does this piece of DNA code for a protein. Right?
[Class Participant Speaking].
Yeah. She just wants to know exactly what we did and why we did it, okay? So we're looking to see if this particular piece of DNA can code for a protein, okay? That's the first thing that we want to know. This says yes. Okay? So when you stick it in there, then it gives you that graphic back. What this means is, okay, so in these first two, let's see if we get anything different if we change that. So in the first two, you had stop codons that kept you from reading across all the way, so there were no contiguous sequences that gave you that length that was required for the minimum to put it in. I changed it down here to 50 and so now I can see more. The reason that we think that this one is the best is because it's the longest out of those six. So this is a continuous length of amino acids starting with this nucleotide which would be probably, I have to show you, which would be this right here and reading in this direction. Okay?
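The effect of the minimum-length setting can be sketched as follows. This is a simplified illustration and not how NCBI's ORF Finder is actually implemented: it just reports every stop-free stretch in each of the six frames, without looking for a start codon, and the function names, the toy sequence and the small thresholds are all assumptions made for the example.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons starting at the first base; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def open_stretches(dna, min_len):
    """List (frame, peptide) for stop-free stretches of at least min_len residues."""
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for piece in translate(seq[offset:]).split("*"):
                if len(piece) >= min_len:
                    hits.append((f"{strand}{offset + 1}", piece))
    # Longest stretch first, since the longest one is the usual first guess.
    return sorted(hits, key=lambda hit: len(hit[1]), reverse=True)


toy = "ATGGATAACCTAACTAAAGGTTAA"  # made-up 24-base sequence
print(open_stretches(toy, 7))  # only the longest stretch passes
print(open_stretches(toy, 6))  # lowering the cutoff lets a second frame show
```

With the higher cutoff only one frame reports anything, and the others look blank; lowering the cutoff makes shorter stretches in other frames appear, which mirrors what happens on screen when the minimum is changed to 50.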
[Class Participant Speaking].
Right. And that's an important point. All of this, the work that you do on the computer, is simply going to help you determine what experiments to do in the lab. These are not answers. These are not things that you can say, okay, this is true and this is what I wanted. You have to go back to the lab and verify it. No matter what you do in silico, you've got to go back to the lab and verify it.
[Class Participant Speaking].
Right. So this would be the first nucleotide, the second nucleotide, the third, where you start the reading frame, or from this end, okay?
[Class Participant Speaking].
It's the right one? Then you'd take that, your sequence and you can take that amino acid sequence, you now have an amino acid sequence that's that length. You can run it in lots of different tools to determine the properties of those, of that peptide sequence. Okay? I'll show you, we'll show you some of those tools as we go along, and you'll see the kind of things that you can do with it.
So, there are three fields of study that have come out of a lot of this recent work: genomics, proteomics, and bioinformatics. Genetics has been around a lot longer, it's one of the basic fields of biology, but these other three are fairly recent and fairly new. Genomics is the study not just of single genes but of the whole function and interaction, and out of this kind of capability has now come a new field called Systems Biology, where you're starting to put together all of the gene interactions into pathways that you can look at, and how those pathways are regulated. So Systems Biology, you may have a Department of Systems Biology now at your place of work.
This is the definition of genetics, the study of single genes. Okay, proteomics is the study of the full set of proteins that are encoded by a genome. Bioinformatics has lots of definitions and it can be an extremely broad term because it's an extremely broad field, and a lot of people now refer to bioinformatics, not just the molecular biology part, but they're bringing in medical as well and calling it biomedical informatics. We now have a Center for Biomedical Informatics.
{LAUGHTER}.
[Class Participant Speaking].
The Director of the library talks about the bibliome -- bibliomics.
{LAUGHTER}.
Because he does a lot in the literature. So if you look at bioinformatics, it's an interesting mix of things because there are the people who are on the hard side. They build those tools that everybody is using. They do the analysis to put together those tools and put them out there for people. Bioinformaticians are actually probably one of the most open-source groups of people that you'll find anywhere, because almost all of them build their tools and just stick them out there on the web for anybody to use, so there's now probably close to 4,000 different databases out there with different things in them that you can use, and all of these different tools to analyze those databases. People are just sticking things up all the time, and it's really cool, but it's also very complex, and knowing which of those tools are good and how to use them, I think librarians will be very good at, if we can get people up to speed in learning how and why people would want to use these tools. So we're good at analyzing things. We're good at looking at how databases work and why they work the way they do, and I think that's one of the areas that could really help people, to put up a set of these tools on your websites that you recommend for people to use, and I'm starting to do that. It's hard because they change so fast, but I think it's important, and I was actually thinking about doing a survey of my faculty just to see the variety of tools that they are using in their own labs, and put up a list that says, "Here are the tools that Harvard faculty use".
And also the other side of things, where the majority of people who use these tools have no idea how they were built. It's like most of us driving a car. You don't necessarily know how that car runs; you don't know what all of the computer pieces are that go into making that car run. You don't know how the carburetor works particularly, but you still can drive it, so I think what librarians kind of have to act as is the mechanic. We have to be there when they have to understand what the tool is doing because they're running into a problem with it, but we have to be able to interpret what those builders meant for that to do and what the people who are using it are using it as. So we're kind of the intermediary. That's how I place myself in this area.
Okay. This is the end of this section. Remember, we covered this this morning? Yes, okay. So we're going to do quickly, types of databases. We're running behind.
at NCBI, there are and out there on the web, there are lots of different types of databases and if you think about it in one way, you can put them into these data domains. So they're based upon what they contain, and so it's the same kind of thing that we've already looked at, nucleotide sequence, protein sequence, 3D protein structures, so we've looked at all three of those things and complete nomes and maps. So these are our domains and you now also have gene expression which is becoming much bigger. I'm sure most of of you have heard of microarrays at some point, so lots of labs are running microarrays for thousands of different reasons. A lot of the gene expression data is coming from those types of analysis with microarrays, and they're looking at SNPs, genetic variation, polymorphism, SNP databases.
You can also think of them by the scope of what they cover. So these three databases, Gen Bank, EMBL, and DDBJ, are part of a international consortium that pool their data on a daily basis so they synchronize their databases daily to make sure that everyone contains the same information. They do different things with that information, but the basic database is the same across all three of them. And then they build their tools to do the analysis of what's in those databases, so Gen Bank this year is going to celebrate its 25th anniversary. It's been around for 25 years.
You also have protein databases, so protein is NCBI's tool. If you go to NCBI, you can see here on the drag down list, you've got protein, nucleotide and core nucleotide, okay? The core nucleotide database is the one that you probably would use if you were looking for a messenger RNA for something, although I have to tell you, we're going to spend a lot of time on these different database, but there's one database that you should no and that's Entre Gene. That brings all of this other stuff together, and I think if you don't know any other databases , if I go to -- I misspelled it.
The reason that you want to concentrate on Entre Gene and direct the people that are asking the questions, is because this is where NCBI is really doing its work in curation, in putting in the metadata that people need. If you look at the history of Gen Bank, Gen Bank started off as a place where people could deposit their sequence data. So that meant that anybody could put sequence in there. There was no control on what the metadata that was added to those sequences were, and over time, it's kind of become a mess. If you do a search in Gen Bank for something, you're going to find lots and lots of redundancy, lots of sequence that may not be as good as you want it to be, so there are problems with Gen Bank that NCBI is now in the process of fixing, and the way they're doing it is to go in for each one of the genes, pull out what they think they consider to be the best representative sequence for that gene, or set of sequences for a transcript that has multiple variations, bring it in here, put in the coding sequence, put in the protein sequence, and link it with all of the other database s that they have over here. Plus, lots of outside databases, so if you go here, you'll see what I look for was the amino transfer, which is an enzyme that deals with amino acid and it puts amino acids on to transfer RNA and we haven't talked about, I forgot to talk about how translation actually works but we'll get to that when you see one of the videos later.
So, if you look here, you'll see that the name that came up is actually " Got one the" so we got one. And it brings in, so it talks about here is the organism, you can link it out to the taxonomic database from here and it tells you, to see the related sequences, the related genes in different databases. It tells you the entire lineage for any organism that's in the database. It gives you a summary of the gene function right here. Any other name that it's known as in the database here, so if you search this by an old name, you should come up with the new name and all of the information about it. It tells you the genomic structure of this gene, it gives you the chromosomal context of the gene, it tells you which chromosome it's on, which band it's in, and what the relationship is to other genes surrounding it so here you see the direction of transcription, relative to other genes surrounding it, and one of the cool things that I like is gene rifs which is Gene References Into Function. Here they allow authors to put in, if they feel like they've got a paper they want to submit here to talk about the function of this protein, then they can put it in here and NC BI looks at these papers and decides if that's appropriate and they pick them up for people to see. So each paper that's here is supposed to actually talk about the function of this particular pro P teen.
They have links to markers that are on the chromosome surrounding this gene. They have links to pathways and the KEGG pathway. It stands for Kyoto Encyclopedia of Genes and Genomes, and --
[Class Participant Speaking].
This is a pathway, so you can see that you can get into the pathways.
[Class Participant Speaking].
{LAUGHTER}.
Actually, they're over here. Do I have to go back to that?
Yeah.
Okay, I see it. Okay, so I think, I'm using Firefox and the default on this computer is Internet Explorer.
Yeah.
Okay. We'll go through more of this, like Chris said, later on, but these are tools that your faculty, first of all, aren't going to know are here for the most part. They will not know this database, because it's fairly new, and not a lot of people use it. They go to their regular old nucleotide database, look for the nucleotide sequence, and that's what they bring up, okay? But those sequences can be brought up -- if I switch back and forth, is that going to cause a problem? Can't you just open this up in Internet Explorer then? {LAUGHTER}. Doesn't matter, I lost it anyway.
[Class Participant Speaking].
Thank you.
Have them call.
Okay, we're back. So, we're going to go through some of these other databases, and what we want you to learn from these is what kind of information is in them and how they have been treated as we go through, so that when you come to a resource, you can choose the correct database, because that's part of knowing how to use a tool. It's part of the searching. Again, most of the faculty have no idea that there are different databases behind each of these tools that they can choose, and that they will get different results from.
Okay, so we have the protein databases. There's Entrez Protein. Swiss-Prot is a database from EMBL, originally put together based upon proteins that had actually been sequenced as the amino acid sequence rather than as a nucleotide sequence and translated. So, it was the prime database up until it couldn't keep up anymore, because people aren't sequencing amino acid sequences anymore. So now, it's called Swiss-Prot/TrEMBL, which stands for Swiss-Prot, the original database, plus the translated EMBL database, so they put the translations into this database, but it's still a very high quality database of protein information.
The Protein Information Resource is also quite good and UniProt is also quite good, so the protein databases tend to be very well characterized and fairly good, but I think again, we're going to show you some databases that you're going to want to let your faculty know about for them to go in and pull out the sequences they're going to use.
There are specialized databases for protein structure. MMDB includes everything that's in the Protein Data Bank, plus data from some other places that's pulled in, and it's the Molecular Modeling Database. This is what's used by NCBI to pull up its structures when you look for protein structures. And Entrez Genome is the only one talked about here, but there are lots of databases out there as well, and I think we'll probably show you a couple of other ones. There are particular aspects of NCBI's database that I like and there are particular aspects that make it very hard to use, and so we'll switch back and forth between the different tools that are better for particular purposes.
[Class Participant Speaking].
If you look at, I'm going to actually go back there, if you look at, if you go to NCBI and you pull-down the list here, it's about quarter or half the way down on that screen. So you can see that you can pull that up. Any of these databases if you want to see them just as the entrance point, if you just hit the return in the search box with that up there, you'll see the entrance point to the database.
There are also organism-specific databases. You'll see lots of those out there, and I'm sure you've seen some of them. We were just talking about FlyBase at the break. That's one of them. And there are databases for categories or functions; these are TRANSFAC and the Vector database. TRANSFAC is a tool where you can look up where a transcription factor would act and which factor may be regulating your gene of interest. It will tell you sequence information about where a particular factor binds. And here are some of the terms that we missed from the video. EST stands for Expressed Sequence Tags. GSS: Genome Survey Sequences. Sequence Tagged Sites and High Throughput Sequences. We're going to look at what those are and where they come from and why they're special.
[Class Participant Speaking].
The other issue is there's a public version and a commercial version, and some are licensed for a more commercial version, but you can use it with other groups, so if you do selection in the library and you're wondering about collection development, [INAUDIBLE], which package is the decision, and until you start using the public ones you may not know what the difference between the value of the public and private is, so it could be that you need to register and explore that, because eventually the question will come, well, what am I missing by not having the other version, and if you're doing collection development, you want it there.
I think that's fine. That's one of the things also we'll talk about on Wednesday, there's a lot of, I mean on Friday, {LAUGHTER}.
{LAUGHTER}.
There's a lot of interest in what tools, what should a library be supporting, and so if you are a library that plans to do bioinformatics support then you might think about licensing some of these databases because they are among the best of the tools that are out there. It is, every January, there's a list of databases, no, is it every January?
[Class Participant Speaking].
January is the database list, and then the web resources is in July I believe, first issue in July. I can't remember.
[Class Participant Speaking].
What the?
It's the web resources that are out there, so tools that are out there for free that people can use, bioinformatics. January is the databases, July is the web resource tools.
So some of these things, like ESTs, are not talked about quite so much anymore as they were originally. Expressed Sequence Tags comes from the fact that you can take a messenger RNA from a cell and, using an enzyme called reverse transcriptase, turn it into DNA, and once you've got that piece of DNA, you can clone it, and you can sequence from the ends of that message, okay? So that means that we've got tags now for that particular cell type that we can use to go in, and we can do a number of things with that. One is you can look at all of the things that are expressed in that particular cell type from the sequence. You can also use those as markers within a cell, so that you've got a molecular handle on a particular place within a chromosome, and these were used a lot in the beginning of the human genome sequencing project.
The same with sequence tagged sites. These were short pieces of sequence that you could go in and use as a way of getting into a particular region of a chromosome. Okay? So, originally, the ESTs were included in GenBank as part of GenBank, but because so many ESTs were put out there, it used to be that when you did a search against GenBank, all you would get back was ESTs, which aren't necessarily that useful for a general purpose thing. So they put the ESTs into a separate database now, so if you want to look for ESTs to use as markers with a chromosome, you can use the EST database or the sequence tagged sites that are also in that same EST database.
The most important part of this whole thing is the level of curation. So you have two different types of curation that are here. You have archival data, which is simply a repository. It's the place where everybody puts things, and that's GenBank: an archive of everybody's sequence that they stuck in there along the way. It's redundant. Even though when you look at, I can't show you right now, we'll show you tomorrow, if you run a search in the nucleotide database or in the protein database of NCBI, it says nucleotide NR, which stands for non-redundant, that non-redundant is a historical term that comes from the fact that they only let each lab put in one copy of that particular gene sequence. So it was unique to each lab, but lots of labs were doing lots of genes, and there's lots of redundancy in there, and all of the different projects are in there, so you've got genomic DNA, messenger RNA, you've got EST sequences, you've got lots of different redundancy in there. So GenBank is very redundant, and that's one of the problems with it, and as I've said before, there's also no controlled vocabulary. There was no one controlling the metadata that people were putting in there, so if you search in GenBank by keyword, you almost don't find things, because people don't put tags into those sequences. They weren't interested in letting people find things that way. They were interested in putting the sequence in so they could publish their paper.
And there's a lot of variation in what is included in terms of annotation of that gene, because one lab may think it has one particular function because that's what that lab thinks about, and another lab may find the same sequence has a different function because that's what that lab is concerned with, and so there was a lot of variation in what was said about those different genes. But NCBI is putting together this non-redundant database, one record for each gene or splice variant, called Entrez Gene, and the database of those sequences itself is called Reference Sequences, and those are the best sequences that NCBI has pulled out of this whole group of sequences from GenBank.
Each record is an encapsulation, so it brings all of that information together, and the records contain value-added information, so the GeneRIFs are one part of that, and Entrez Gene links out to other resources that NCBI doesn't have. So, Entrez Gene is an excellent resource to start almost everything with.
This site says there's hundreds of databases out there. So here you can see that these are all databases. And this is just one of the metasites for databases. Deciding what to use is part of, I think, what we should be doing for our faculty. Okay, I've said most of the stuff here. You can get more information about what GenBank is. Not a lot of people are doing this anymore, but you can still submit things to GenBank, and there's two different programs that you can use to do that, BankIt and Sequin. BankIt is used by labs that are doing small projects, and Sequin is what is used by genome project labs, and Sequin is probably the one being used most nowadays. Very few people are doing actual sequencing projects for small, individual genes anymore.
So in terms of individual labs doing it, it's mostly large scale labs doing genome projects that are submitting.
[Class Participant Speaking].
Depending on which one of those commercial databases it is, so, the protein library actually has PhD scientists reading papers, reading the primary literature and adding the annotations to each one of them. That is a model that is, it's essentially the Chem Abstracts model of putting a database together. That's used by lots of different companies now. Ingenuity is using it, GeneGo is using it. Well, they're taking the sequence information, that kind of thing, from NCBI, but they're taking the primary data right from the primary literature, so they have PhD scientists reading the papers and annotating those things.
RefSeq is the database to use if you want to get probably what NCBI considers the best sequence information, and they've used the RefSeq record to pull together a lot of that, so if you don't go to Entrez Gene, go to the RefSeq record for a gene, protein, or mRNA. You can recognize RefSeq accession numbers by having these two letters, so this stands for NCBI protein, and then an underscore and five or six numbers, okay? So the RefSeq record is distinguishable by its accession number.
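As a rough illustration of recognizing a RefSeq accession by its shape, here is a minimal Python sketch. The pattern (two capital letters, an underscore, then digits, with an optional version suffix) and the short prefix table are assumptions made for illustration; this is not NCBI's own validation logic, and the accession strings used with it are made-up examples rather than records checked against the database.

```python
import re

# Assumed shape: two capital letters, an underscore, digits, optional ".version".
# This is an illustrative sketch, not an official NCBI rule.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

# A few commonly seen prefixes; the table is illustrative, not exhaustive.
PREFIX_MEANING = {
    "NP": "protein",
    "NM": "mRNA",
    "NC": "complete genomic molecule",
}


def describe_accession(accession):
    """Return a short description if the string looks like a RefSeq accession, else None."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    prefix = match.group(1)
    return PREFIX_MEANING.get(prefix, "RefSeq-style record (prefix not in this sketch)")


# Made-up accession strings, used only to exercise the pattern.
for candidate in ["NP_123456", "NM_123456.2", "U49845"]:
    print(candidate, "->", describe_accession(candidate))
```

An accession without the letters-plus-underscore prefix, like the last string in the loop, falls through as a non-RefSeq identifier, which is the distinction the speaker is pointing at.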
The RefSeq records are accessible, like most other things at NCBI, through BLAST or in a variety of other ways.
Entrez Gene is a place where NCBI has put together the information about a particular gene and the protein that it codes for, or whatever that gene does. RefSeq is the database of those genes and proteins. Okay? So if you look at an Entrez Gene record, it's got things all about that particular gene, and if you look at a RefSeq record, it's got, let me show you the difference. I think it's easier to do this.
So here is the Entrez Gene record. It's got all of this type of information, so it tells you about the structure of the gene itself, it tells you about the map context and the chromosome, it gives you all of this other information, but down here, you've got the reference sequences in this record. So here is the protein record for that particular sequence, and now what you see is a regular Entrez Protein page, but the difference between RefSeq and any other protein that's in this database will be that there's a lot more annotation here, so you'll have a long list of papers about this, you'll have a summary of the function of the gene, you'll have lots of features that are listed here, and the sequence. Okay?
So RefSeq is the database of the genes and proteins themselves, the physical sequence.
[Class Participant Speaking].
It's more basic. Entrez Gene brings all of the information about that protein or that gene into one place. Its function, all of those things.
Yes. Yes.
[ Class Participant Speaking ].
I lost my -- okay. RefSeq contains, um, all of the different types of molecules we've looked at. It has the genomic sequences, the protein sequences. And it's got, um, [ Indiscernible ] models from. I showed you that model of what the gene structure is for a couple of those genes. We'll go into more of that later on tomorrow. There are status codes. Because NCBI's still in the process of building these databases, there are different status codes for the curation. RefSeq is a curated database. And you have statuses, just like in PubMed, when you see a record come into PubMed new, it says in process. Later on it might say indexed. This is the same. You've got the provisional. You've got the ones that have been reviewed. So that is the same as being indexed. You may have a predicted, actually, for the structure of the particular gene. So sometimes the gene structure is not known because of the variants in splicing, there can be questions about which ones are real and which aren't, things like that. Genome annotation, once you get to genome annotation you've picked out that region. You've got to the point where that's a very well curated record. [ Indiscernible ] is when they're first starting to put things together. It's not complete. With a predicted, it may be that the only thing that is missing is the gene structure within the molecular map. Okay? So provisional is a beginning, they're bringing in the information, still reviewing what is going to be put in there. [ Indiscernible ] is more that they may not know exactly what the answer is.
[ Class Participant Speaking ].
Oh, yes.
[ Class Participant Speaking ].
It still is, actually. Um, we're very close to having it complete. But there's the 1000 Genomes Project going on right now. So they're trying to now go back and sequence a thousand different humans. Because what we had when it was declared complete, we had five-fold coverage. Because of the way that sequencing is done, and because of the repetitive DNA in the genome, it was difficult to determine order within the genome, and to figure out, because of that redundancy, where particular things may be in relation to other things. So five-fold sequencing is not actually enough to get a complete map of the genome.
I think what we'll do is go to lunch. We'll try to get the flash on to these machines to show you the videos when we get back, okay? What do you think, Chris? Let's make it 1:30 so we don't get too far behind. Everybody back here by 1:30, okay?
Session on lunch break until 1:30 eastern time zone. [ Laughter ]
As you can see we have sound now. We're going to start, there's two videos we'll have you watch. Part of what I covered this morning [ Speaker Faint/Audio Unclear ]. You will see a lot of that in the video. If you have questions after the videos, feel free to ask them. One part that I skipped is the adaptation part of translating, um, messenger RNA into proteins. There's an adapter that comes into play there, it's called transfer RNA. You will see how transfer RNA works, what it does, in one of these videos. Can't get too close [ Speaker Faint/Audio Unclear ].
[ Interference ] [ no audio for the captioner to hear, video closed captions are running at the bottom of the video ]
How did they address the fact that some of these samples of people they had might have had mutations or defects? That's one of the reasons they wanted to get a combination of different people because --
[ Class Participant Speaking ].
Different people.
[ Class Participant Speaking ].
I'm sure that --
Yeah. I mean --
It is a small pool. First of all, the purpose of the human genome project was to get the genome sequenced. It didn't matter about the variation. We wanted to get a baseline of information. The reason that it doesn't really matter if there are mutations is that every generation there's new mutations that arise. You all have mutations, yes you, because of the mutation that occurs due to the errors of DNA polymerase. Everyone has at least one mutation, if not more, every time you replicate a new cell. There are going to be lots of mutations in these genomes. If they sequence at least five people, they'll not all have the same mutation in the same gene in the same place. You look at the variation there. You try to make sure of what you've got: one person may have a defect at that place, the other four might be the same. You look at the consensus, that's what you draw as the normal. Okay? Any other questions?
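The idea of drawing the "normal" from the consensus can be sketched in a few lines of Python. This is a toy majority vote over sequences that are assumed to be already aligned and of equal length; the five short sequences are hypothetical, and real genome assembly is far more involved than this.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Majority-vote consensus over equal-length, already-aligned sequences."""
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be non-empty in number and share one length")
    result = []
    # zip(*...) walks the alignment one column (position) at a time.
    for column in zip(*aligned_sequences):
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Hypothetical sequences from five people; three carry one private difference each.
people = [
    "ATGCCGTA",
    "ATGCAGTA",  # differs at position 5
    "ATGCCGTA",
    "TTGCCGTA",  # differs at position 1
    "ATGCCGTT",  # differs at position 8
]

print(consensus(people))  # prints ATGCCGTA
```

Because each difference shows up in only one of the five, the other four outvote it at that position, which is the point being made about sampling several people.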
[ Class Participant Speaking ].
What did we just do?
[ Class Participant Speaking ].
How did we get back to that?
[ Class Participant Speaking ].
I'm going to go back to the modules home page to start from fresh. You guys will do a lot of interactive work this afternoon. You might want to do that yourselves, as well. Real quick.
Yeah, I don't think any of your computers have the module's address set up on it. If you go to the NCBI home page, on the left there's a link to education. It's on the first education page. You can also Google the title of it if you need to bring it up. Get that off of the screen. All right, is everybody kind of vaguely getting to where we want to be? You're not connected at all, either? Okay. I'll get started up here while this works. You can feel free more of this. We'll go over the format of one of these GenBank records. We've been showing you little pieces of various records. I think seeing a whole one and going through it step by step, being familiar with the fields and the data that goes into them, will be a good orientation for us. The other thing is the NCBI site map. If you come away with nothing else, at some point take a look at the NCBI site map. That's how they try to structure the information that they have available. This GenBank sample record that we will go through is an entry in their site map. If you wonder how we got to this, you can find it from NCBI's site map. I will
So, this is NCBI's homepage, and on the left-hand side, do you see that it says " Site map"?
So there's all kinds of different resource tools here, let me just maximize this for a minute. And what I am actually looking for is where they hid the sample record now. Let's go to the alphabetical list, and there's the GenBank sample record in that list. So for some reason, today --
[Class Participant Speaking].
Yeah, I think the alphabetical list is easier, but you have to remember it's a GenBank record as opposed to a sample record or something else.
This module is very straightforward, and because I'm showing the module to the audience, I'll be jumping back and forth between the module window and the sample record window. I think it's better for all of you, if you have a connection, to go ahead and open the live record here and just work with that window, and the module's window will be up here, so if you want to check up and see if I'm saying what's actually on the slide, you could certainly do that, but mostly just have the record in front of you, because that's really what we're going to use as a tool, and I think if I keep moving back and forth between the two things I'll make you a little bit crazy.
So basically, there are really three sections to the record, and these groupings, I think, are going to be a little bit deceptive, because even though they're calling them the Header and the Descriptors and the Sequence Data, the record you're looking at on your machine doesn't really call them that. So there's no "search the header" kind of section. There's searching the title or all fields, for example, and the title is really this definition line that we're going to focus on here. Let me go ahead and open the record. So we're going to go through all of this detail, and I think David had mentioned earlier the naming of these records; we're going to show you that they're based on what the scientists put in when they made their submission. We'll show you an archived record, which is just literally reflecting what the scientists that posted to GenBank put into the record, and so in terms of all of the value-adding things he was talking about that they do in Entrez Gene or in RefSeq, this first record we're focusing on isn't that kind of record. We'll get to showing one of those kinds of records a little bit later. So we'll go through the specifics of where it came from, what organism it came from, which scientific features the researchers did bother to annotate, because that's what they knew about at the time. But as people who work with controlled vocabulary and adding value and all this, you may be looking at some of these records and thinking they could have added more keywords that would have helped, or the keyword field might be completely blank, so you'll probably tap into what some of the information gaps in these records are. And then we'll have the sequence itself, and we'll have a look at that.
You notice here that it says in the archival database, all of the sequences should be experimental sequences, not the kind of hypothetical or predicted sequences that you all had a little bit of a discussion of earlier, when you were talking about RefSeq and you were talking about provisional vs. predicted. Does everybody remember back to this morning a little bit? All of these are archival pieces where the scientist submitted the experimental sequence. These are not the predicted things that we were talking about.
So, let's just go ahead and look at this record, and we're going to talk our way through this record, but this also happens to be the record that is that GenBank sample record, so if you forget what anything is that we've talked about, if you actually go to view that sample record, what you'll see is that all of these things will have little highlighted things that you click on, and it will explain what it is that you're looking at in the sample record; okay?
So don't worry about all of this up here. What we're going to do later this afternoon, when we go over how you search Entrez Gene and how you move back and forth between the various databases, we'll talk about all of these different search features and pull-downs and what these things do, so for the moment, leave that aside and just focus on the piece of data that's in front of you in terms of the record.
So, let's talk about this first line. Can anybody identify anything up here that makes sense to them, based on what you've learned so far this morning?
Yes?
[Class Participant Speaking].
The one you're looking at has how many?
5028 base pairs, okay, but what kind of molecule is it that we're looking at here?
DNA.
Right.
Is there anything else that we can can tell from up here?
Yeah, it has a date, but we're going to talk a lot about dates. Dates are something that we have to be very careful about. A lot of people are submitting to this either in relationship to, as David was talking about, publication, or in some cases, depositing before or concurrent with their patent-related work, so NCBI is very sensitive about how it talks about dates, so we'll come back and talk about dates much more.
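To make the pieces of that first line concrete, here is a small Python sketch that splits a GenBank-style LOCUS line into named fields. The example line is modeled on the flat-file layout and its specific values are assumptions for illustration, not copied from the live record. The parser is deliberately simplified: it assumes whitespace-separated fields with the length and "bp" in the third and fourth positions and the division and date as the last two.

```python
def parse_locus_line(line):
    """Split a GenBank-style LOCUS line into named fields (simplified layout)."""
    parts = line.split()
    if len(parts) < 7 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("not a LOCUS line in the simplified layout assumed here")
    return {
        "locus_name": parts[1],
        "length_bp": int(parts[2]),
        "molecule_type": parts[4],
        # Taking the last two fields tolerates an extra topology field in between.
        "division": parts[-2],
        "date": parts[-1],
    }


# Illustrative line; the values are assumptions, not a quotation of the record.
example = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
fields = parse_locus_line(example)
for name, value in fields.items():
    print(f"{name}: {value}")
```

The length, the molecule type, and the date are the same three things the class picks out of the first line by eye.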
Okay, so you've got basic information there. This definition line up here, this is what they're talking about when they talk about the title field, so if they talk about limiting your search to the title field, most of the time they're talking about searching this definition line, okay? And a lot of what you'll see in the definition line, actually all of what you see in the definition line, is completely unstandardized. I do think there's an effort to try to be more standardized. Let me not say that. Let me say that there's no absolute requirement that it be presented in a certain way. I do think there are certain conventions of behavior that people definitely try to follow. This is not abnormal compared to what you would see in the rest of the database, but there's nothing that says, for example, that you have to have the short name of the gene in parentheses after a different name of the gene, as you're seeing here. So what you have here is the organism and some information about the gene, and different genes within it, and as I said, they give you the longer name and the short name, and they said for some of these you have partial coding regions and for some of these you have complete coding regions; okay?
So some of this is a repeat, so you see this is the organism, but then you also have an organism field, and in this case, you see that happens to be the same name. This confuses people quite a bit, and I will say something about searching real quick, but we'll come back to this. People keyword search, you know, they say whatever gene, and human, all the time, and they will think that means they're searching and limiting to the human organism. But many times they will have a gene where, you know, it's named for a mouse gene, but the organism that the actual sample was from is completely different. So we'll show you some examples where you can search on colon cancer, human, but you come up with mouse things, and it's only because in this definition line, part of the naming is that it was a human gene, but the record happens to be from the mouse, and so the mouse gene name has the word human in it, and vice versa. So there's definitely a reason to limit your searches by organism if what you really want is the rat gene or the human gene or whatever your focus is.
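The difference between a keyword hit on the definition line and a true organism limit can be sketched with a toy example in Python. The two records and their field names are hypothetical and heavily simplified; this only mimics the behavior described above and does not call any NCBI service.

```python
# Hypothetical, simplified records holding just the two fields discussed above.
records = [
    {"definition": "Human colon cancer associated gene mRNA, complete cds",
     "organism": "Homo sapiens"},
    {"definition": "Mus musculus homolog of human colon cancer gene, partial cds",
     "organism": "Mus musculus"},
]


def keyword_search(recs, word):
    """Match the word anywhere in the definition line, ignoring case."""
    return [r for r in recs if word.lower() in r["definition"].lower()]


def organism_limited_search(recs, word, organism):
    """Match the word in the definition, but keep only records from one organism."""
    return [r for r in keyword_search(recs, word) if r["organism"] == organism]


# The plain keyword "human" also pulls in the mouse record, because the word
# appears in the mouse record's definition line.
print(len(keyword_search(records, "human")))

# Limiting by the organism field keeps only the record actually from human.
for hit in organism_limited_search(records, "colon cancer", "Homo sapiens"):
    print(hit["definition"])
```

The keyword search returns both records while the organism-limited search returns one, which is why the organism limit matters when you want only the human gene.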
Organism links to NCBI's taxonomy. Have any of you used NCBI's taxonomy before? Okay, a few of you. So it represents only the taxonomy for organisms that are actually in GenBank. There are many, many organisms in GenBank but certainly not everything, so it's not a complete taxonomy, and those of you who do a lot of biology reference probably know of other, kind of more complete tools, but for the ones that are in here, it's very connected with all of the other resources. In fact, since most of you haven't seen it, we'll just pop over for one second.
So this is the taxonomy browser and we're not going to go through it now but I just wanted you to be aware that that's what it looked like and it's not a complete taxonomy. I think we'll see a little bit more of the taxonomy when we do the BLAST links tomorrow. I think there's a little bit of the taxonomy connection there.
So this is all of the background information about what you have. You notice the keyword field is completely blank, so they didn't use all of the opportunities they could have used to tell you more about what techniques they used or if they had any other retrieval help ideas for you. We do have multiple versions of this record, and we'll talk about version control in a little while. Once a scientist submits to GenBank, they can always come back in and make changes and additions and corrections to their record, so if they later on publish another paper they want to link up, or if they sequence a little further along than they did before, that's going to be a different kind of addition, so there's lots of changes they could make to the record. So somebody could come back and look at this and go, well, I feel really guilty I didn't put keywords in the first time. Let me put keywords in this time, and they would be able to do it. Everybody else can't do that for them. If somebody other than the initial person submitting it did it, that would be one of the third party annotations that I mentioned this morning that we aren't going to talk about.
Okay, so this submission has several references in the literature, and you see each of the references relates to the whole length of the sequence. Sometimes what you'll have is a reference where the reference was only talking about the first part of the sequence, and then later on they had one that dealt with more of it, but in this case, it was the whole set that Jennifer, right? The whole set of base pairs that Jennifer mentioned is the subject of the papers involved. And these are all hard links, so these links of course will take you out to PubMed, to the references for the article, but if you were in PubMed looking at the reference for the article, all of them would have hard links back to the nucleotide database so that you could see this record, so that's that one-to-one linking.
In terms of how this got into the database to begin with, you can see that it was a direct submission from Terry Roemer's lab at Yale, and you notice that they've forced that direct submission into the title, journal, author kind of framework even though it wasn't published at that time. And this makes sense, because most people have to submit it before they publish it, so they don't originally have the paper citation, and then once it's published they go back in and say that it was published in Science and I'm really proud and here is the citation and it's all good. Okay?
So, this is kind of all of that background information, and then you get to the more [INAUDIBLE], and this feature area could be really rich, depending on how much work the scientist has done on this sequence, or it can be fairly plain. Now, here, remember we said this was a record that had multiple genes in it, because we knew from the definition line there were several genes involved? Everybody with me on remembering that? So now what you see is the information about the different genes, so here is an example: out of that whole range from 1 to 5028, between 687 and 3158 is this gene. Does everybody see that? So this morning, when your resources kept offering you that to and from number, those to and from boxes, does everybody remember that, where you see to and from? So if I was working with that gene and I knew it was 687 to 3158, but I didn't want to try and copy and paste out just the part from 687 to 3158, I could have pasted in the whole thing and said only look at the part from this place to this place. In this case you already know what this is, so you wouldn't be doing this, but I'm just using that as a to and from example. And later on, when we start working with Entrez, you'll see that you have a range box up here from begin to end, so if you were saying, look, I only want you to look for things that look like that piece of it, you could type in the range of it so that you wouldn't have to deal with the part that you don't want. Okay? And there's many, many strategies to get at that. That's just one of the ways you could do it, okay?
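The "from" and "to" idea is just a 1-based, inclusive slice of the full sequence. Here is a minimal Python sketch; the function name and the toy sequence are hypothetical, and the only thing borrowed from the discussion is the pair of coordinates 687 and 3158.

```python
def subsequence(sequence, start, end):
    """Return bases start..end using 1-based, inclusive coordinates."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("range must satisfy 1 <= start <= end <= sequence length")
    # Python slices are 0-based and exclude the end, hence start - 1 and end.
    return sequence[start - 1:end]


toy = "AACCGGTTAC"  # hypothetical 10-base sequence
print(subsequence(toy, 3, 6))  # prints CCGG

# A feature spanning 687..3158 covers 3158 - 687 + 1 bases.
placeholder = "N" * 5028  # stand-in for a 5028-base record, not real data
print(len(subsequence(placeholder, 687, 3158)))  # prints 2472
```

The off-by-one adjustment is the main thing to watch: feature coordinates count from 1 and include both ends, while most programming languages count from 0.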
The other thing that these do, you see all of these little parts that start with the slash? These are all NCBI's way of labeling these features and this will become important to your searching later because things like mol_type and all of that, when you look in the Preview/Index, how many of you have used Preview/Index in PubMed? When you start to look at the names of things and how they label them, it will be helpful to kind of cross reference between what they're talking about, so you just kind of start looking at how they talk about things. What else do we know about this from the information and features? I'll give you a minute or two to look at it and see what else jumps out at you as what we know about this from this record. Great. And what else do we have, what are we looking at here? We're looking at a translation, so when we jump down you're going to see that we have the original, and there's quite a lot of it, here is our 5,000 base pairs of A's and C's and T's and G's we were expecting because we're in the nucleotide database and we know that this is DNA because the definition line said DNA, everybody remembers that? So, you would expect to see all of this data, but what you also have is information about the protein translations, you have links for the protein record, do you see where it says protein ID, and then you have the actual translation.
Here is the other gene. Remember we said there was also the REV7 gene, the second gene that was in parentheses, so we have that here as well and you can see that that is actually a smaller section towards the end of this whole string of sequence. Okay, is everybody comfortable so far with what you're seeing? Yeah?
[Class Participant Speaking].
Okay, that's a great point, so Cara was asking what does it mean when it says "Complement" and this goes back to David talking about or showing earlier about the complementary DNA, and so I think this has to do with the gene being on the reverse strand, so I think the previous one was on, yeah, the first one was not on the complementary strand, and this one is. And right now, it's not such a big deal as we go through the record but you see up here at the top, you can reverse complement the strand and you can actually, if you carry this into some of the other displays, you can see which gene is on which strand and which gene is on the other strand. Do you want to add anything to that?
It can be both. Either they don't know what the rest of it is or the rest of it could be, you know how he's talking about introns and exons? Like these genes are exons but what's on either side of it, it could be a lot of different things.
[Class Participant Speaking].
That's true too, right. So I'm just going to repeat what he was saying, that since we're looking at DNA and not at messenger RNA, we have extra sequence which could be introns or it could be untranslated, but that's a good question. And so okay, so one of the other things that I wanted to point out that came up this morning, we were talking about pasting in lines of sequences to things, and about whether people should just be copying and pasting and how your sequence was wrapping, and you see that in the default GenBank display, they try to number these lines of sequence to give you a ballpark if you're trying to eyeball where did that 68 start, you can kind of start with 61 and kind of look over. But if you were actually copying and pasting out chunks of sequence and you copied and pasted this and left the numbers in there, you would be causing problems with the tools, so what you want to do is you want to actually take it in the FASTA format and you'll learn more about FASTA tomorrow when we do BLAST, but basically what this does is it takes out all of the numbers, and don't worry about capitalizing versus not, it's going to make no difference for this, but you get two things. Does everybody see that little caret at the beginning? That is basically saying this is a definition line, and it just ignores it: because the caret is there, it knows not to treat it as part of a sequence. It's like HTML, when you put something in the brackets it knows that it's a tag and it reads it a certain way, so that thing we did this morning that said "Anonymous", it said that in terms of your finder results because there was no definition line, you just gave it a piece of unnamed sequence and said "Here you go".
Had you had a definition line that said sample fly sequence and you started it with the little caret, then this would come up and say "Sample fly sequence", so this happened to me a lot doing reference, as people would come to meet with me and bring their sequences and we would start putting them in and comparing them or searching them and they wouldn't have definition lines on them and I'd be trying to keep track of was that file, you know, sequence 1.txt or sequence 2.txt. So encourage people, if they bring you something that doesn't have a definition line in it, have them at least throw in a caret with something so that when you get your results back you see, you know, whatever they want you to call it. It doesn't matter what they call it as long as they give it some kind of name so you don't get yourself too mixed up, does that make sense to everybody? It may not be so bad when you're comparing two files against each other but when you start to do like multiple file comparison, if they don't have definition lines you can get very confused very quickly. And you can see, it could be anything in any format. You could type in there, you know, my dog's gene sequence and it would accept it. It doesn't care what you call it.
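A minimal sketch of what going from the numbered display to FASTA amounts to, assuming the input is the numbered, space-separated sequence lines and that the definition line is one line starting with the ">" character (the mark called the caret above). The function name `origin_to_fasta`, the 60-letter wrap width, and the toy letters are illustrative assumptions, not NCBI's own conversion.

```python
def origin_to_fasta(numbered_lines, name):
    """Drop position numbers and spaces, then put a named definition line on top."""
    letters = []
    for line in numbered_lines:
        # Keep only the sequence letters; digits and blanks are display aids.
        letters.extend(ch for ch in line if ch.isalpha())
    sequence = "".join(letters)
    # Wrap to 60 letters per line (a common choice, assumed here).
    wrapped = [sequence[i:i + 60] for i in range(0, len(sequence), 60)]
    return "\n".join([">" + name] + wrapped)


# Toy lines in the style of the numbered display (made-up letters).
numbered = [
    "        1 acgtacgtac ggttaaccgg",
    "       21 ttgacatgag",
]
print(origin_to_fasta(numbered, "sample fly sequence"))
```

With the name on the first line, results can be traced back to "sample fly sequence" instead of an anonymous piece of sequence.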
And so now, you could copy and paste all of this out or you could use the send to text file or clipboard functionality that you're used to using with PubMed. Also again if you didn't want to take all of it but you only wanted to take the complementary part from 3300 to 4037 that Marie was talking about, you'd say take this part to this part and take it from the complementary side, not from that side. Does that make sense to everybody? So you can tell it to take only what you want or you can take the whole thing and deal with it that way. Is everybody with me on that? Okay. Anything else about this record that we want to look at?
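To make "take this part to this part, from the complementary side" concrete, here is a small sketch under the assumption that the positions are counted on the displayed strand, from 1, both ends included, and that the piece is then reverse complemented. The names `reverse_complement` and `complement_range` and the toy sequence are invented for illustration.

```python
# Pairing rules for DNA letters, upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Swap each base for its partner, then read the result backwards."""
    return sequence.translate(COMPLEMENT)[::-1]


def complement_range(sequence: str, start: int, end: int) -> str:
    """Cut positions start..end (counted from 1) and return the other strand."""
    return reverse_complement(sequence[start - 1:end])


toy = "ATGCCGTA"  # made-up letters, not from the record
print(complement_range(toy, 2, 5))  # the piece TGCC read from the other strand
```

The cut happens first on the strand as displayed, and only then is the piece flipped, which is why the same two numbers can describe a gene on either strand.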
Let me show you one more thing. Like all things in Entrez, the links menu is going to be extremely rich, and one of the things that I think has been kind of interesting, this record actually doesn't have a very rich bed of things by comparison, but you know, it used to be that they would list out all of the things they had links to, but as there became more and more different databases, suddenly they all collapsed and there's that tiny little thing that says "Links", but I've run into a lot of people that have never really noticed that link that says "Links" or never pull it down. So to be fair, everything that's in here under links is also in the body of the work, so here, let's just go through that real quick. Because this will maybe orient you. So where it says in links gene, where do you think that's taking you? To Entrez Gene, and within the body of this record where you have genes, there should be built-in hard links to gene. Full text in PubMed Central. Well, the question would be that you had multiple papers in here, right? So you have PubMed and PubMed Central but remember you have multiple PubMed papers here. You have two papers from PubMed, does everybody see that? So you could go one by one and click on these individual IDs and go to one paper, then the other paper, or hopefully, and let's just see if this works, if you were over here in links and chose PubMed, how many should it give you? Let's see, does it work? Yes! Okay, something is right with my world here! And you see that one of these is free in PubMed Central and that's why you have that.
So that's the kind of relationship you should be having. These should all be, where it's a direct link to a database like protein or like taxonomy, it should be the equivalent of however many matches there are, so there's only one organism so taxonomy should take you to that one organism link. So beyond the hard links, then you get into the related sequences and that's no longer a direct hard link one-to-one. That's an algorithm that takes this sequence and uses the BLAST algorithm that we're going to talk about tomorrow to show sequences that are related up to a certain cutoff score, and then we'll talk about that a little more this afternoon when we do all of the different Entrez searching, because some people like to do that kind of discovery where they use related articles or related sequences and some people will not touch that with a 10 foot pole because they don't understand enough about the algorithm or what the cutoff values are so they don't trust it, or why did it pick up this and not this, so I think there's two camps there and there will definitely be the camp that is like, I prefer to go out and do my BLAST search and set my parameters. This is designed to be kind of a pre-calculated quick kind of thing. Think of it like PubMed's related articles but working on the sequence level, not on the word in the record level. Okay?
[Class Participant Speaking].
That's a good question. Let me actually go to the GenBank display. Let me go back. I think that the difference will be up here in the reporting line but let me see. No, it's the same. When we go on break, I'll have to explore that. I don't, on the face of it, see anything that looks very different. Does anybody else know?
[Class Participant Speaking]. It would make sense. A lot of these, you see it has the features jump to and the sequence jump to? These are to allow you to navigate through the record so you can jump down to the features table or the features key as they call it, or somebody who wants to jump right to sequence, and remember this is only based on two or three papers, right? Or two papers and the direct submission. If you have a record where they published a lot of references, the gene reference and functions that David was talking about this morning, that middle section could be really, really long, or that top section could be really long, and so somebody who just wants sequence probably does want to be able to jump all the way down to that, so that's a good point, yeah, so Kevin, we'll make sure we do that in the full record.
The other thing that is kind of hard to see is, does everybody see up here between the hyperlinked number and the kind of brief version of the display, that there's this little reports link? That's relatively new and if you click on that, you get two things. One is that you get the option to do a lot of the same display features. In fact I think 99% of that is duplicate to the display options. The thing that's not is the revision history, and so when you guys go to do your sample exercises for this module, there's a question about the revision history of something, and so keep this revision history function under reports in mind, but also when you go to do that, don't get overwhelmed by it because as I said NCBI has very particular ways of dealing with dates and they don't actually expect you to ever kind of come up with what the initial date of submission to NCBI was. They like people to e-mail them and they will check it for them because it's usually related to some kind of legal inquiry, so the database is not reliable in terms of that first date. Let me just show you for this one. I think it will make more sense if I show you. You don't mind skipping around a little bit, do you? Okay.
So you see here that it says this was first seen at NCBI on May 2, 1996, at 12:28. If somebody comes to you and says, " When was this deposited with NCBI"? Don't give them this date. Say you need to e-mail the information at NCBI and ask them and they will give them the legally approved absolute accurate date and time. This is a database generated thing, and so when it actually went out in the database world, it may not be actually when the time that they received the initial e-mail submission, you know, so this is not a legal answer. If you want the legal answer, you should go to NCBI for it and they will turn it around very quickly. They just say, for those of you doing commercial implications of the work, don't rely on this.
[Class Participant Speaking].
I think that's why they, I think they're also very concerned about that, but the way that the system used to work with people, like e-mails coming and people updating various pieces of it and how it got in and when it got checked and all of that, I think they're very sensitive to the fact that they can't present that data that way and so they don't want people to rely on it, but it's a very good point to say that this is not your typical way to expect databases to behave.
[Class Participant Speaking].
Okay, so, the current version of the record that has been most recently updated is the version that's live and displaying in the database. I mean, dead is probably not the right word. It's just when you do a search of this database, it should only retrieve the live records. If you want to retrieve the dead records, you have to get them from the revision history, so it's like saying it doesn't want to show somebody who is naive the superseded record. It's only going to show a naive user who comes to do a search whatever version of the record is the most current record, that is the live record. Maybe obsolete would be a better term than that, or superseded or something, I agree, but it means what we suspect it means.
And usually what happens if you go to one of these, it should tell you, what I'm looking for is where does it tell you that it's actually the superseded record? It should have told us?
[Class Participant Speaking].
That's what I'm saying! You can also tell it to show you the difference between them, so it shows you what changed between the two revisions. So this can be helpful if you're trying to figure out if what has changed is a major change versus just some small thing, so like here, they used to have MEDLINE IDs but now they have PubMed IDs, makes sense to everybody, right? Things like that. Any questions about this particular record, this very simple record? Good!
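A small sketch of the "show me what changed between two revisions" idea, using Python's standard difflib rather than anything NCBI provides. The two revisions below are invented stand-ins with a placeholder ID; the point is only that an identifier relabel shows up as one removed line and one added line, which is easy to tell apart from a change to the sequence itself.

```python
import difflib

# Invented stand-ins for two revisions of one record (placeholder ID, not real data).
revision_1 = ["REFERENCE   1", "  MEDLINE   00000000"]
revision_2 = ["REFERENCE   1", "  PUBMED    00000000"]

diff = list(
    difflib.unified_diff(
        revision_1, revision_2,
        fromfile="revision 1", tofile="revision 2", lineterm="",
    )
)

# Keep only the changed lines, skipping the two file header lines.
changed = [
    line for line in diff
    if line[:1] in "+-" and not line.startswith(("+++", "---"))
]
for line in changed:
    print(line)
```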
So I'm just going to open the GenBank sample record that we talked about for a minute, so you can see how it's set up so that if you have any questions about what we talked about before, you can come back to it. Okay, so this is the sample record that we got from the site map, the alphabetical site map. Does everybody kind of vaguely remember how we got to it? Everything that's in blue here has been hyperlinked with explanations, so if you're like well what is this SCU49845, it will jump you down to an explanation of what that meant, okay? Or the sequence length, you guys were really good. You already know that it was 5,000 base pairs, but if you wanted to know more about the sequence length, you could jump down to that part of the record.
The main part of the record that I find myself frequently needing help with, and this record just doesn't have a lot of them, is these features, so sometimes they will have a feature in here where I really won't understand the biology, like what this is or what they're talking about, and this is just me. You know, people like David reading these records may go, Oh, I know exactly what this feature is they're talking about. And some of them probably seem more familiar after today's lecture in terms of where the start codon is for example, or what function a particular gene has, but there's been other things in this where I've been like, okay, what is that? And there is a guide that explains, so like for example, this little reverse caret, earlier I was showing that they meant this line is a definition, don't read that. Well, these mean something else in that part of the record, and you can see the explanation for them. So sometimes the same symbol might have different meanings depending on where you are in the database, and so going back to Marie's question about the complement, if you didn't understand what it was saying about complement, you could come here and read it and it will explain, Oh, it's on the reverse side, it's going the opposite direction, here is how that's working. So I highly encourage you, before you go to start teaching, if you're going to do a workshop on this, you know, pick a sample record that you're going to teach with and step through and make sure you really understand everything that it says, because that's the question that somebody is going to ask you, and don't get me wrong, I like teaching where you run with what the audience asks you to search on. In PubMed, I'm always game for that, like tell me what you're looking for and we'll search on it live? Don't do that with this your first time!
{LAUGHTER}.
Pick a record where you know the answer to everything so that you feel really confident, like pick something that's clinically relevant or scientifically relevant to the research done at your place and know what it is, like know it cold, so that not that you can't not know something and still have good vibes of confidence with your scientists but it really helped in terms of getting up for my first class to be like, I know this record inside and out. I know what every one of these features is. I know what that organism is. I can pronounce what that organism is. You know, pick something that you can feel really comfortable about and then let the other stuff kind of fall from that.
The sample record is a really good way to do that. I just don't teach with this sample record because this sample record is so unrelated to the work that happens with my people, that they would be like why are you showing me that? That's where you pick something where everybody is like yeah, we're really into learning about whatever this is so it's more exciting.
So we're going to finish going through the other records and then take a break when they're setting up all of your cookies and things like that, so we'll probably go for another 10 minutes and have you have time to do the sample exercises and then we'll take a short break.
Okay, so we had looked at the nucleotide record for primary or archival sequence that was deposited by a scientist and so what we're going to do now is look at a protein record that falls into that same archival situation.
Now, this is the static record that was captured at the time that this was done, and I'm going to just walk you through a few things here but frankly you could go live, you could just follow the link to go live in Entrez, or however you want to do that. Okay, so what's different, you know, given that we just familiarized ourselves with the DNA record, what's different here that we're looking at in this protein record?
[Class Participant Speaking].
Yeah, so your length is in amino acids. A lot of these do begin with A, but the lettering system, it's all off now, when they started out, I don't think they anticipated how big these databases were going to get so they started out with like lettering groups that made sense biologically and everything else but now they're just out past that and like whatever new combination of numbers and letters that has not yet been used is all good.
Okay, so do we have multiple versions of this record? Yeah, okay, and we don't know yet like what the changes were but we do have something.
[Class Participant Speaking].
Wait, you have two things. One is a version number with a .1 which means the first new version, so that's the main way you could tell, I mean, you'll definitely see records where you'll have like version with a point, there's a .1, .3, .4, depending on how often the people like to touch their records, you might see quite a bit more of this. Did we get any helpful keywords from this scientist? No. Okay, what organism is this?
[Class Participant Speaking].
This is one of the ones where they're kind to us and you could type in human in the organism field and it will map to Homo sapiens.
Okay, so here, when we talk about this, you notice that they're talking about residue as opposed to amino acids in the body of the paper. I don't remember if we talked about residue in the protein section this morning but you'll see that, sometimes you'll see residue instead of amino acids.
Yes?
[Class Participant Speaking].
It's probably good she has the cell phone, {LAUGHTER}.
There we go. Okay, so, where were we? All right, so it's then talking about the whole length of the protein sequence, because it's talking about the whole set of residues. We have a published article in what prestigious journal?
Nature.
Yes, these people, and they also have the direct submission data from the initial lab, and this is a little interesting. I actually am not sure what they mean in the comment. The comment field is another free form field that they can use, and this has a comment that says, "Method conceptual translation". And so, you know, we were talking earlier about what was experimental and what was not, and all of that, and that would be one of those things that I'd probably go back and say let me make sure I understand exactly what he means when he calls this record a conceptual translation, just to be sure.
So, here is an example though where you see a couple of things. So we have the product of this. This is the human MLH1 gene and remember we said these are archival, right? Does anybody remember when you did gene with David this morning? I don't think you actually looked at this gene yet, but do you think the name of the gene was H for human MLH1? Some say yes, some say no. It depends on how far ahead you read in all of the modules before you got here, {LAUGHTER}.
So the actual name of this gene now doesn't have the H in front of it. It just has the MLH1 and a lot of the materials you have in your thing, but they aren't going to go back and make that change, so just because the gene might have been called this, that or the other, this is archival so whatever the person submitted is what they're going to call it. This happens with, you know, a lot of other things. They didn't know what it was so they called it this and once they figured out what function it had they wanted to give the gene a name that more related to the function that it had, so there's a gene I always use that relates to Alzheimer's disease and that makes sense to people but originally it was called the S182 protein, and so some of the records say S182 protein, and so that happens, because this is archival, nobody is tying together all of the name changes. If you wanted something that tied together all of the name changes where would you go look?
EntrezGene.
All right! So we're getting like that teaching point across very well here. So here is an example of what I was talking about before. This particular gene was in the human organism, right?
Yes.
But this note says that it's a homolog of the yeast gene, so somebody who was keyword searching on yeast, not in the organism field, but just keywording yeast would get this record, because it does say yeast in the record, even though it's not yeast the organism, it's just a homolog to the yeast gene, so you see the difference between just the regular, and when I say keyword searching I don't mean searching in that keyword field which everybody is leaving blank. I'm talking about text word searching, kind of non-field specific searching. I'll try to say text word instead of keyword so I don't get us confused, but if I do you'll know what I mean. Yeah?
[Class Participant Speaking].
Okay, in PubMed it would be different because they don't actually have an organism field. PubMed would have the MeSH term for that organism, so you would search that organism in MeSH but you wouldn't know whether the paper was about that organism or about findings within that organism. I don't think you can search that way in PubMed. What you would do is I think you'd do the PubMed search and then follow out the link and then apply the organism field that you wanted to that subset of results that you linked over from PubMed, and we can do that when we go into Entrez Gene searching later this afternoon. If a lot of you think your approach to it is going to be from PubMed to the sequence databases, we can do that. In fact the first class I used to teach at Cornell, and Kevin I don't know, but the first class we taught was all about if you started in PubMed, where would you go to all of these databases from PubMed, like what were the hard links, related links, but things like you're saying, what would you find in PubMed and how would you transfer that into searching other databases, and there's reasons you would want to do that, and there are two reasons we always gave about why you would want to do that.
One is that if you needed technique information, that is you needed to find sequences that had been identified by particular techniques, a lot of the scientists are not putting their technique information in these records, but in the PubMed indexing of the article that says how they got this information, they probably describe at the PubMed article level the techniques and methods and materials that they used to do it, so the indexers would probably cover that they had done this work by, you know, PCR like they keep talking about today or by restriction fragment length. The technique information would be in the article so the indexers might pick it up, so you would search in PubMed related to the technique and then follow the links out to the sequence data that actually came as a result of that technique, so that's one reason you might want to use PubMed. The other is that when you get to things like human colon cancer that we're working on, you don't have MeSH here. There is no MeSH, so if I type in colon cancer it's literally searching for the words colon cancer, so if everybody is writing about colonic neoplasm, they aren't, but if they were, or if you were using colon adenoma or something, you're not going to get those records unless they actually use those text words, so that's another reason maybe to go out and search with MeSH, you know, use a very broad thing, use your synonyms, use your MeSH and then say show me the records that came up with that search that have nucleotide records or [INAUDIBLE] records or whatever kind of sequence you're actually looking for. So it's a good conversation to have and it's a good strategy. If you're an expert PubMed searcher and feeling a little bit like well, how do I go from PubMed to here, I think it's a very safe strategy.
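A toy sketch of why literal text word searching misses synonyms, assuming a search that only matches the exact words typed. The three records and the hand-written synonym list are invented for illustration; the list stands in for the kind of mapping MeSH provides in PubMed and is not taken from MeSH itself.

```python
def literal_search(records, phrase):
    """Return IDs of records whose text contains the phrase exactly as typed."""
    phrase = phrase.lower()
    return [rid for rid, text in records.items() if phrase in text.lower()]


def synonym_search(records, phrases):
    """Return IDs of records matching any phrase in a hand-written synonym list."""
    return [
        rid for rid, text in records.items()
        if any(p.lower() in text.lower() for p in phrases)
    ]


# Invented record descriptions (not real database entries).
records = {
    "toy1": "human colon cancer gene, complete cds",
    "toy2": "colonic neoplasm associated protein",
    "toy3": "colon adenoma marker sequence",
}

print(literal_search(records, "colon cancer"))
print(synonym_search(records, ["colon cancer", "colonic neoplasm", "colon adenoma"]))
```

The literal search finds only the record that uses those exact words, which is the argument for starting broad with synonyms in PubMed and then following the links out to the sequence records.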
Any other questions about that? Okay. All right, well let's just finish this up. Remember that wealth of protein databases that you were introduced to this morning, all of the Swiss-Prot and everything else? We're talking about protein records now and many of them as we've mentioned earlier were brought in from these other databases and so you can see that they give you the numbers in those other databases.
And then the same thing with the sequence again. And this time, it's not A's, C's, T's and G's so much, it's your various amino acid groups. These are all single letter by the way. You have the three letter groups but there's also single letter groups and these are the single letter groups. Okay?
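For reference, the standard three letter and single letter codes for the 20 common amino acids can be written as a lookup table. The helper name `to_single_letter` and the hyphen-separated input style are assumptions made for this sketch.

```python
# Standard three letter to single letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_single_letter(three_letter_sequence: str) -> str:
    """Convert a hyphen-separated three letter sequence to single letters."""
    return "".join(THREE_TO_ONE[code] for code in three_letter_sequence.split("-"))


print(to_single_letter("Met-Ser-Phe"))  # MSF
```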
So if we take this record live, and see if there was anything different --
[Class Participant Speaking].
You mean about these chunks?
Yeah.
That's a good question. The question was, yeah. That's a good question. I don't know why, the question was why does the default GenBank display, whether it's for nucleotides or for proteins, why does it chunk them the way that it chunks them, and I don't actually know the answer to that because you saw when we go to get the FASTA sequence we'll use for other purposes it doesn't chunk them that way at all. It's just a pure thing. Does anybody know?
[Class Participant Speaking].
Oh. So it's for a very practical, so it's about counting off in groups of 10, so it's not like there's any biological reason that it's 10. It's a human factors issue, okay. Excellent. Did you have a comment about that or you knew that's what it was?
Yeah.
[Class Participant Speaking].
That's just the first numbering, that's so that you can count them all together, so we said this is from one to what was it? 761? Yeah, so it's just to count it off, so the one is just saying they started from this number. That's completely arbitrary. Where this actually lies on the whole chromosome, you know, it could be the 60 thousandth thing down on the list. It's only within the group that they're counting that it's a one that is set at the beginning. That's a good question. Okay? Yeah?
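A sketch of the counting-off layout just described: groups of 10 with a number at the start of each line giving the position of that line's first letter, counted from 1 within the record. The 60 letters per line and the 9-character number column are assumptions chosen to look like the default display, and `numbered_display` is a made-up name.

```python
def numbered_display(sequence: str, per_line: int = 60, group: int = 10):
    """Lay a sequence out in counted groups, each line led by its start position."""
    lines = []
    for offset in range(0, len(sequence), per_line):
        row = sequence[offset:offset + per_line]
        groups = [row[i:i + group] for i in range(0, len(row), group)]
        # The position is relative to this record only, starting at 1.
        lines.append(f"{offset + 1:>9} " + " ".join(groups))
    return lines


for line in numbered_display("a" * 75):  # placeholder letters
    print(line)
```

To find position 68, start at the line numbered 61 and count eight letters in, which is the eyeballing the numbers are there to support.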
[Class Participant Speaking].
That's a good question. Generally, they seem to try to link everything as much as they possibly can. I think you're right about the Swiss-Prot thing and generally this is mostly linking within NCBI although we'll start to look when we get into more links, for example, like in Entrez Gene, they link a ton of things and that's one of the things I love about Entrez Gene, is they try to take advantage of the wealth of other resources that David alluded to this morning, but I think this is mostly just a function of these things are all kind of defaults they're making cross references for. The note field could have nothing in it or it could have a lot of things and I think it's just a case they don't actively try to make hard links in the note fields. I might be wrong and we might find some other note fields where they've done it. I just don't think they really bother.
[Class Particip