Event ID: 967294
Event Started: 4/4/2008 8:30:29 AM ET


Please stand by for realtime transcript.

Captioner is waiting for entry into classroom for audio connection.

Captioner has now gained entry into classroom and is waiting for audio signal.

Good morning! No problem, I am ready to go whenever things are ready!

You're welcome.



Hi, tell me if you can hear anything. This is Arpita.



Okay, we can do it, quickly.

[ Class Participant Speaking ].

Yeah, we are.

Okay, a short review of the screen for Blast and the things that you might want to change when you get to it.

You can always get to Blast from the NCBI home page, right here. So let's go down to the nucleotide blast. Things that you definitely want to change, unless you have a specific reason not to, is which database you're searching in, so if you're searching for a human gene, it's fine to keep it that way but make sure that people know that that's what they're searching when they come in. You want to change highly similar sequences unless you're doing something like a gene only project and then you probably won't be using this software anyway. You want to change it to somewhat similar sequences, which is the standard blast. And under algorithm parameters you want to change the maximum target sequences to give yourself more opportunity to see things.

So those are the three things that I always tell people to change right up front. Make sure you're searching the correct database. Make sure you've got the correct program chosen. And give yourself enough room to see the things you want to see in the results.
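As a rough sketch of those three changes outside the web form, the snippet below builds the request parameters for a nucleotide search. It is an illustration, not something shown in class: the parameter names (CMD, PROGRAM, DATABASE, QUERY, HITLIST_SIZE, MEGABLAST) are the ones I recall from the NCBI BLAST URL API and should be checked against the current documentation, and the database name "nt", the value 500, and the function names are assumptions chosen for the example. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed endpoint for the BLAST URL API; verify before relying on it.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blastn_request_params(query, database="nt", max_targets=500):
    """Collect the three settings discussed above (hypothetical helper).

    1. database: say explicitly which database is searched.
    2. program: plain blastn ("somewhat similar sequences"); the MEGABLAST
       flag is deliberately left out so megablast is not requested.
    3. max_targets: raise the hit list size above the first hundred.
    """
    return {
        "CMD": "Put",
        "PROGRAM": "blastn",
        "DATABASE": database,
        "QUERY": query,
        "HITLIST_SIZE": max_targets,
    }


def blast_request_url(params):
    """Encode the parameters as a URL without contacting NCBI."""
    return BLAST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # "ACGTACGTACGT" is a placeholder sequence, not a real query.
    print(blast_request_url(blastn_request_params("ACGTACGTACGT")))
```

Keeping the parameters in one place makes it easy to show a class exactly which settings differ from the defaults.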

[ Class Participant Speaking ].

Well, you want to show more because you want to make sure that you're not missing significant matches to things, so if somebody has a particular domain that they're interested in finding, it may not come up in that first hundred hits but if you scroll down a little ways farther, you may see matches to that domain that you wouldn't see in that top hundred. There's so many sequences in the database now that that first hundred may only find exactly what you're already searching with, so it may be a hundred sequences of that particular gene from three different organisms but just below that you may have 20 different organisms or lots of variation in that gene, so you want to give yourself room to see, and then when you go back and reformat you'll also have more availability for looking at different things if you reformat those searches.

[ Class Participant Speaking ].

Most of the time, they will recognize what they might want, but the different Blast types, many of them will not know what megablast or discontiguous megablast is, and that's why they leave it as a default and just use it, because they have no idea what it is, why it's different than blastn or why discontiguous megablast is different than megablast, so those things you might have to explain a little bit. That's fairly simple. Actually, it's easy to show simply by switching how that changes the parameters.

You will definitely find more things with somewhat similar sequences than you would with megablast. We're making the search more stringent if you lower that expect threshold, yes.

Okay, any other questions? If we have time we'll show you some more examples from structure this afternoon so you can really see why there's an important relationship between structure and function.

Okay so now we'll go ahead and switch gears and go back to EntrezGene, so actually if you went ahead and went into Blast here, what you can always do is use the NCBI logo to go back to the NCBI homepage, and I'm just going to go in the search all databases and choose gene and hit "Go" with nothing in there to be taken over to the homepage of gene, but please go ahead and get to the homepage of gene now before we get started going through it and I'll switch over to the gene module.

So I think one of the important points to remember is although we're going to focus on EntrezGene, and talk about EntrezGene until that's certainly the take-home point you remember, information about genes is actually available in a lot of different NCBI resources, so when you start to look at things like the site map or the Resource guide, you're going to see lots of other resources that have information about genes and we've already looked at several of them that do, the protein reference, sequence reference, have information about genes and there's lots of different places that have it.

But EntrezGene is going to be where we're going to focus our energy and I'm going to take you back to the homepage there for a minute. Stop and take a look at what they're trying to lay out for you here. We've talked about in other Entrez databases it not being clear what the search options are and what you can do and what kind of data you'll find. Because this is intended to be kind of the entry point for everybody who is looking for gene information, they've tried to make this really explicit, so somebody who knows really nothing about Entrez or anything else, they're saying, look, you can find genes by doing free text, by searching the name of the gene, by searching where on the chromosome it is and they give you like the concrete examples of how to do all of that kind of searching, so in terms of working with undergraduates or working with busy clinical researchers, it's designed to be really transparent and straightforward, so even if you forget what all the searching options are, they're laying it out kind of here right for you.

I highly recommend that you sign up for the mailing list for gene. A lot of the NCBI resources have e-mail lists that are not discussion lists but simply dissemination lists, so when they're coming out, you know how we said like new things happen all the time and they come out with new versions and features and you're getting ready to teach and you open it up and you're like wow! That's new? Being on these lists is kind of your best defense against being surprised in that way, although let me just say they aren't a total defense against being surprised that way because some of the little changes that might throw you for a loop don't always get disseminated, but the big major changes, when Blast went to that new homepage design and everything else from its old pages, there were lots of announcements on the Blast announce e-mail list that said that was coming.

So if you're going to teach gene or use gene, you probably want to go ahead and sign up for the gene mailing list. That's over here. Okay? And that's something you can do really quickly and at least feel like you've done something towards keeping you up on this area and there's not very many messages so it's not like you'll be inundated with things either.

The help information is all across here and they try to follow this structure for all of the Entrez databases. I'm sure you're used to in PubMed seeing kind of all of the various help options up there. I think the most confusing thing is that when you have an about and an FAQ and a help and a handbook, it's like which one of the multiple help options should I actually use to try to answer my questions and that's where it gets a little sticky so as you start to develop your training materials, I think you'd kind of want to get a sense of that.

The handbook, you can scroll over to that for a second, is really just the chapter on EntrezGene in the NCBI handbook which is part of the books collection. Entrez has a collection of full textbooks and this is what this is. So you can see that this was created in 2005 and sometimes the handbooks seem to be updated and sometimes they don't seem to be as updated and they seem to be more static, so I tend to try kind of the other stuff first because it seems to be a little bit more updated or more kind of on the cutting edge but it depends on what you're looking for. Their help desk is very responsive so if you find anything weird in here, you should feel free to ask.

I mean, they're very clear. Look at this, there's a section on corrections, feedback. If your researchers are using this and they find something that is weird, or they know they've done research and a gene has an additional function that is not in the record, there's specific things for reporting that. That's nice. Sometimes you think it would be hard to suggest a correction. This is really, really clear about what should be done to correct the record.

Even for typical Entrez things, if people were like, what is the clipboard? Or what are these things? They have nice explanations. It's a nice approach to what will become a premier portal for their gene information. Is everybody in love with Gene so far? Great.

[ Class Participant Speaking ].

That's true. Most of what is in Entrez Gene is based on there being a reference sequence record. If you have done a search in RefSeq, the fact that what you find in RefSeq and Gene are equivalent is intentional. That's not surprising. That's why they now have that little box that shows up, that we said yesterday, that said that this information is in Gene too. Many will find the information in Gene more palatable. Selecting the pieces of it that you really want to focus in on and making the links, instead of a links pull down. They have the links built in a more visible, apparent way. Each record represents a single gene from a single organism. You can see there's a lot of organisms, like human, where there's tons and tons. There's others that only have a few genes in here. Certainly, it depends on the work flow of RefSeq. If you didn't find something in RefSeq the likelihood that you would find something here is small. If you start with Gene and you don't find anything, back out to Entrez nucleotide or protein. See if you get a RefSeq; if you don't, back your way to the archival records. Curating takes time. Some things are not through that process yet. Okay.

We're going to open a gene record. You will see all of this different information that is in a record live, as opposed to us just talking about it. Do they want us to do the MLH1? Probably. The gene is the MLH1. If you know your gene name you can just go ahead and search it. I typed MLH1. It's that DNA mismatch repair protein we've been working with for the last two days. Remember I said that you should probably only have one record for each gene? You can see we've retrieved 802 records. Scroll down and take a look. If I was really interested in the MLH1 gene in the mouse, and I didn't want to scroll down until I found the mouse record, I could just ask for it. Now I'm just getting the eight records that are mouse. You see we still have a couple of things going on besides. We dealt with the species issue. What are some of the issues we could be dealing with in our search here?

[ Class Participant Speaking ].

Yeah. The homologs on the different chromosomes, that's true. What we're seeing up here are the actual names of the gene. What else are we getting besides MLH1? If you scroll down further on your screens.

[ Class Participant Speaking ].

We're also getting different genes, like MSH2 and MLH3. It must be because somewhere in the records for these genes they're mentioning our MLH1. We had only done a text word search. We didn't say search it in the gene name field. We said show us anywhere that MLH1 shows. You might be like, I searched on the gene. She said there's only one record for each gene. There is only one record for that gene in that organism, most likely, unless you have a discontinued, superseded record like there.
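The difference between the text word search and a search restricted to the gene name field can be sketched as query strings. This is a minimal illustration, assuming the Entrez Gene short field tags [gene] for gene name and [orgn] for organism; the helper names are made up for the example, and the URL follows the public E-utilities ESearch pattern (db, term, retmax). The code only builds strings and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint; the URL is built but never requested.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_term(symbol, organism=None, field="gene"):
    """Build an Entrez Gene query term (hypothetical helper).

    field="gene" restricts the match to the gene name field;
    field=None reproduces the all-fields text word search.
    """
    term = f"{symbol}[{field}]" if field else symbol
    if organism:
        term += f" AND {organism}[orgn]"
    return term


def esearch_url(term, retmax=20):
    """Encode the term as an ESearch URL against the gene database."""
    return ESEARCH + "?" + urlencode({"db": "gene", "term": term, "retmax": retmax})


if __name__ == "__main__":
    # Text word search: also matches records such as MSH2 that mention MLH1.
    print(gene_term("MLH1", field=None))
    # Field-restricted search for the mouse record only.
    print(gene_term("MLH1", organism="mouse"))
    print(esearch_url(gene_term("MLH1", organism="mouse")))
```

Showing the two terms side by side is a quick way to explain why the first search returned hundreds of records.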

[ Class Participant Speaking ].

ORGN, yeah. That's the short name. To be fair, there's probably a couple of the short name field tags that you want to just learn. But if you don't remember the tag, you can always use the limits to do a field-specific search. You see in the limits, we'll go through the limits here in a minute. We'll go through it while we're here. You can do the same thing that David was showing you in BLAST. Which is you can type in the organism, and it will jump you to that point in the list.

So if I, this is funny, I'm typing mouse here. This is the actual organism list from, come on, what is going on? I haven't made my space yet. I need to make my space, yeah. This list is the formal list that is coming from the NCBI taxonomy database. You saw earlier I typed mouse in the search field. For the most common organisms like human and mouse it doesn't force you to be technical. It's built in to save you from extra work. If you are doing a nontypical organism you have to type in the right name. Here on the list it does not say mouse; it's pulling the list from the taxonomy, so it's looking for the real name.

[ Class Participant Speaking ].

There's a lot of cross referencing. How tight is the taxonomy list? I think there's a lot of cross referencing in there and a lot of things to help improve the retrieval. In the taxonomy I think there's several sessions. In MeSH you have entry terms and real terms. I think this is from the official names. I think there's larger [ Indiscernible ] that appears in other NCBI terms that is entry terms and real names and higher level and lower level. Here it's a specific level. It varies, you will see it differently here and in BLAST and in something else.

[ Class Participant Speaking ].

Okay. You need to put in which organism you want. This is telling you that you need to put in human or mouse. If you just type in ORGN it will not know which organism you wanted.

[ Class Participant Speaking ].

The organism field limit in PubMed? Most of the organisms are MeSH terms. There's a tree of a lot of the organisms.

[ Class Participant Speaking ].

You mean why don't they do that? I don't know why. You could certainly suggest it. Yeah.

[ Class Participant Speaking ].

I agree. This is much easier than preview index. They've made a lot of improvements to how this search works, based on the fact that they expect naive users to be using this. These limits with the organism pull downs, they weren't there last year. I think BLAST used to not have the organism pull down work the same way. You had to know what you wanted, yeah.

[ Class Participant Speaking ].

That's a good question. Did you make the comment that the higher taxa seem to work in here, or are you asking?

[ Class Participant Speaking ].

Let's see. Yeah, what example should we use? Let me look --

[ Class Participant Speaking ].

Yeah, let's -- it looks like here, no. I think what they're doing here is pulling all of the individual specific organisms. But it's not that they couldn't. It's just a question of what they did, not what could they do. About the taxonomy, the development that they do for the taxonomy, I don't think they necessarily, if somebody improves an area of the taxonomy that's not an area that is well represented by records in GenBank, they're not going out to other taxonomy resources that are in development and trying to incorporate them. I do think they're constantly being informed by that external taxonomic knowledge change. Yeah. I know about that. I don't think they're partnering in that way. David is shaking his head too. I don't think they incorporate that in a coordinated way.

[ Class Participant Speaking ].

Right. Everything in here has a record in GenBank. It's not an exhaustive list. As far as I know they're not coordinating with others to make this a better taxonomic resource. Whether they're making individual changes as new things become known and changed, how they handle that, I don't know. We could explore that with them, yeah.

[ Class Participant Speaking ].

Right. The way you would do that is in the -- we're talking about searching by subspecies, you would go into the taxonomy database and pull up the level that you wanted, whether it was a high level or a specific level, and follow the Entrez Gene links from the taxonomy database. If your entry point is about the taxonomy you go in through taxonomy and follow the links. This is if it is a known gene and you want to focus it in. I agree with you. This is a relatively new development. I think they would appreciate suggestions about how it works and how it could work better for people coming in with that approach.

Okay. We have organism selections that we can make. Here's human. As you pick your organism it changes; it knows which chromosomes are available for which organisms. We talked yesterday about how the mouse chromosome set is different than the human. If I'm looking at human I get the choice of 1 through 22, X and Y, and [ Indiscernible ] for mitochondrion. You click on the limits button to get into this list. And then you type in your organism. Okay.

Once you pick the organism you get the corresponding chromosomes. The honeybee set of chromosomes is different from the human, clearly. You can exclude certain kinds of things. If you only want to look at records that have been reviewed, or if you are curious about the ones that are predicted or provisional you can look at those. Going to the taxonomy issues that we were discussing a minute ago, if you want something that is broader than the specific organism focus you have these options. I'm not sure, if you were focused on these would they meet the need that you were just talking about, Jennifer? What they've done, they've not shown you all of the plants, just two of the most commonly selected plants. Clearly if you went out and went into the taxonomy you would get a huge difference between plants. There's many, many, many more plants than you would think from looking at this. Yeah.

[ Class Participant Speaking ].

Let me go back to the human one. You mean these two numbers? Yeah, that's a good question. I think it has to do with the mapping. Yeah. I think it has to do with the actual positions on the chromosome map. I'll go back and get an answer to that. I haven't used that. I think that's what it relates to.

[ Class Participant Speaking ].

That number is the number of nucleotides in chromosome one and if you knew where you wanted to be you could narrow it. You definitely can. This is what I'm trying to look at. You can limit by organism without limiting by chromosome.

[ Class Participant Speaking ].

That's what I'm looking for. Yeah, no. No. MT is just the mitochondrion. Normally, you know, yeah.

[ Class Participant Speaking ].

Yeah. It's still doing it. This gene is not on chromosome one. Let's see what happens if we -- let me just take this off for a second. Yeah. You are right. Now it is searching just for chromosome one. That is weird.

[ Class Participant Speaking ].

Yeah, no. Yeah, okay.

[ Class Participant Speaking ].

The other thing is you notice this search box is limited by chromosomal region. When I normally do this I limit my search to the organism field. And put in what organism it was that I was looking for.

[ Class Participant Speaking ].

Right. I think that you could do it down here as well. We said we wanted humans. Right. I think if you just wanted to not deal with the chromosomes you could go down here and say limit by taxonomy. We'll leave this blank. Forget this for a minute. Turn that back. Now we've just selected human. Hopefully when we do this we'll get all of the chromosomes. Now we have 36. We have our MLH1, again. Which was on chromosome three.
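The same limits can be written as field-tagged clauses, which makes it easier to see why the chromosome one limit dropped MLH1. This is only a sketch: it assumes the short tags [orgn] for organism and [chr] for chromosome in Entrez Gene, and the function name is invented for the example.

```python
def limit_gene_search(term, organism=None, chromosome=None):
    """Append optional organism and chromosome limits to a Gene query.

    Each limit is an independent AND clause, so the organism can be set
    without setting the chromosome (hypothetical helper).
    """
    clauses = [term]
    if organism:
        clauses.append(f"{organism}[orgn]")
    if chromosome:
        clauses.append(f"{chromosome}[chr]")
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Organism only: every human record mentioning MLH1, on any chromosome.
    print(limit_gene_search("MLH1", organism="human"))
    # Organism plus the chromosome the gene is on.
    print(limit_gene_search("MLH1", organism="human", chromosome="3"))
    # Chromosome 1 left on by accident: MLH1 itself would be excluded,
    # because the session places it on chromosome three.
    print(limit_gene_search("MLH1", organism="human", chromosome="1"))
```

Typing the clauses directly in the search box is an alternative when the limits page behaves unexpectedly.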

[ Class Participant Speaking ].

These are the most commonly occurring organisms in GenBank by number of sequences, right. Yes, if you check the high level, if you say show me all of the plants, it will give you everything that falls into the plant taxonomy. Yeah. Exactly. The problem is that's a really high level. These are really narrow levels. There may be that point at which you want some group in between; that's when you have to take your approach from taxonomy and go from taxonomy to the link.

[ Class Participant Speaking ].

Yeah, absolutely. There's a place that defines the RefSeq status. There's several places. One of them is when we were talking about the -- what's the name of it? The information for all those submitters about what the records are. It goes through all of the different fields. Because the RefSeq status is a field it has that information. Your Entrez slides from day one have four times the link to that handbook that explains all of that. There also should be, if you are on the RefSeq page, which we just followed from that left hand link, there should be -- here's the status code. Let me review. So because gene comes from RefSeq, on the left side here when it says other resources there's a link. When you get to the RefSeq page, on the right hand side there's information about the status, which will answer the questions. The difference that you were talking about was between the validated and the reviewed. You might read these and still not understand in some cases what is going on there. A lot of the work that is being done is being done within NCBI; there's also external groups that work with NCBI on these. And so validated looks like, in addition to the work that NCBI has done, they've gone ahead and tried to compare it to the prevailing standards available for that. That's what the validation is, the external comparison. Who was the one that validated it in the particular case of the record? People that work with model organisms might know who is doing their validation.

[ Class Participant Speaking ].

Okay. Good question. Any other questions about this so far? Okay. Go back.

[ Class Participant Speaking ].

Yes?

[ Class Participant Speaking ].

The difference between gene and what? LocusLink was the precursor to Entrez Gene. It was a database where they pulled together the gene-based information and made it available. There's a slide in your module here, I'm going to go to the slide list. There's a slide here that says comparison of LocusLink and Entrez Gene. You may have people that were attached to LocusLink who haven't been paying attention to the transition. That went away in March of 2005, and as we get further away from this I'm seeing fewer and fewer people coming in and mentioning it. Especially among your librarian colleagues. People that took this course a couple of years ago and haven't done much with it since, they may say they remember it. They may say there was this great resource that I used to use called LocusLink. This is what has happened to it since. Okay, let's see, where were we? Where were we? Okay.

[ Class Participant Speaking ].

They do kind of weight it. It's also a weighted search. Certainly it makes sense that something that has the gene name in the gene name field is likely to also have that gene name repeated most frequently in its record. The searching algorithm gives it more weight. Once you get past the ones where it is in like the gene name field or the title type field, then it's just a frequency of appearance kind of behavior. Okay. What else haven't we talked about here in the limits? Okay.

Creation dates and modification dates. The same thing here. If you are interested in a gene and you want to keep track of what is added to these records about the gene you can set it up to let you know when there's changes made. This goes back to the point on the slide, which is that this is really intended to be representative information about a gene, rather than comprehensive information about a gene. Let me see how I want to explain this. There might be a lot of research going on about a gene. This will not link to every single paper in the literature about this gene. If this gene has this DNA mismatch repair function, it will link to a paper that explains that function. You know, one of those gene [ Indiscernible ]. There might be 600 papers that talk about that function. The gene record will not show you all 600. Its goal is to have a good paper that represents that function and the other functions that might be available. It's representative, not comprehensive. What will take you closer to that is if you follow out the links that are connected with it. Let's look at the MLH1 record in a minute as an example of that.

I'll go back to the history and pull the first record. This is the short display. You see a couple of things that will probably make you feel good as a searcher. You get the official symbol and official name in the kind of standardized format. You also get all of the other names of it. Remember we said before, early on in the life of a gene there's lots of different names as you go along. If your goal is to do a comprehensive, not missing a single paper about this gene search, you might be ORing them all together. I think it's safe to say, I have to be careful here, that the gene level indexing in the literature databases is not as controlled as it would need to be. A strategy in which you have all of the variations that you can combine with an OR is helpful regardless of which product you are searching. You see this is on chromosome three. You get a specific location. It will link you to the map, which we'll see in a minute. You get the link menu here, but let me caution you, these link menus are long. We'll all have the problem that happened over there yesterday, which is you are not going to see half of the link menu on your screen; there's lots of links. I don't like to use the link menus from here. I can never see them all. I always go into the complete record, almost always. I can see everything laid out very nicely.
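The ORing strategy for gene name variants can be sketched as a small helper that turns a list of names into one parenthesized clause. The alias list below is illustrative only; the real list should be copied from the other names shown on the gene record. The function name and the optional field tag argument are assumptions made for the example.

```python
def or_query(names, field=None):
    """Combine name variants into a single OR clause (hypothetical helper).

    Multi-word names are quoted so they are searched as phrases; an
    optional field tag such as "tiab" is appended to every variant.
    """
    parts = []
    for name in names:
        quoted = f'"{name}"' if " " in name else name
        parts.append(f"{quoted}[{field}]" if field else quoted)
    return "(" + " OR ".join(parts) + ")"


if __name__ == "__main__":
    # Illustrative variants; take the actual list from the gene record.
    variants = ["MLH1", "hMLH1", "HNPCC2", "mutL homolog 1"]
    print(or_query(variants))
    print(or_query(variants, field="tiab"))
```

Because the clause is wrapped in parentheses, it can be combined with other terms using AND without changing its meaning.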

You have two things on the right to help you navigate. Because these records are so long, and they also want them to be nicely structured for you, you have a table of contents to jump you down to the part of the record that you want. Then below that you have all of the things that take you away from here to other places. And this is what I said. That list of links that I said was really long, I couldn't see it all on my screen. We're not even halfway through this. You can also see it's not in alphabetical order. It's something that is at the whim of the people who create this. If there's resources that you think people will not find because they're -- in some sections they are. The NCBI, I think if you try to explain this in an order. The NCBI related resources are usually what is offered to you first in the link menu. For some reason the order cDNA clones, they want to promote to the scientists how to get their hands on it. They've moved that up first. Clearly you're not ordering the clones from the [ Indiscernible ]. Beyond that, everything from books through [ Indiscernible ] are all NCBI resources. This is their highest scientist priority. Then they start to take you to external relevant resources. So most everything that you have here going down are not NCBI until you get back to [ Indiscernible ] and LinkOut. It's not an absolute thing. But in terms of trying to have a conceptual map, I think that's mostly what is going on. Whatever they're trying to promote, then NCBI things, then external things, and within those groupings they're kind of alphabetical, but not exactly. If your inner cataloger is asking why -- I've done the best I can to help explain that. Okay. Then you see there's, um, you know, other links you can follow. These all expand. When you click on them they pop open with the help and all of those kinds of things. And then this should be the announcements list. They mean subscriptions to the email list that we were talking about before.

Is everybody okay with what is on the side bar? We'll not go into each of the individual external resources. Some of them are valuable resources that have been around a long time and have a lot of really great things to offer. Others of them, like the HuGE Navigator, are very new. This is a human genome epidemiology site. This is the first time that the CDC's epidemiology group and NCBI gene have worked together to have a connected database. I'm excited about that. Currently there's not a lot of stuff built into it yet, because it's new.

[ Class Participant Speaking ].

Even though it's called HuGE. You see this record is recently updated. These are touched a lot. If you are trying to follow a gene it can't hurt to set up a search and have it notify you when there's changes to the record, or whatever. NCBI is trying to acknowledge when the information comes from an external source. The symbols and names come from the HUGO Gene Nomenclature Committee. They're not trying to reinvent the world. They're using what other people are providing and giving them credit for it, which is nice. You get the information we've been talking about, the status, trying to think if I want to say anything about this. MIM, those are links to OMIM. When they built this OMIM was not online yet. You will see it in different ways in some of these records.

A couple of these areas are things that I will leave alone for the most part. Once you go through and have your discussion about the maps, which is what David will go through after this section, the context of this will be a little bit more clear. You can see down here in red, this is where your MLH1 gene actually lies. You see there's lots of other interesting things in the same context. You were asking about the number range earlier. If we looked at chromosome three and it said zero to whatever number. These numbers here represent those numbers on the chromosome map that David was talking about. About the locus of where you are. If you clicked on this you would jump to the Map Viewer, where you will have more on the map. We'll not say anything about this right now. But once you understand maps you will be excited that you can start from the gene record and go to the map. Let me go up above.

This is a refresher of day one. You see that you have the coding regions. This is your five prime end and the three prime end. And the regions that are coding. You see NM? What do you remember from yesterday that things that are prefixed with NM are?

mRNA, right. And NP records, which are RefSeq protein records. To the right of it this is the CC [ Indiscernible ] record. It should take you to the consensus coding sequences for the whole thing. It gets a little confusing. It's like, when you are thinking that something has CD, it might be conserved domains, depending on what it is; here it's a coding sequence. Okay. There's a sequence viewer, which we will not use here. One of your questions engages you in using that. We talked about the reference and function already. These are the functions that the gene has. They have links to PubMed. If you don't remember what a GeneRIF is they have the help insert to tell you. If you think of a new function you can submit it. There's lots of individual scroll bars in here. That's the other thing that is deceptive. I don't print these very much. I don't think I have ever printed one. If somebody is trying to capture this, say you find this record with a student and you want to print it for him so he can take it away, you might check out how the printer behaves with this. I have never tried it. I will do a print preview later. I don't want to do that at this moment. Especially because it's not a [ Indiscernible ].

[ Class Participant Speaking ].

Which one? Sorry, this one? This one?

[ Class Participant Speaking ].

Oh, here. You mean here. Okay. Yeah. That's great. Like I said, I've never bothered to do this ever.

[ Class Participant Speaking ].

18 pages, that's not that bad. When you think about your typical [ Indiscernible ] review length this seems manageable for everybody. Let's go back. That's great. Thank you, I never tried to send it to that. I can imagine that would happen. You are working with somebody and they ask to have it. Remember that you can use this to jump down in things. A little bit of what is weird is the difference in terminology between the table of contents and the body. You are used to the term GeneRIFs now. You will all remember that. What you have here is bibliography. If you make that connection for yourself, that the GeneRIFs are references from the literature and the literature makes bibliographies, that will make sense, you know.

[ Class Participant Speaking ].

It's representative. When you go out to PubMed links you see what it's giving, it's giving you 347 links. I don't think you will actually see, that's part of the problem. When people follow the related links, one of the reasons they don't feel comfortable relying on them is that it doesn't tell you the strategy that it used to build the links. It says you came over from the gene record. You don't really know, is it just anything that had that gene? What precomputed search did it do? We went from 201 papers here, 201 functions, to 347 papers in the PubMed links. If you searched PubMed you would find many, many, many more papers. The 347 are papers that have links in their records to this gene record, I think. Does that make sense to everybody? Okay.

Okay. We're almost through this. Then we'll turn it over and go to maps and genomes. A lot of these things interact. What you get is information from the literature about what proteins interact with others, and records for those. The source of those, some say BIND, some say the Human Protein Reference Database. BIND is interesting. It used to be a free database. We talked about this relationship between what is free and what is not, and what starts out as free. HPRD, that's a free resource that you can use. We were talking about BIND last year. It used to be a free resource. Now it's a resource that requires registration, and there are tiers of it also, where some small pieces of it are free and big pieces require subscriptions. There's an open access free part. Again, even if you will not use these for yourself, even if you will not use BIND for your own research, you should go ahead and have free accounts for your reference desk so that you can follow the links out. Then at least you can show people the basic things. There are different levels of what you get for the free versus the paid. The other thing to figure out: if the library is not the central bioinformatics unit, someone else might have already purchased a license. You should try to find that out. There's more that David might mention this afternoon, when we talk about who you work with and what you do when you get home. And there's the literature links in this, as I said.

The markers we'll leave alone. We'll get more information on markers when we deal with maps. The last part of this is about clinical information. This is an area where most of you are probably more comfortable. You have information about the genotype and the phenotype. The genotype is the variation. And the phenotype is how they reflect clinically. You can see that this gene is associated with different conditions. All of these are conditions that are OMIM records. Some of them might be very apparent. Knowing this is a mismatch repair gene you might not be surprised by that. But you might not know; I don't know what [ Indiscernible ] syndrome is. If you click on the records you will find out. We talked about the KEGG pathways a little bit. It's homologous with the rat and the mouse. We'll talk about that more when we go into Map Viewer. The fact that it's saying that it's homologous with the rat and mouse doesn't mean that it's not homologous with other things. It's saying if you want to look at this you can look at it in mouse or rat. But there's plenty of other things as well. They just don't have separate products, okay?

Gene Ontology is a big naming group, for those of you into vocabularies. This is a growing area of these records. About their evidence codes, we've been talking to them. It is a little hard in the help to find what the evidence codes mean. You could browse around for a while before you would find that. We made a recommendation for them to make it more linked. They've not done it yet. If you do a Google search for gene ontology evidence codes you will find the list of what they all mean with no problems.

I think we already talked about the reference sequences. And additional links, the things we saw on the side bar. Those are just recapitulated down here. Gene tests, you can see whether there's any labs that test for this gene. Things like that. That's it in a nutshell for what the Entrez Gene records cover. Do you feel like you understand why we're encouraging this as a one start point for all of this? Do you have any questions about what we've looked at so far?

[ Class Participant Speaking ].

It became operative in about 2003. That was on their comparative slide. December 2003. It's certainly been growing and building and all of that. Let's see. The only other thing that I want to point out, there's a couple. I can say them from this slide, I think. Just because we searched on the gene name, you don't have to approach the database from the name of a gene. You can approach it from a condition or a function that you are looking for. This was also approached from the human colon cancer perspective. This is where your synonyms can come into play. This doesn't have MeSH in it either. If it says adenomas, or whatever, if it says things other than the words that you are using, you need to be more careful when you search this. Even though it has this value added information, it doesn't have normal subject-based indexing being applied. There's a lot of cross between the RefSeq records and the Entrez Gene records; they're built off of RefSeq. There are some unique things in Entrez Gene that are not in RefSeq. If somebody said, I love RefSeq, why are you trying to switch me over to Entrez Gene? Here are some of the other things that you get in Entrez Gene that you don't get in RefSeq. The interactions with the other proteins that came from BIND. The gene ontology terminology is not that well connected in the RefSeq records. Gene ontology is now covered in the RefSeq records, but the evidence level information is not covered. This needs to be updated now. Pretty soon the RefSeq records will really connect to all of these things; it's a matter of time. It's a what-format-do-they-prefer-to-look-at-it-in type of question, and whether this one has unique information that the other one doesn't.

If you find it in here it will link you to every other place that NCBI and other places have information about it. If you search it in Entrez Gene and you don't find it, then go out to look for RefSeq records for it. Or approach it through other ways. You have a question? No.

[ Class Participant Speaking ].

Oh. Actually, why do I keep doing that? The question was about the labels. Are you talking about the label, hold on a second. Let me go back to the first screen. You can also use the Entrez home, that's probably easier, to do that.

[ Class Participant Speaking ].

Yeah, yeah. Yeah, right. Yeah.

[ Class Participant Speaking ].

Yeah. Yeah. In terms of using those different things, if you were really trying to narrow it down to something, you could. One I would encourage you not to use, I'm excited about it but it has nothing there, is genes of [ Indiscernible ]. If you do that search right now, you get only one hit. Everybody says, oh, well, I want to look at things that are clinically significant. At this point I have no information about how they're deciding which things are significant or not. I think once that's there it will be a very cool way to narrow it down for people that say, I want to focus on variants that are clinically important. Right now I don't know anything more about how it works. If you are looking for information in certain parts of the record, or in the case of these things, most of the time I don't bother to use the symbol or gene name fields. Very often the gene name generally pops up to the top, you know, because of how the search is approaching it. Yeah.

[ Class Participant Speaking ].

No, sorry.

[ Class Participant Speaking ].

This one, use search field tags. Uh-huh.

[ Class Participant Speaking ].

Uh-huh. Right. I think this is what Maria was talking about. If you were, you know, what was the name of that syndrome? The muir whatever syndrome. If I wanted to say show me anything that had to do with that. Then I would go ahead and put that into focus myself in case there were other places that word showed up. It depends on the word that you are searching, I think.

[ Class Participant Speaking ].

The difference, she did pancreatic cancer. When you use the disease field you get 9. When you use [ Indiscernible ] you get a couple of hundred. If you are looking for a couple of records, sure, this is a great way to narrow it down quickly. I don't think most users behave this way. I think this is a way that librarians might approach it. The things on the front are things that the users want to do with it. It might be worth the energy. To be fair, this example they give for disease is a good example. You might be afraid there's a gene name with that word in it, a gene that is not the gene that relates to that disease. You want the disease, not the gene name. Being able to sort that out by using an appropriate field, as opposed to sorting it out by visually examining all of the records, might be a good thing. It depends on what you are trying to do with it. Um, okay.

[ Class Participant Speaking ].

Right.

[ Class Participant Speaking ].

Yeah. If you search it in the disease field it might be too limiting, a lot of things that relate to pancreatic cancer might not show up. Great, any other questions?

Okay. Now we're going to get to the big picture. I think we can kind of skip through some of this first part quickly because we've already been through it. You know what a genome is for any given cell. These slides always go out of date pretty quickly. You can find from the statistics how many complete genomes there are. You can follow those links there to find that information. The finished sequence is in there. It also includes some genome projects still in progress. Then there's the ability to explore regions of interest. One of the things that is very difficult with genome maps is scale. How do I know where I am on the map? I'm going to quickly show you a couple of maps. We'll look at this one, Haemophilus influenzae, this is a bacterial genome. Anybody know what Haemophilus influenzae causes, diseasewise? Not flu. It's not influenza. It causes earaches in kids. Kids get a vaccine against this, and it helps to prevent earaches. It was discovered during one of the big influenza epidemics. It's an opportunistic organism. It's always there in our body, or associated with us. It only becomes pathogenic under certain conditions. If you are sick and you can't fight it, it will take advantage of that and start to grow.

In the genome map, at the top, for prokaryotic genome maps, you have the bread crumbs up here. You have the complete lineage for this. This is one particular strain of bacteria; not all of Haemophilus influenzae looks like this. This is one strain of it. It gives you the RefSeq information for that chromosome; NC is NCBI chromosome. It gives a GenBank number. It gives you the complete length of the genome. It's 38% GC. It's 84% coding. This is the complete Haemophilus influenzae genome right here. This is the one complete chromosome that is in the Haemophilus influenzae genome. Okay? It is a circular chromosome. It tells you the shape. It's a double strand of DNA. There are close to 1700 [ Speaker Faint/Audio Unclear ]. 1657 of them are protein coding. Relative to eukaryotes, this is a huge number of protein coding versus non-protein coding. That's a huge percent coding. If you look at human, the percent of the genome that produces a final protein product is about 1.5%. It's a much smaller number. That's because most bacteria have optimized for rapid growth. There are structural RNAs, what does that mean? Most of these are the transfer RNAs and the ribosomal RNAs. There are --

[ Class Participant Speaking ].

Circular.

[ Class Participant Speaking ].

There's a five prime to three prime direction, it's just that you don't ever hit the end. The GC content is the percent of the nucleotides in that organism that are G and C versus A and T, okay? Pseudogenes, we discussed them briefly. Pseudogenes are nonfunctioning genes that look like a functioning gene. They're a gene that is not transcribed. There's something wrong with the gene. It's called a pseudogene. It doesn't actually function. There are 54 others. 54 others means that we don't actually know what they do. We don't know if they're coding for proteins or not, or if it's a structural RNA or not. The reason that we don't know that is because most of these bacterial organisms that are sequenced are things we can grow in the lab. We're not looking at them under field conditions. We may never see these genes and what they do. It's interesting to look at some of the differences. One of the faculty members at the [ Indiscernible ] department at Harvard went out and did collections of [ Indiscernible ]. He brought them back into the lab, and he could not keep them in a dish. They would overgrow and just grow out on to the table. There's lots of genes like that. Because these are organisms grown in laboratory conditions, there's genes where we don't know what they do out in the wild.
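The G and C percentage described above is simple enough to compute by hand for any sequence. Here is a minimal Python sketch of that calculation; the function name and the example sequence are made up for illustration and are not taken from the NCBI record on screen.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C bases among the A, C, G, T bases."""
    seq = sequence.upper()
    gc = sum(1 for base in seq if base in "GC")
    total = sum(1 for base in seq if base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


# Made-up 12-base example: 3 of the 12 bases are G or C.
example = "ATGCGATATTAA"
print(f"{gc_content(example):.1f}% GC")
```

Ambiguity codes such as N are ignored in both counts, which is one reasonable choice; a real genome report may handle them differently.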

Does everybody remember what a contig is? It's a large piece of chromosome that is used for [ Indiscernible ] assembly. Or it's used for cloning large pieces of DNA. To do a genome project you want to take larger pieces that you can clone. We have a series of cloning vectors that are [ Indiscernible ] from how much DNA you can put into them to replicate. We start with plasmids; they can contain a small amount, up to 15 kb of DNA. 5 or 6000 bases. You have phage vectors, where we can clone 40 to 50 kb of DNA. And then you have these larger things like bacterial artificial chromosomes. You can construct a chromosome that you can grow in a bacterium that is mostly a clone of another organism's DNA. You have yeast artificial chromosomes. [ Audio Cutting In and Out ] get these larger pieces of DNA. To isolate that piece from the rest of the genome and break it down to do the analysis. A lot of the genome projects worked on chromosomes. There was a chromosome nine group, a chromosome ten group. They worked on one chromosome. They could separate that piece away from the rest and start to work on that piece. The contig is what is put together in the end. You use things like MegaBLAST to start putting the ends together and connect the contigs together to put together the whole chromosome.

We have BLAST homologs up here. This first one is an interesting one. It's called COG. To tell you what an ortholog is, I have to tell you what homology is. It's a term used in evolutionary biology. It means that this trait derives from a common ancestor. Okay? So I have an arm, a bird has a wing. Those are homologous structures. Somewhere way back one of our ancestors grew a limb. That has evolved into various things. In molecular biology people started to use [ Audio Cutting In and Out ] to mean similar. We can have sequences that are not actually homologous based on the definition. Because you have two different types of evolution. Things can look the same even though they come from different places. Or things can look different over time. So using the word homolog is an opinion; it is not a scientific fact when talking about molecular biology, okay? Unless you can go back and look at the ancestors and follow that evolution all the way through, which we can for some of the sequences, you don't know that it was derived from the same ancestor.

[ Class Participant Speaking ].

Yes, yeah. There's no such thing as percent homology. It can be 20% identical or similar. But it cannot be 20% homologous. Within the meaning of homology we also have two different kinds. One is orthology, that is the same structure in a different organism. Those are orthologous. You can also have paralogy. That's a gene duplication within the same organism. Okay? Paralogous means within the same organism, between different genes that look the same. Orthologous is two different genes that look the same in different species of organisms.
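What you can put a number on is identity. As a hedged illustration of the kind of figure an alignment report gives you, here is a small Python sketch that computes percent identity over two already-aligned sequences. It assumes the sequences are the same length and use "-" for gaps, and it skips gapped columns, which is only one of several conventions; the function name and example strings are invented for this note.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped alignment columns where both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Five ungapped columns, four of which match.
print(percent_identity("ACGT-A", "ACCTTA"))
```

The result says how alike two sequences are. Whether they are homologous is a separate conclusion about ancestry that the number alone does not settle.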

[ Class Participant Speaking ].

It's actually O. Okay. So now what we have here in our clusters of orthologous groups are groups of genes that are based on a function. This is like an ontology, based on the function of what the gene does. Then you can look at how many of these gene groups are involved in translation. And we've got all of these genes involved in translation here. It's a list of the genes with that function from this particular organism. You can use those. If you are looking for a gene involved in translation in another organism, a particular type of function, you can pull that gene out and get its sequence and [ Audio Cutting In and Out ]. Against your organism.

Usually you can see how many 3D structures are derived from this organism. I don't want to spend a lot of time on these. There's a couple, let's see -- here you can look at what the relationship is between whole genomes. We've set a cutoff of 95 here. That 95 represents the score. Our expect value is 10, which is the default. If we've set those as the parameters we can look at how many of the genes within Haemophilus influenzae match to the virus group, the eukaryotic group, and [ Indiscernible ] okay. We see that at that cutoff it matches closely to most of the eubacteria but not to viruses or eukaryotes. You can change those scores [ Audio Cutting In and Out ]. This is another interesting one. Sorry. This is GenePlot. I want to show you GenePlot. It will show you, um, [ Indiscernible ] that I didn't show you yesterday. We're looking at Haemophilus influenzae Rd KW20, versus 8628 [ Indiscernible ], two different strains of Haemophilus influenzae. You can see -- what?

[ Class Participant Speaking ].

It is the same basic species, but two different strains. Down here you've got a list of proteins; it's a comparison between those two. These look like mostly fairly highly conserved proteins. Let's try this one. Over here you see a list with dots that says BL2seq. That's BLAST 2 Sequences. Okay? I will show you what it looks like. BLAST 2 Sequences is a type of BLAST that you can get from the BLAST home page. Use this for any two sequences that you want to compare. It gives you a plot. A perfect match would be a straight line from corner to corner. Okay? You see small differences here. You can pick out these differences that you see in the sequence down here. The hit plot gives you a quick view of how similar these two sequences are. This is a good BLAST to know about if people come in and say they want to know if their sequence really aligns with this other sequence. This is the same here. You are matching the whole chromosome against the whole chromosome. What happened up here, do you think?

[ Class Participant Speaking ].

It's a type of mutation. It looks to me like a part of this chromosome in one strain is inverted with respect to the other chromosome. Because what we're matching now is the reverse sequence against the forward sequence. That would go in this direction, rather than in that direction. Does that make sense?

[ Class Participant Speaking ].

These are essentially the nucleotides against each other. This is the whole genome against the whole genome. Okay? If you've got the five prime to three prime end going in this direction for one of them, it should be the same for the other one. Except for this area here where you have an inversion. Part of that chromosome broke out and reinserted. Now, when you align the two in that region, you reverse the orientation of the chromosome with respect to that particular [ Indiscernible ]. Okay? That make sense?
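To see why an inversion shows up as a line running the other way in a plot like this, here is a toy Python sketch. It is not how GenePlot or BLAST 2 Sequences works; it just compares short words (k-mers) from one sequence with the other sequence, on the forward strand and as the reverse complement. All names and the 20-base sequence are invented for the illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def dot_plot_matches(a: str, b: str, k: int = 4):
    """List (i, j, strand) for each k-mer at a[i] found at b[j].

    strand is "+" for a forward match and "-" when the reverse
    complement of the k-mer is what appears in b.
    """
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    hits = []
    for i in range(len(a) - k + 1):
        word = a[i:i + k]
        for j in index.get(word, []):
            hits.append((i, j, "+"))
        for j in index.get(reverse_complement(word), []):
            hits.append((i, j, "-"))
    return hits


# Made-up "strain A"; "strain B" is the same sequence with bases 5..12 inverted.
strain_a = "ATGGCGTACCTTGAGACCAT"
strain_b = strain_a[:5] + reverse_complement(strain_a[5:13]) + strain_a[13:]

hits = dot_plot_matches(strain_a, strain_b)
for i, j, strand in hits:
    print(i, j, strand)
```

Outside the inverted block the "+" hits step along the main diagonal. Inside it the matches come back as "-" hits whose position in strain B goes down as the position in strain A goes up, which is the reversed segment seen on the screen.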

[ Class Participant Speaking ]. [ Laughter ]

Right. You can switch which organism is on which side. Some cool things down here with the map. Bacterial chromosomes, we'll talk a bit about mapping. One of the interesting things about bacterial chromosomes is they look like a clock. Originally map positions were based on clock positions. The interesting thing about that is how the original mapping was done. Bacteria have sex just like most other organisms. You can put two different bacterial cells, one with a mutation and one without, together and allow them to mate, but disrupt it at different points in time. Then you can ask, did that mutation get passed over? Which block of time did that mutation pass through in? The chromosome is fed through as a strand. You can map that gene based on when it passes through in the mating process. Okay? That helped to organize genes on a map very early on in bacterial chromosomes. It was an early type of mapping in bacterial chromosomes. So, there was a map position. Now if we change the hands of our clock here, there we go. You can see we're moving along the chromosome. We can pick out which genes are at which positions. Okay? You can go all the way around the chromosome that way. Now we can zoom in. I've never gone all the way down and looked. It doesn't get us to the nucleotide level here. You can do that by going out to the units cells.

These two, um, links are for FTP of this organism's genome. You can download the genome into your own computer for further analysis of it with your own tools. Everything from NCBI is free. You can FTP download the entire GenBank if you want to, or the entire Pub Med if you want to. We actually do that. We do a lot of data mining.

[ Class Participant Speaking ].

We do that. It's big, but it's pretty manageable now given what computers can handle. Okay. Let's go back. Look at how different our eukaryotic genome looks. This is a human. This is the top part of the human genome. As you can see, it looks very different than a bacterial genome. Lots of linear chromosomes, rather than a circular one. Here's the mitochondrial chromosome. Which is circular, actually. If we wanted to find something in here, so -- they give you the search box up here. I'm going to do a quick search for colon cancer to show you. We won't limit it to which chromosome it's on. I clicked on ND, in the slides I clicked on Homo sapiens. People do this, they say, wow, cool. Look at these genes that are involved. All of these genes do have something to do with colon cancer, or are related to some gene that is related to colon cancer. Okay, I want to look at this gene on chromosome one, how do I get there? It doesn't work that way. You have to go down here and download it from this list down here. Okay? Yeah?

[ Class Participant Speaking ].

Actually, um, there's several different ways. Let's go back. Let's start off at PubMed. I'll go here to Gene. If you can go to a gene view of the gene that you are interested in, you can get back into Map Viewer right here. It's got it under the genomic context, if you drop your -- this will take you into the Map Viewer; it will not take you to the Map Viewer home page.

[ Class Participant Speaking ].

It's not listed in a drop down list.

[ Class Participant Speaking ].

Okay, here is our gene on the chromosome.

[ Class Participant Speaking ].

You cannot. The little red markers look like links but they're not, you have -- what I did was go to Gene and search in the Entrez Gene database. And went to the genomic context.

[ Class Participant Speaking ].

Not for my gene. No, you can do a search to narrow down. Let me show you more about narrowing down. For some reason this doesn't link back. But you can go to Map Viewer home right here. This is what the Map Viewer home looks like. It lets you pick the organism from the list. We're going to go with our old favorite here. Here we are back. This search box up here is just like what Chris has been showing you; it's an Entrez search. Okay? You can limit in the same way that you limit all of your other searches. The advanced search gives you choices. This is for the people that don't know how to do bracket searching. You can put lots of limits on the search from the advanced search. Okay? It's on the upper right-hand side of the Map Viewer page. It's just this button here. Okay? You can restrict the field that you search in. You can restrict it to a chromosome, or an assembly. These are different genome projects that are starting to bring in new data. Okay? These are each separate assemblies, separate distinct genomes that are assembled by different groups.

[ Class Participant Speaking ].

Remember we're not all twins. So there's a lot of variation.

[ Class Participant Speaking ].

If it's the [ Indiscernible ] project it is one person.

[ Class Participant Speaking ].

If it's the other ones, they represent maybe up to five to ten people. You could also restrict --

[ Class Participant Speaking ].

Limit it to that, yep. You could clear them all and then select the one that you want. Up here you see selection based upon various types of things. You can select ab initio gene, transcript; all of these are separate maps in the Map Viewer. Ab initio is Latin. It means from first principles. It means there's no evidence for that being a gene, is what it means. The only evidence comes from the sequence; it doesn't come from the fact that someone has cloned a messenger from it, or someone has seen a structural RNA from it. It means that in silico people have picked out that there should be some transcript here. It is definitely a hypothesis. There's no evidence for it. You can pick out all of the hypothetical genes if you want and see all of those. Maybe you think you have a new one and you want to see if it's in that group. In the upper right corner there's an advanced find. Choose human first, if you are on the Map Viewer home page.

[ Class Participant Speaking ].

Over there in that build 36.

[ Class Participant Speaking ].

FISH stands for fluorescence in situ hybridization. It's a technique. What that means is that people take and do a spread of a chromosome. They take human cells and squash them out. They take a fairly large clone and make it fluoresce. They label it to stand out. They hybridize that onto the chromosome spread so they can tell where the clone goes in the genome on that chromosome.

[ Class Participant Speaking ].

This is, um, a little bit more modern than just looking at the staining of the chromosomes. You are doing a molecular technique to make that clone hybridize to that spot on the chromosome. It localizes DNA. You would use banding plus this piece of fluorescence to tell you where that gene lies in relation to the banding. It's a type of mapping.

[ Class Participant Speaking ].

Right, on the right hand side.

[ Class Participant Speaking ].

Right.

[ Class Participant Speaking ].

Okay, we watched this video already. Unless you want to see it again, I don't think we need to see it again. We'll talk a little bit about how these things are put together. Here's where we talked about con tigs. Here you see the con tigs starting to come together. They're being sequenced. That means that you are sequencing lots and lots of different pieces all at once. You are starting to orient them by the overlaps. Put them together. Then that's the completed product. Remember that makes a difference in how you search for these things, because the accession number remains the same, but the GI number changes every time the sequence data changes. Okay?

[ Class Participant Speaking ].

Okay. Sequencing happens, let me get this board out here. Okay.

[ Class Participant Speaking ]. [ Laughter ]

Okay. I talked about contigs. We've pulled in big pieces of DNA from the genome, separated out from the rest of the genome. The way sequencing happens is you have these large pieces of DNA. They may be various sizes, up to 500,000. So you could have very large pieces of DNA. For sequencing to happen you have to have a primer. You have to have some way of starting. You need to know a little bit of the sequence within this piece in order to start the sequencing project. That was where the ESTs and the STSs came in. We've got the ESTs. We look at whether they hybridize. We know where that expressed sequence tag is. That is a piece of messenger RNA. An expressed sequence tag is something that was a message, cloned out and turned into DNA. It's a small piece of DNA, usually 150 base pairs long at the most. If we can localize one of those ESTs or STSs to one of these contigs, that gives us a starting point for priming. Now we can start marching across. Sequence here and make a new primer at this point, et cetera. For us to start connecting the contigs we have to have pieces that go across. You have this sequence match, now you can march across here. Now you can march across here too. That's how these contigs get put together. It doesn't show you all of the overlaps here. It's like a march, kind of.

High throughput means you are taking lots of pieces and sequencing them all at once. If you've ever seen, I can't remember which of them, there've been lots of different pieces done on the human genome sequencing project. If you've ever looked, there was one about TIGR, it was Craig Venter's place that he established, where they showed a picture. It was like a factory. Something like 300 sequencing machines in one huge room. They take that data, which is automatically read by machines, and start putting these pieces together by machine. Okay? It's just volume. High throughput just means large volumes, very fast.

[ Class Participant Speaking ].

It is essentially complete now. Um, we don't know all of the variation. There still may be some small pieces of the repetitive elements that are not quite completely clear. But I think the human genome is essentially complete, I think. An endless supply of humans for us to sequence, right?

[ Class Participant Speaking ].

Oh, yeah. It's just a bunch of [ Indiscernible ]. Sorry. You have to have somewhere to start. You need a primer to start a sequencing project. Sequencing has to start from a piece of DNA that is hybridized to another piece of DNA. You are always building in a five prime to three prime direction. It doesn't start from nothing, okay? You have your primer, you localize it. It could have been from one of those expressed sequence tags. Then you build across. You sequence across this entire contig using primers. Then you start using BLAST or MegaBLAST to put the pieces together by the overlap. Um, the other technique, I can't remember if it's in this group of slides or not, is called whole genome shotgun assembly. The first technique took isolated chromosomes and established different sequencing projects for different chromosomes. This one is blasting the whole genome into pieces, and you are relying on being able to sequence accurately enough to put the sequence together.
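The idea of putting pieces together by their overlaps can be shown with a toy example. The Python sketch below is a deliberately simplified stand-in: it uses exact suffix-to-prefix matches and a greedy merge, where a real project would use alignment tools such as BLAST or MegaBLAST and would have to cope with sequencing errors and repeats. The function names, the minimum overlap of 3, and the three short reads are all invented for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(reads, min_overlap: int = 3):
    """Greedily merge the pair of pieces with the longest exact overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                n = overlap_length(left, right, min_overlap)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up reads that overlap end to end.
reads = ["ATGGCGTAC", "GTACCTTGA", "TTGAGACCAT"]
print(merge_reads(reads))
```

If two pieces share no overlap of at least the minimum length they are left as separate contigs, which mirrors the gaps that remain when no sequence spans two contigs.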

[ Class Participant Speaking ].

Yeah.

Oh, yeah. Yeah, yeah, yeah. Let's take a break until 25 after the hour.

Session on break until 11:25 eastern time zone.

Okay. As we start to put the genome, as NCBI -- I just turned it on. It should be on. Okay? As we begin to put the genome together, as NCBI begins to put it together, there are certain steps they take with each piece that comes in. Data is screened for contamination. Early on when GenBank was first put in there was a big problem. People would do a BLAST search with the piece of DNA they had just sequenced from their vector, which may have been a piece of virus, and they would find lots and lots of hits in the genome. They would say, what is this? When you start, you are starting with a primer, and the first primer is always in the vector sequence. You sequence across a part of that vector into the new DNA. A lot of sequence got entered into GenBank that had this contamination on the ends. People would pick it up in their BLAST searches. Now, before NCBI accepts any piece of DNA, they screen for vector contamination and bacterial contamination. Um, and get rid of that part before they put the sequence into the database. Then they compare the sequence. If it's human sequence they will compare it to various species, mouse and rat. And ask what the sequence similarities are between the sequences, and put them into the database, the homology database. Then they go into the contig, which we've talked about a little bit here. You know where these sequence tag sites are. That's why they're called sequence tag sites. They're placed in the genome, we know where they lie, we know their position. You can start to place these contigs on to the chromosome if you do the whole genome shotgun method.
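The vector contamination problem can be pictured with a small Python sketch. This is only an illustration under simple assumptions: the read begins inside the vector, so the start of the read exactly matches the end of the vector. Real screening is alignment-based and tolerates mismatches. The function name and both sequences are hypothetical.

```python
def trim_vector_prefix(read: str, vector: str, min_match: int = 8) -> str:
    """Remove the leading part of `read` that exactly matches the end of `vector`."""
    longest_possible = min(len(read), len(vector))
    # Look for the longest stretch of vector sequence at the start of the read.
    for size in range(longest_possible, min_match - 1, -1):
        if vector[-size:] == read[:size]:
            return read[size:]
    # No vector sequence found at the start; return the read unchanged.
    return read


# Hypothetical vector end and a read that starts 12 bases before the insert.
vector_end = "GGATCCTCTAGAGTCGAC"
raw_read = "TCTAGAGTCGAC" + "ATGCCGTTAA"
clean_read = trim_vector_prefix(raw_read, vector_end)
print(clean_read)
```

Without a step like this, the vector bases at the start of the read go into the database along with the new DNA, and that is what people were picking up in their BLAST searches.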

The annotation process: you annotate by putting in what you know about the [ Speaker Faint/Audio Unclear ], what you know about sequence tag sites, SNPs, and then the various other annotation features we've seen on our Entrez results screen.

This slide actually gives you a lot of nice information about things. There's a whole series of [ Indiscernible ] sessions on the Human Genome Project. Pretty soon you won't be able to put in homo as a symbol to search for human. You will start pulling up Neanderthal genes. I've already seen that, actually.

Mapquest versus Map Viewer. This set of exercises you will do is put in to get you oriented to looking at things from a high distance. Okay? What you want to think about when looking at these Map Viewer maps is that you are looking down at a chromosome. You are looking down from a long distance when you start off. There will be lists of genes that you will see. Those are not all of the genes that are there. I will show, it can be confusing. This Mapquest link doesn't work anymore, actually.

Couldn't get it to work last night.

Okay, here we are looking down on Maryland. So we have 1-23 and what we want to see is where is Rockville in Maryland, so let's zoom in a little bit. So we can find Rockville and center our map with relation to Rockville. Remember that when you're using the Map Viewer because it's easy to do that. If you don't center it directly on the region, even if you do center it directly on the region, sometimes when you zoom in, it will shoot past your gene and it won't be on that segment that you're looking at. So there's two different ways of doing this: just looking at a region and seeing what genes are in the region, like this, and you could also go back into the search box, oops, and search for a specific address. Or in Map Viewer's case, a specific gene. So what is it that she wants us to look at? So that is Michael NCBI's address.

Map Viewer is the same way. Let's go back to Map Viewer and select "Human" again. And now let's select chromosome 3. So what we see here is the ideogram giving us the banding patterns on our chromosome, okay? And those banding patterns are numbered, like they would be, and we have the contigs placed on our chromosome, so which contig each of these genes is contained in and the contig name right there. We have the UniGene STS tag sites so you can see where each of those tags is with relation to the contig, and we have gene seq, so this is one map. If you guys don't have the gene seq map on the right, this map that's on the rightmost side of Map Viewer is the master map and that means, being the master map, it's going to have all of these things out to the side that relate to this particular link. I'll show you how to change your master map in just a few minutes.

Now you see we have lots of links. This is a chromosome 3 region, and the reason that this is confusing is because they list particular genes here, but if you look at where those genes lie, it's one very small piece of each of these colored marks over here. There's lots and lots of genes that we're not seeing because we're looking from way up high. Okay? So let's drill down. This is the entire chromosome.

[ Class Participant Speaking ].

Where do you see that?

[ Class Participant Speaking ].

Oh, this is 0-200 million base pairs, yes. It's pretty much, this chromosome is a hundred some million base pairs, it's more than 200 million. To tell you the truth I don't know why those particular ones pop out. So let's zoom in here, two marks on this zoom in piece. And now you're starting to see more and more genes and you're starting to see more and more of each gene's structure. Zoom in is right here.

[ Class Participant Speaking ].

No, I didn't mark anything. I just zoomed in. I just zoomed two buttons. I zoomed to the middle here. It doesn't matter. This region now shows you what you're seeing on the screen, okay? So those are the numbered nucleotides from the direction that are showing on your screen. So if you go over here and you right click, now that doesn't work. If you look up here at the top under gene seq, you see an arrow that points down, okay? And right there, that little arrow, this is the end of this strand, okay? The big arrow is pointing up and this little black arrow is pointing down. All the genes that lie on the right hand side of this chromosome symbol here, that big black line in the middle, are transcribed in that direction, from top to bottom, and on the left-hand side all of those genes are transcribed in the opposite direction, bottom to top. So you have to keep track of orientation on this map to know which end is five prime and which end is three prime.

There is a reason that you have to remember the orientation of these genes and that's because the most common use, one of the most common uses for this, for the Map Viewer, is people who will come in and ask you, I need the promoter region for this gene. How do I get it? That sequence will contain all of the regulatory regions generally for this particular gene, okay? All of these lists here, they're all obvious names for things. You all are familiar with those, right? DL stands for download. And we're going to click on the download for this particular gene. Okay, and can it open a new window? It doesn't matter which gene you choose to download from. If you didn't keep track of the orientation, you wouldn't know which strand to get for the promoter. So we are looking at the download for that particular region, it goes from here to here if you're going five prime to three prime. If you're going in the other direction, you're going from here to here, so you need to know what those are. In order to extend this one, you need to subtract the number of base pairs that you want for the promoter. If you're going from bottom to top, you want to add the number of base pairs. And it used to be that you had to actually do the math with those big numbers in order to know how much, but now you can just add in the number of nucleotides that you want. So let's say that I wanted , so it tells you which strand you're on, so the positive strand is on the right hand side going down. So you're still reading from five prime to three prime, but it's still on the other strand.

So this is the top of the chromosome and this is the bottom. At least of that region that we saw, of that gene, okay? So here is the five prime end of that gene and here is the three prime end because we're going from top to bottom. If we were going from bottom to top, if it was a gene from the left-hand side, we would be going from five prime to three prime here. You can tell in several ways. First of all, the gene would lie on this side instead of this side, so here is my gene and it lies on the right hand side which means it's transcribed in this direction. If I wanted this particular gene, it's transcribed in the opposite direction, okay? From the opposite strand. So this is the plus strand and this is the minus strand. And you'll see the plus strand and minus strand; the only clue you have to which strand you need is down here, where it says plus strand.

[ Class Participant Speaking ].

DL, just push DL, yeah. If my person asked me, I need the promoter region for this gene, how do I get it? I can just say okay, how much do you want? What do you think the promoter in this gene needs in order to function? Some promoters are huge, and some promoters are relatively small. It depends on the gene. So generally, I give them as much as I can. I mean, you can give them a huge amount of DNA but 20,000 is a good standard number to start with, okay? And I can load that now as a FASTA sequence, and you can save it to disk, or you can just display it.

I want to leave the minus sign in that box. So here is my 20,000 plus nucleotides for that promoter region. That's because you chose a different gene.

[ Class Participant Speaking ].

This is a very useful tool. There's another tool out there that I think is actually more useful for this same purpose and I use it more than I use NCBI's, and it comes from the genome browser, UCSC. So I'm going to take a bit of a sequence and we're going to go over to UCSC real quick just so I can show you. It's the University of California, Santa Cruz and they have a very good bioinformatics program there. This used to be called the golden path, so they had a set of things to do to analyze that would be called the golden path, not used in that way so much anymore but it was. So now what I can do is go here to BLAT, and say that I want the human genome. They have an "I'm feeling lucky" button actually here, so we'll push that. And now you'll see some of the differences between the way the maps are displayed. NCBI is the only map that's displayed vertically. All of the other genome maps out there are displayed horizontally.

This is the more common view, and the reason that this is actually in a way better is because if you look down here, you can now put lots and lots of ways of getting at information on there that is much more difficult when you've got all of this vertical stuff up here. Okay? Now, what I want to do is go to DNA, so I've already mapped to my region on the chromosome, this is my piece of DNA that I want to go to, I can just click DNA up here and now I can say add 20,000 base pairs, five prime. I don't have to worry about which strand it's on. They've already done that for me and I can just do it. I can also put in annotations, different types of markers into my DNA when I get it back, so I could make it colored in certain cases. I can mark where different things are mapped to, and there's lots of different choices here that you can add in information to the sequence that you get back.

So this is a very nice tool and lots of people like this tool. They use it for getting the DNA. In a lot of ways it's much faster. With BLAT, you can actually take protein sequence and it will map it to the chromosome as well. It doesn't matter what kind of sequence you have. It will go in and map it to the chromosome. Okay, this is a nice tool here called Sequence Viewer. Here, you can see SV, sorry. This is relatively new and Sequence Viewer is available directly from Entrez Gene as well. So there's a link directly to Entrez Gene. You'll see that this is an Entrez nucleotide record but this tool is built into the record itself, so that's kind of nice. And you can zoom in and out just like that. You can switch strands if you want to here. So the strand that you're displaying, this is the coding region so you've got the amino acid sequence down below the nucleotide sequence here. And actually that's zoom in. And now there's a strand switch.

[ Class Participant Speaking ]. Yeah, there it is, okay. Sorry. The nice thing about this alignment is that it gives you the amino acid alignment right below your nucleotide alignment, so people like to have a view like this because they may be looking for restriction sites. Restriction enzymes cut DNA in particular sequence patterns, so most restriction enzymes that are used recognize a fixed nucleotide sequence and cut either before or after or in the middle of that particular sequence. The way they cut is that many of them will cut and leave an overhang, so they will cut after four and leave an overhang of two or cut after three and leave an overhang of three, and that way you can take another piece of DNA cut with that same enzyme and join it in and clone it. That's how cloning is done.
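As a rough illustration of scanning a sequence for a recognition site, here is a short Python sketch. It uses EcoRI as the example enzyme, which recognizes GAATTC and cuts after the G, leaving a four-base AATT overhang. The helper names and the test sequence are made up, and only the top strand is modeled.

```python
def find_sites(sequence: str, site: str) -> list[int]:
    """Return the 0-based start position of every occurrence of `site`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def digest(sequence: str, site: str, cut_offset: int) -> list[str]:
    """Cut the top strand `cut_offset` bases into each site and return the fragments."""
    fragments = []
    previous_cut = 0
    for position in find_sites(sequence, site):
        cut = position + cut_offset
        fragments.append(sequence[previous_cut:cut])
        previous_cut = cut
    fragments.append(sequence[previous_cut:])
    return fragments


ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1  # EcoRI cuts between the G and the first A

dna = "TTAGGAATTCCGATGAATTCAA"  # hypothetical sequence with two EcoRI sites
print(find_sites(dna, ECORI_SITE))
print(digest(dna, ECORI_SITE, ECORI_CUT_OFFSET))
```

Every fragment after the first begins with AATTC on the top strand; the AATT part is the single-stranded overhang that lets two pieces cut with the same enzyme be joined.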

So people like to be able to look, with relation to the coding region, at where they're cutting, because they may want to express this protein in a bacterial cell in order to be able to purify the protein. So this is a nice little tool to know about. Let's look up here. You've got maps and options over here. Let's choose that. And you see the maps that we have displayed: the contig map, the UniGene STS map, our gene map and the ideogram. You can make one the master map by moving them up and down. So the master map is always at the bottom here. You can add, so if you want to see all of the genes in this region, you can do that. I'm going to make that the master. And then you can just click "Change assembly" and then okay. Now here it is, and the olfactory receptor gene is here. It's actually RefSeq.

Okay. We're going to go here, and let you play with Map Viewer a little bit. Someone wants to know what the MLH1 gene is surrounded by. They want to obtain the genomic DNA, along with upstream data. How would you do this? Choose an organism you like, {LAUGHTER}. Just make sure it has the MLH1 gene, or actually you could do it with your favorite gene and another organism. You don't need to use MLH1, but --

Don't look at my screen. It's not the same as yours!

[ Class Participant Speaking ].

Upstream data may mean the promoter sequence itself. It may mean what genes are upstream of my gene. It's not a very precise question. A good reference librarian would say that!

{LAUGHTER}.

Hi, Katherine, yeah, there was a short pause there in the audio. Sorry about that! But we're back on now.

[ Class Participant Speaking ].

{LAUGHTER}.

They look different.

[ Class Participant Speaking ]. You see how this gene is capitalized, this takes you into a different map, so this gives you the genetic map and the gene sequence and this one takes you into a different one of those list of maps.

[ Class Participant Speaking ].

Why would I want one versus the other one?

Depending on what information you wanted.

[ Class Participant Speaking ].

Exactly.

So I'm not sure, click on that.

So it's a different set. It's some of the same information but it may be laid out in a slightly different way and it will have the gene characterizations rather than the NCBI gene characterizations.

It's probably going to actually do that.

[ Class Participant Speaking ].

Four to five prime end. Well, for this particular gene, it's not going up. If the gene was on the other side, it would be like going down. If you think about going from, so the reason it's called upstream is because all of these proteins, all of these enzymes that work on DNA go five prime to three prime so you go upstream with relation to where everything else is moving.

[ Class Participant Speaking ]. This is the top of the chromosome and this is the bottom, or the top of the gene and that's the bottom. Okay? And if you had chosen from that particular gene, that's the extent of the gene, so what you want is the things that are upstream or five prime to that, and if you're on the plus, that means you're on the plus strand and going to subtract nucleotides from that lower number, to go upstream. Okay? So it's just a random number?

[ Class Participant Speaking ].

Some people want 5,000. It just depends on what the purpose is. There are different types of gene regulation that occur in promoters and outside of promoters, so most promoters are anywhere from 500 base pairs to 1500 base pairs upstream, but there are also things that are called enhancers that make the transcription of a particular gene much higher if that sequence is somewhere within 25-30,000, that I was getting the things that might be there that they would want. And to say that the promoter region has all of the regulatory information for a gene is not true actually, because there's a lot of regulatory information sometimes built into the messenger RNA, there's information that may be in the introns of those genes, there's lots of information in there. But the promoter region is what most people are looking for because they want to know about how the transcription interacts with that upstream DNA of their gene.

So what you want to do is get the five prime sequence now like I showed you, so take the download, the DL, yeah, that's it. I don't remember what it stands for. It's protein, the results page for that protein. So what you want to do now is decide, so someone asks you to get the upstream DNA, okay, so you want to see that you're on the plus strand here and the upstream DNA would come from this region, so this is the five prime end of your gene at the top. So then you want to subtract nucleotides from that. And then click display and it will give you the format for that group of sequence.

[ Class Participant Speaking ].

Let me, I'll help you with this at lunch time, okay? So did everybody get the information that we wanted, that was to answer the question? Do we know what the answer to the question was? Do we know what the question was?

{LAUGHTER}.

Okay, so somebody wanted to know about obtaining genomic sequence data along with some upstream data. Upstream means five prime, like I said, because everything flows downstream. On these maps five prime to three prime and was everybody able to go upstream and pull up DNA sequence and pull out a promoter? Yes?

So what you want is to find out how much DNA sequence they want from that five prime region, so ask them specifically, and if they aren't sure, give them as much as they, you know, give them a lot just so they have everything they might need. And in this case, you're subtracting --

[ Class Participant Speaking ].

It was actually 20,000 that I took. Yeah, and then just hit the display. It's set to give you the FASTA format sequence there and you could also e-mail it to them and save it. They won't really care what the format is that you give it to them in. All they care about is getting that sequence.

[ Class Participant Speaking ].

You can't really, you can't really pull down the sequence data from there, but the nice thing about that with regard to information is that at that point, you know in the nucleotide sequence where the translation start site is for your protein. So then you know exactly where your protein starts, your messenger RNA will start somewhere five prime of that, and then your promoter region will be more five prime of that, okay? The reason they use that particular map or sequence, you're more likely to find good restriction sites within, with relation to your protein map. Unfortunately, NCBI does not do restriction mapping, so when I was doing bench work, I knew the recognition sequences for lots of different enzymes, I could spot them in a sequence just by scanning it. Okay? You don't have to adjust in this case. It's in the correct directionality. If your gene, your protein was coded for on the other strand, then you would have to simply add instead of subtract, so you would add nucleotides to that bottom box and that would give you the five prime sequence for your gene.
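The subtract-or-add rule can be written down as a small Python sketch. This is only the arithmetic, not anything Map Viewer itself exposes; the function names and the gene coordinates are hypothetical. On the plus strand the five prime end is the lower coordinate, so you subtract from it; on the minus strand the five prime end is the higher coordinate, so you add to it.

```python
def extend_upstream(gene_start: int, gene_end: int, strand: str, flank: int) -> tuple[int, int]:
    """Return (low, high) chromosome coordinates covering the gene plus `flank` bases upstream."""
    if gene_start > gene_end:
        raise ValueError("gene_start must be the lower coordinate")
    if strand == "+":
        # Five prime end is the lower number, so subtract (never going below position 1).
        return max(1, gene_start - flank), gene_end
    if strand == "-":
        # Five prime end is the higher number, so add.
        return gene_start, gene_end + flank
    raise ValueError("strand must be '+' or '-'")


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the sequence as read five prime to three prime on the opposite strand."""
    return sequence.translate(_COMPLEMENT)[::-1]


# Hypothetical gene coordinates, with the 20,000-base flank used in class.
GENE_START = 37_000_000
GENE_END = 37_050_000
plus_region = extend_upstream(GENE_START, GENE_END, "+", 20_000)
minus_region = extend_upstream(GENE_START, GENE_END, "-", 20_000)
print(plus_region)
print(minus_region)
```

For a minus strand gene, the region pulled out in plus strand coordinates still has to be reverse complemented, as with `reverse_complement`, before it reads five prime to three prime for that gene.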

{LAUGHTER}.

Well, when you hit that DL button, the only thing that it's downloading, that region is the extent of your coding sequence on the DNA. So it's your gene. So that nucleotide that's at the top there is the five prime end of your gene and that nucleotide that's at the bottom is the three prime end of your gene.

[ Class Participant Speaking ].

But then you're handing, I mean and I always have a problem with librarians who go out and do a search and then do what I call a data dump so they do a search and they find 6,000 articles that have to do with the subject and then they say, here is your results, instead of going through and making sure what they are picking at is actually what the person asked for. So don't do a data dump, if you can help it. Make sure, ask what the person actually wants, and then try and give them what they want.

We'll go through this pretty quick. These are types of maps. We have genetic linkage maps, physical maps, there's different ways of mapping. The cytogenetic map, you can see that on the Map Viewer, those are the banding patterns that you see on each chromosome.

The genetic linkage map has to do with the fact that DNA allows recombination so chromosomes will recombine with each other and exchange pieces and because of that fact, if you have a mutation on one chromosome and a normal gene on another, and then you get recombination, you've got certain, let me explain this better. If you've got a chromosome here, and there are three different things that you can distinguish on it like STSs or like normal eye color, normal hair color and normal finger length, it doesn't matter what it is and then you've got another thing down here where you've got, looks like this is the same. If you have an exchange here, you're going to come up with a different set on that chromosome.

This one will have that mutant that's on there on it and this one will look like this one did before, okay? What you want to do is look at, so you need markers all the way along that are actually different so this marker should have been all different. When you do that exchange, you're counting the numbers of times that you get new combinations in the offspring and by doing that, you can tell how far apart those genes are on the chromosomes because of the frequency of that exchange. So it's simple statistics, how many times did this exchange occur between a particular distance. That make sense?
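The "simple statistics" here amounts to one division, sketched below in Python. The offspring counts are invented for illustration. By convention 1% recombination between two markers is called one centimorgan, which is a reasonable estimate of distance only when the markers are fairly close together.

```python
def recombination_percent(recombinant_offspring: int, total_offspring: int) -> float:
    """Percent of offspring that show a new combination of the two markers."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("recombinant_offspring must be between 0 and total_offspring")
    return 100.0 * recombinant_offspring / total_offspring


# Hypothetical cross: 12 of 200 offspring carry a new combination of markers.
percent = recombination_percent(12, 200)
print(f"{percent:.1f}% recombination, roughly {percent:.1f} centimorgans apart")
```

Markers that are farther apart on the chromosome have more room for an exchange between them, so they show a higher percentage.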

Physical maps are done in a much different way. Originally, physical maps were done by taking mouse cells in culture and human cells in culture and exposing them to the virus, some leaky virus, that would cause their membranes to fuse, and the cells fuse together and you've got one cell that has both mouse and human chromosomes in it. You let that cell continue to divide and what happens is that the mouse cell starts throwing out the human chromosomes. Because we can distinguish mouse from human chromosomes, we can follow which traits, if we have particular biochemical traits or we put markers, now we have molecular markers on those chromosomes, you can follow which chromosome has that particular trait. Okay? Does that make sense?

So then eventually these cells become stable with one or two human chromosomes in them and they just continue to divide in culture with one human chromosome and the rest the mouse genome.

There's also a way of mapping called a radiation hybrid map, and with that, you can do it with just human mapping or you can do it in the same way, and in that case what you do is hit the cell with high amounts of radiation. Not enough to kill it but enough to break up the chromosome. You're wanting to cause large mutations, deletions, translocations and those kinds of things, so as these cells divide then you start to narrow out where those mutations are, and again with the appropriate biochemical markers on those chromosomes, you can see when this trait disappears. If you either translocated it so that it segregates with a different chromosome, you know that it's on that piece that translocated, or if you deleted it, you know where that deletion is. So it's a different way of actually physically mapping that piece of DNA.

The sequence map that we have now is of course the best. It's the highest resolution, you're down at the nucleotide level. You're not just mapping things based on markers. You're actually seeing the gene itself. Here is the cytogenetic map and here is where we talked about the P and Q arms. And then the banding patterns and here is about recombination. You may, these terms aren't used much anymore because we have the sequence for so many organisms, but distances between chromosomes, or between traits, between genes, used to be mapped by what was simply a percent recombination between traits, and the centimorgan generally is one, what does it say here? A centimorgan is 1% recombination and I think it's mapped out to one. Here it is. It's mapped out to one megabase.

Radiation hybrid map, the yeast artificial chromosome maps. You can build chromosomes, so you can take human sequence, put a centromere sequence into this large piece of human DNA and it will segregate in yeast as a chromosome in and of itself. You have to put [INAUDIBLE] on the end so that it reproduces properly, but -- I think that we'll skip by this. This is just telling you that the sequence map is the definitive map. Why do we have all of these different maps? It has to do with the techniques that were available to do these things at the time that the map was done. So we have the best map now because of that. But before, all we had was percent recombination. It was actually Sturtevant who in 1913 came up with the first genetic map. Here you're getting at another promoter, so we want to find another promoter? I think it's good.

Okay, I have an example that I want to show you, so I'm going to go to a new page here and just go to gene and pull up our example from that first . So I'm going to go into Map Viewer and here is my gene, you can tell by the gene name. I want to know which direction it's transcribed in because I know that I want to get at the five prime end so I know that genes that are listed on the right are transcribed from top to bottom. I know that my five prime end is this end right here. You can see that there's another gene here that's actually in the introns that's transcribed in the same direction and it's these pieces are in the intron of MLH1. And I know that what I want is sequence so I'm going to go here to the download page and I know that I'm on the plus strand which is that right hand strand and my five prime end is up here, so it's the smaller number at the top there, so let's say I'm going to take 30,000 just because this person wants to go way upstream. It doesn't matter, either way. So I'd probably get past that and beyond with that. But so now this is that five prime end of that sequence, and it tells you exactly how much you took right there .

[ Class Participant Speaking ].

This is one of the most common questions that you'll get. People want the promoter region for their sequence for their gene.

[ Class Participant Speaking ].

It depends where it is. So you see, each of these little bars is actually an icon for the MLH1 gene. So each of these thinner lines is the intron space. Remember what the introns and exons are? Introns are the pieces of the message that are gotten rid of, so this gene, if you look, is actually, here is an exon from our MLH1 gene, and here is another gene that's within that intron here and here is part of that same gene probably that has another alternate splice in the next intron for MLH1.

[ Class Participant Speaking ].

Oh, sorry. This is actually on the left-hand side, so it's on the other strand. It is transcribed from the other strand, but this, so here is our MLH1 gene, with its introns and exons. Here is, go ahead.

[ Class Participant Speaking ].

Oh, RefSeq simply means the transcripts that are showing on this, the genes that are showing on this are RefSeq genes. They're from the RefSeq database. Okay? If I have a transcript, so --

[ Class Participant Speaking ].

Oh, here?

[ Class Participant Speaking ].

Each of these thicker bands that go this way?

[ Class Participant Speaking ].

Yeah, they're the exons. Okay? In between are the introns. MLH1 is a very large piece of DNA in the genome. The mature transcript is a lot less large, because so much of it is spliced out. All that will show you is , well, it may give you, I also forgot to tell you, you can move up and down the chromosome using these numbers so if you add numbers to here, we'll move down the chromosome, so you can march down the chromosome by doing that. So here I am now at the center of my gene where I was up at the top before.
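To see why the mature transcript is so much shorter than the gene, here is a toy Python sketch of splicing. The sequence and exon coordinates are invented and tiny; a real gene like MLH1 has many exons separated by introns that are thousands of bases long.

```python
def splice(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments, given as 1-based inclusive (start, end) pairs in order."""
    return "".join(genomic[start - 1:end] for start, end in exons)


def spliced_length(exons: list[tuple[int, int]]) -> int:
    """Total length of the exons, which is the length of the mature transcript."""
    return sum(end - start + 1 for start, end in exons)


# Hypothetical toy gene: a 4-base exon, a 10-base intron, and a 5-base exon.
genomic = "ATGG" + "GTAAGTCCAG" + "CTTAA"
exons = [(1, 4), (15, 19)]

print(len(genomic), spliced_length(exons))
print(splice(genomic, exons))
```

The thicker bands on the map correspond to the entries in `exons`; everything between them is removed from the message.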

Okay. Why would you use the Map Viewer? There's lots of reasons to use Map Viewer. This particular example I'm going to show you is kind of a personal example, because I have a son who has Ehlers-Danlos syndrome, which is called hypermobility syndrome. You may have seen the rubber band guys. It's a mutation that affects collagen, so your joints are not, my son's shoulder is [INAUDIBLE] and it happens. And it's an interesting genetic problem here, because Ehlers-Danlos syndrome, it's caused by this gene that controls how collagen is used, and if you look right up here, there's a XA, this is a non-functional gene, and because it's so close here what you get is a lot of recombination in there that shouldn't happen. So because they're so highly alike, you get mutations here because you recombine out and make a deficiency in this region. So you get a non-functional gene here, and another interesting thing is that right down here are some of the HLA genes. I'm a diabetic and this is one of the hotspots for diabetes, so this region, it's probably somewhat affected in my family in some way. We don't know yet.

This kind of thing can be useful for people to know about , and to look at and understand how these things can occur and what they look like. So even for a patient that you can make them understand what may have happened and why and where things are coming from. My son has missed three months of school this year because of this syndrome. It causes migraine headaches and joint problems and all kinds of things. Okay!

Let's finish this up. You guys want to do another, find another promoter region just so you can get used to doing it? So go to the problem that is on that, I can't remember which one it is. Let's do this quickly and then we'll take a lunch break. And then this afternoon, we're going to do more on, we'll do some user questions that you guys will do so you get more hands-on experience, and then we'll talk about how to use this information in your library and how Chris and I are using it in our library. And the kinds of things that are happening at our library, so we'll give you a chance to talk to us and talk to each other about ideas and opportunities.

{LAUGHTER}.

If you guys are ready for lunch, feel free to go. Let's try and be back here by [INAUDIBLE}. Katherine, I think he said 12:35? The audio was very faint and unclear. I apologize.

Right now, it depends, so you want to get five prime sequence, right?

Right.

Yes, you need to adjust up farther upstream, so you know that it's on the plus strand, so you need to subtract. It's going to depend on what research you want. Not change region. You don't want to hit that.

Oh.

You want to either hit display or save to disk.

Okay.

[ Class Participant Speaking ].

What you could do is simply take this region and go back and blast it so you can see where those pieces line up. So if you were to take the whole piece and blast it, you would get lots of chromosomal regions, so what I would do if you wanted to find out where the transcript started is to limit the search to mRNA, so that way when you do the blast, it will line up exactly where the transcript start site is for that piece. Does that make sense?

Katherine, do you need me to keep transcribing or are you going out to lunch?

No, that's if you do a blast search.

[ Class Participant Speaking ]. What I would do is go over to links and go to gene.

Thank you so much, Katherine. Have a great lunch! "See" you this afternoon!

Class is taking a one hour lunch break and will reconvene at 1:35 p.m. Eastern Time Zone.

Please stand by for realtime relay captions.

Not yet. I've got one thing I want to go over, I made a mistake. I want to make sure that you're not all confused with that, I made a mistake. Generally when I am asked to get a promoter region I go to [ Indiscernible ]. When you go into NCBI Map Viewer -- I'm going to pick a random gene here just to show you. You do have to pay attention to the directionality of these arrows. Let's zoom in here. Actually, let me go back and show you the gene we did the original stuff on so you can see exactly -- okay. I know I could totally confuse you all. When you come into the download, I actually did it right for this one, when you come into the download area it always comes in on the plus [ Indiscernible ] by default. It doesn't choose the strand for you. So you have to pay attention to this display because which side your transcript is on matters with regard to the direction of transcription.

[ Class Participant Speaking ].

Yes. This particular transcript you can see is on the left hand side of that line. It's from bottom to top, okay? When you come into the download viewer it's always on the plus strand. If you want you either have to add here to get [ Indiscernible ], or just change the strand. Change region, yeah, right, sorry. Now we're the other direction. We have the minus strand up here. We could just adjust at the top there. By subtracting from the sequence.

[ Class Participant Speaking ].

No. Right. The right-hand side of that is always plus. The left hand side -- right.

[ Class Participant Speaking ].

No. Top to bottom is always plus. Bottom to top is minus. You can see why we use [ Indiscernible ] genome browser instead of NCBI for downloading.

[ Class Participant Speaking ].

No, because remember these strands are called antiparallel. Each strand --

[ Class Participant Speaking ].

Each strand runs in a five prime to three prime direction. It's not dependent on the other strand to be three prime to five prime, okay? We've got two strands of DNA. Okay? As a double helix. Okay? Each side is five prime to three prime, okay? This strand goes this way, and this strand goes this way, five prime to three prime.
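The antiparallel point can be sketched in a few lines of Python. This is only an illustration, not anything shown in class: the function name `reverse_complement` and the example sequence are made up here. Given the plus strand written five prime to three prime, the minus strand read five prime to three prime is the complement read backwards.

```python
# Complement of each unambiguous DNA base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(plus_strand: str) -> str:
    """Return the minus strand, read 5' to 3', for a plus strand given 5' to 3'."""
    # Walk the plus strand backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(plus_strand.upper()))


plus = "ATGCCG"                    # hypothetical plus-strand fragment, 5' to 3'
minus = reverse_complement(plus)   # same region on the minus strand, 5' to 3'
print(plus, minus)
```

Applying the function twice gives back the original sequence, which is one way to see that neither strand is "backwards": each is five prime to three prime in its own direction.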

[ Class Participant Speaking ].

Here's our middle line. This is the plus strand, this is the minus strand. This is the five prime end and this would be the five prime end of genes transcribed this way.

[ Class Participant Speaking ].

Right.

[ Class Participant Speaking ].

Right. In that top box.

[ Class Participant Speaking ].

Then you put a plus on the bottom box.

[ Class Participant Speaking ].

Right, right.

[ Class Participant Speaking ]

This is convention. Right. Plus the, generally people have always called the coding strand the plus strand. This was a convention that this right hand side would be the plus strand. I don't know why. It was just a --

[ Class Participant Speaking ].

So --

[ Class Participant Speaking ].

To get the five prime end, yes. You add to -- actually. If you see a minus here, I've changed it. This comes in as plus by default.

[ Class Participant Speaking ].

Okay. If I go back to the plus strand here, there's not a simple rule. Except that you know that if you have an arrow that goes up, you come in this way, you add here to extend the five prime end. If you have an arrow that goes down you subtract to get the five prime area.
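The add-or-subtract rule can be written out as a small Python sketch. The function name `upstream_region` and the example coordinates are hypothetical, and it assumes the gene's start and end are given as plus-strand positions with start smaller than end. An arrow going down (plus strand) means the five prime end is at the smaller coordinate, so you subtract; an arrow going up (minus strand) means the five prime end is at the larger coordinate, so you add.

```python
from typing import Tuple


def upstream_region(start: int, end: int, strand: str, length: int) -> Tuple[int, int]:
    """Coordinates of `length` bases upstream (5') of a gene.

    `start` and `end` are plus-strand positions with start < end
    (an assumed convention for this sketch).
    """
    if strand == "+":
        # Plus strand: the 5' end is at `start`, so subtract from it.
        return (max(1, start - length), start - 1)
    if strand == "-":
        # Minus strand: the 5' end is at `end`, so add to it.
        return (end + 1, end + length)
    raise ValueError("strand must be '+' or '-'")


# Hypothetical gene spanning positions 5000..9000.
print(upstream_region(5000, 9000, "+", 1000))
print(upstream_region(5000, 9000, "-", 1000))
```

For the minus-strand case the region still has to be read off the minus strand (reverse complemented) to get the promoter in five prime to three prime order.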

[ Class Participant Speaking ].

No. All right. Okay. Everybody got that? We can all get promoters.

Okay. Now, we're going to force you to move so that you can solve problems with each other and get to know each other a little better. Um, so, um, just find a nice spot that you want to sit with a computer and answer the questions that are assigned to you. We'll go through them as we go along. As you are doing them if you have questions raise your hand. We'll come to help you if you run into problems, okay?

Let me just say a couple of things about these questions. This is the module that is called the user questions module. The second slide in is the user questions grab bag. These are the numbers your questions correspond to. I just want to point out this one thing, for those of you who are looking at number 7, that, um, that blank space there is really meant to just say "blank." It's not meant to say that you literally blasted a bunch of underscores, it's just a place holder, okay? We're not doing number 17 as a group. We'll do that later on together. Nobody is assigned number 17 as of now.

Group working on individual problems.

We're going to start with number 7. Which was: when I BLASTed blank against the nucleotide NR database a number of the bases were replaced with [ Indiscernible ], why did that happen? And how can I see my complete query sequence? Let's hear what we're thinking.

[ Class Participant Speaking ].

So you wanted to go back to the IUPAC table, you knew that the letter N meant there were substitutions. There could have been multiple bases for that. Great that you remembered that N had meaning, too. You went back to the IUPAC table in the slides. Let's see. Where was that? This IUPAC table. There's two IUPAC tables in the slides. Right, N meant it could have been substitutions for any of those, right? Sure, there's other ways to get to this besides using the slides, for sure. Okay.
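For reference, the IUPAC nucleotide codes can be held in a small lookup table. The sketch below is in Python; the names `IUPAC` and `possible_bases` are made up for illustration, while the code meanings themselves are the standard IUPAC ones.

```python
# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def possible_bases(code: str) -> list:
    """List the bases a single IUPAC letter can represent."""
    return sorted(IUPAC[code.upper()])


print(possible_bases("N"))  # N can be any of the four bases
print(possible_bases("R"))  # R is a purine: A or G
```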

[ Class Participant Speaking ].

Once you knew that was what was going on, what is next?

[ Class Participant Speaking ].

Okay. The idea that you would need to modify your BLAST search to show you what was going on in that section that was replaced by Ns.

[ Class Participant Speaking ].

Right. Okay. Okay. What did you want to go ahead and change, you thought? Sorry, I missed that.

[ Class Participant Speaking ].

The question had said they used the nonredundant nucleotide. That's why I changed it to that.

[ Class Participant Speaking ].

Yeah. Yeah, okay.

[ Class Participant Speaking ].

What this is, sorry, that's an old question. What used to happen when you did a BLAST search was that this filter was on. And it would replace anything that was, um, highly repetitive with Ns to let you know it had filtered those out in the search. Now it puts them as lower case. You don't see the Ns anymore. You still see the sequence. Right.

[ Class Participant Speaking ].

Not anymore. But people will come in and ask you why did it change my sequence to lower case here. Okay? If they use something that has a filter like this, it will do that. People will come in and ask you why am I seeing lower case instead of my query sequence?
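The difference between the old display (Ns) and the current one (lower case) can be sketched as follows. This Python example does not detect repetitive sequence the way BLAST's filter does; the masked interval is simply supplied by hand, and the function name `mask` and the 0-based, end-exclusive interval convention are assumptions made for this illustration.

```python
def mask(sequence: str, regions, soft: bool = True) -> str:
    """Mask the given (start, stop) intervals of a sequence.

    soft=True  -> masked bases shown in lower case (sequence still readable)
    soft=False -> masked bases replaced with N (original bases hidden)
    Intervals are 0-based and end-exclusive (an assumption of this sketch).
    """
    bases = list(sequence.upper())
    for start, stop in regions:
        for i in range(start, stop):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


# Hypothetical query with a run of eight A's flagged as low complexity.
query = "ACGT" + "A" * 8 + "GTCA"
flagged = [(4, 12)]

print(mask(query, flagged, soft=False))  # old-style display: run replaced by Ns
print(mask(query, flagged, soft=True))   # current display: run in lower case
```

With the lower-case display the user can still read the whole query, which is why the question about recovering the complete query sequence no longer comes up in the same way.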

Okay. You also had question 12. Which, I believe we had a couple of people working on question 12. You will get your chance to share on 12. They will get their chance. You will see if you used the same strategies or different ones. Question 12 is that you would like to download the [ Indiscernible ] for the human Huntington's disease gene, plus a thousand base pairs upstream and downstream, and how do you do that? So Brian, how do you do that?

[ Class Participant Speaking ].

We went to Entrez Gene. Let's go ahead and do that. Let me just get over there real quick. We're in Entrez Gene.

[ Class Participant Speaking ].

Limit to --

[ Class Participant Speaking ].

Excellent strategy considering HD is a very short nondescript set of --

[ Class Participant Speaking ].

Now, wait a second. Did this happen to you too? Yeah, okay.

[ Class Participant Speaking ].

Right. It is for gene name pulling from the also -- sorry this is not working anymore. The also known as field down here, as well as the actual official symbol field. Great. You found the gene.

[ Class Participant Speaking ]. [ Laughter ]

You went to Map Viewer over here on the right? Okay. You could also have gotten the Map Viewer in the body of the record. There's the link to take you to Map Viewer. There's many different ways. You came to the map.

[ Class Participant Speaking ].

Clicked on DL for download. Okay. Adjusted them to --

[ Class Participant Speaking ].

Okay. Okay. And then what?

[ Class Participant Speaking ].

Would you give it to them in this format? Or would you do something else?

[ Class Participant Speaking ].

Not really a trick question. You could actually just copy and paste this. It is displaying it in the FASTA format. You could copy and paste this and give it to them. Or you could send it to the file to save it. Really depends on how your email system works that you are using to send it back and forth to the person. Great, great work you guys.

If you minus 1000 and plus 1000 will it cancel each other out? The answer is no. It is taking 1000 on one end and 1000 on the other end. It is extending it on both sides by a thousand. That's what it is doing. Right, right.
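The arithmetic is easy to check with a short Python sketch. The function name `extend_region` and the coordinates are hypothetical: subtracting from the start and adding to the end are two separate adjustments, so the region grows by the flank on each side rather than cancelling out.

```python
from typing import Tuple


def extend_region(start: int, end: int, flank: int) -> Tuple[int, int]:
    """Extend a region by `flank` bases on both ends (coordinates are 1-based)."""
    # Subtract on the low end, add on the high end; never go below position 1.
    return (max(1, start - flank), end + flank)


# Hypothetical gene spanning positions 5000..9000.
new_start, new_end = extend_region(5000, 9000, 1000)
print(new_start, new_end)
print((new_end - new_start) - (9000 - 5000))  # how much longer the region became
```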

[ Class Participant Speaking ].

Uh-huh.

[ Class Participant Speaking ].

I'm just looking at the numbers in the range here to see how they compare to the numbers in the range that I had up here as well. Go ahead.

[ Class Participant Speaking ].

We look at that arrow.

Here. Just talk to it, I'll hold it.

Transcription is from top to bottom here. The plus strand is the --

She's talking about taking the thousand, just because I changed the number it's not applying the number [ Speaker Faint/Audio Unclear ]. Is that what you are saying?

That's what I think [ Speaker Faint/Audio Unclear ].

Let's go back again. Before we change it by a thousand each the position that we're at is 6206 and 5484 for the gene itself, right? If somebody takes a mental note of those numbers. And we go ahead and then say that we our thousand. I understand what you are saying. Do we have to do that change region to register our changes or not. Let's display it without making that change first. Okay. What were the numbers we had before?

[ Class Participant Speaking ].

It's not giving us the same numbers, okay. Right. So let's do --

[ Class Participant Speaking ].

You are right. We need to click the change region and strand to make that go. Change region. Okay.

[ Class Participant Speaking ].

Okay. Okay. Yeah. Now those numbers are different. Yeah. Good eye. Good eye.

[ Class Participant Speaking ].

Minus, right.

So the question was if we just wanted to add upstream to this particular gene, because this gene runs 5 prime to 3 prime we would go ahead and just subtract [ Indiscernible ] by 20,000 on the top end of it. Right, okay. I think somebody else did question 12, was that you guys also?

[ Class Participant Speaking ].

That has to do with relation. So I think this has to do with relationship to the [ Indiscernible ], up at the top, the piece that you are pulling out from that contig. This has a relationship to the, that's the contig, actually. That may be with relationship to the entire chromosome. You are looking at the entire chromosome sequence, you are pulling out that segment. Within the contig itself. That's where you are getting the sequence information.

[ Class Participant Speaking ].

All right. So you guys also did question 12. Anything different about your approach. Did you also start with Entrez Gene and look for the gene first?

[ Class Participant Speaking ].

Okay.

[ Class Participant Speaking ].

Not particularly more efficient. Some people might have gone straight into looking for the gene in Map Viewer directly. People could have gone, if people were comfortable with OMIM they might have been there looking for that gene, because it's a disease-specific gene. So I don't think there's any approach that is more efficient, necessarily. Just that you could have come at it from any number of tools that are in there.

[ Class Participant Speaking ].

As all of the contigs are laid. Why only one contig? Because of all of the pieces being laid against each other. The piece that had that particular gene happened to be that one piece. If a gene was overlapping two contigs it probably would show you both. This one just wasn't doing --

[ Class Participant Speaking ]. [ Speaker Faint/Audio Unclear ].

I think the answer is no, I was thrown by that at first.

Researchers will never use the phrase genetic sequence. Genomic sequence is whether it's a piece of [ Indiscernible ] or a genome. It's the layout of the entire genome, right? You could pick out cDNAs that are just messenger RNA. You specify what you want by saying genomic [ Indiscernible ] versus [ Indiscernible ], versus [ Indiscernible ]. That will determine where you search.

Yeah. Actually in the gene record you might even get a better sense of that too. You know, this section that you are looking at labels it as the genomic, if you were double checking yourself. In the context of this you can go, this is genomic data. That can be reinforcing for you. Okay. Great. [ Audio Cutting In and Out ].

Yeah.

Yeah.

Okay. That was question 12. Question 7. Okay, now we'll go back and do chronological order. So question number 1. Going back to the beginning, how can I retrieve the previous version of the sequence data in accession U46667? Question one is in the back. Let me go into the core nucleotides, you are saying.

[ Class Participant Speaking ].

So I clicked on the black bar to pop myself over to nucleotide. You can see we're in core nucleotide. What from there?

[ Class Participant Speaking ].

Okay, we just entered U46777. Okay. It knows how to deal with those, okay. You open the complete record by

Okay, so next to the number, you have the link to the reports, and it went to revision history, okay? So you can see that, we actually go ahead and open it since they want to retrieve the previous version of the sequence data. So anything else? If you want to see what was different about them, remember you can do the "Show differences" feature. So it just, again, it gives you a chance to see anything in particular that changed over the time for it. Okay? Great! But let me go back.

[ Class Participant Speaking ].

That's a good question. It defaults to selecting those two versions but I think you could if you wanted to compare the two dead versions. I think you can only compare the live one with the two dead ones but let me see. Let me try to do it. I've never actually, sorry, is that what I'm doing?

[ Class Participant Speaking ].

So if you wanted to compare one and two, so the one and two positions I guess are just the comparison. I've never bothered to compare anything except the live version and the original dead version or the latest dead version, but at any rate, so you see the massive difference here between them knowing for sure that it started with one versus not being sure there, so that does make a slight difference. Okay, great job!

Okay, Question 2 is a multi-part question involving some science learning and something, so the record with the number U67889 has a GC signal, a TATA signal, a five prime untranslated region. Oh, wait, I should just say five prime UTR, and three prime UTR, annotated in the features field. Where can you find the answer to what those things are? And what are they, for those who want to know? So where can we find the answer to what they are?

[ Class Participant Speaking ].

What database did you search the number?

[ Class Participant Speaking ].

Okay because I want to repeat what you all did. So you went to all databases?

[ Class Participant Speaking ].

We'll have lots of things hang up and do weird things. Just do them again and usually it reasserts itself to normal very quickly. Okay, so we entered in and we have one finding here in core nucleotide. Okay, and so --

[ Class Participant Speaking ].

Yeah. No, it's a very healthy strategy to do actually, to search across all databases for that. And you know that features are a function of records submitted to GenBank, so this is a chicken saliva protein or salivary gland protein, {LAUGHTER}, okay? So we jump down in the context to the feature table. So we click on GC signal.

[ Class Participant Speaking ].

Uh-huh? Yeah, absolutely.

[ Class Participant Speaking ].

Right, yeah. Really it just says that they found that coming an experiment. Okay.

[ Class Participant Speaking ].

So you know what sequence represents the GC signal by looking at the GC signal. You still may not realize what biologically a GC signal was or why anyone would care.

[ Class Participant Speaking ].

So well you know what sequence they are, but just not what they mean or do or are.

[ Class Participant Speaking ]. So we went out into books see if we could find a definition. Which one did you try?

[ Class Participant Speaking ].

So let's try, we want to try the TATA signal? Did you actually type in that signal or what did you put? Although we're trying very hard to answer some of these questions with NCBI resources, certainly the background of the biology you could answer using any number of particular tools. Did you follow any of these out that you recall?

[ Class Participant Speaking ].

Sorry. key terms was it? So that's really helpful, right?

{LAUGHTER}.

[ Class Participant Speaking ].

Fair enough, you tried Google.

[ Class Participant Speaking ].

So we're going to go look at the sample GenBank record by going to the alphabetical list on the site map and choosing GenBank sample record and did you jump down to the features part of this?

[ Class Participant Speaking ].

So did you go to the features explanation?

[ Class Participant Speaking ].

Okay. So the complete list of features is in the feature key reference, Appendix 3. This is the information that the people who were submitting get so they know how to describe. This is the thing that I was telling you before that even though we refer to it a bunch of times you never kind of find it organically like you have to remember where you found it.

[ Class Participant Speaking ].

Did I really? {LAUGHTER}. Oh, that's kind of funny. Okay. Let's see, where is it? Let me just hold on a second. That's still not what I want, is it? So but you're right. It's still not really telling you that much. It's telling you how they need to format it for those but it's not fully --

[ Class Participant Speaking ].

Here it is. Yeah. So here it is. The definition of the TATA signal. Is that really, the Goldberg box?

[ Class Participant Speaking ].

Okay, so there we go. Yeah, the rest of them will also be here as well, great. So this is one of those things that again, if we're working with people who don't do submitting and are just reading everybody else's records but not doing their own submitting, you know, the specifics of how NCBI names and describes those features, it is all covered here. The reality of it is probably nobody will backtrack their way to the feature table to find this. They will already know, if they're a scientist they will already know what a TATA box is. They won't be asking what that is.

[ Class Participant Speaking ].

Yeah. Okay. Was that it? I think so. Okay.

{LAUGHTER}.

Did you have a question?

[ Class Participant Speaking ].

Okay, so who has number 3? Okay, so Number 3 is, good grief! Hold on, let me just get the question. I'd like to retrieve mouse mRNA sequences for the Huntington disease gene but I don't want any expressed sequence tags in the results. Ladies?

[ Class Participant Speaking ].

Okay. So, we went to [INAUDIBLE] Inheritance in Man. Let me close this window. So did you do the search like this or did you use the limit to limit to mouse as an organism?

No.

You just typed it like this? What, Huntington, the gene name, not the disease name, right?

Right.

Okay. Did I make a typo, is that what I did?

[ Class Participant Speaking ].

That's why I was asking, did she want the disease or the gene? I couldn't quite -- there was a length limitation back when they were naming the gene. But anyway, please continue.

[ Class Participant Speaking ].

So you notice that it ignored my mouse organism search and you'll see how it deals with that later. Anyway, you took the first record you said?

[ Class Participant Speaking ].

Okay.

So you figured out that the gene name was HD, okay. So you clicked on the gene map locus.

[ Class Participant Speaking ].

Okay. Did you go to the mouse gene here or you actually went into RefSeq separately? I just want to make sure I understand what you did.

[ Class Participant Speaking ].

While we're here, because I don't think we have a lot of things to take us into OMIM, let me just say a few words about this. So you see along the side you actually have a lot of different places you can go like we've seen in every other thing, so one of the things is that you can go to the gene map and there is actually in the maps an OMIM, can you guys go over that?

[ Class Participant Speaking ].

There is a map specific to OMIM in the maps as one of the versions you can look at, but at any rate when you go to the gene map locus, here, it is going to, most of the time if there's mouse data, there would be a column over here where you'd have this, but it's not going within NCBI, because the mouse people are mostly at the Jackson lab and you know we talked about some of the different organisms that have their own groups of highly active people, so this is actually taking you out to that mouse group. So I just wanted to kind of point that out because the question was asking for mouse, so had you been in here the tendency might have also been to follow the mouse link and see what you were getting. So you went to RefSeq from here? So did you click on, what did you click on to go from here to RefSeq?

[ Class Participant Speaking ].

So you clicked on the NCBI symbol? You used the alphabetical list to get to RefSeq. There's many ways that you could have done that from where you were. You could have gone back directly into, you know, the protein database and whatever approach you were going to take, that would have been fine. This is fine too, so you came to RefSeq and what did you do?

[ Class Participant Speaking ].

So you ended up going to gene from here?

[ Class Participant Speaking ].

Okay. Notice that my limit to human was on here from my previous search, so let me take that off and you said you did HD and mouse? So you just did HD?

[ Class Participant Speaking ].

Okay. So right. So if you know that you're doing mouse there's a couple of things you can do. One is that you can go directly to mouse as an organism but remember I think we also talked a couple of times about EntrezGene having information with the mouse and the rat, remember, this morning? The first thing this morning we looked at that so even if you had gone to the human HD record which we had looked at before in our history, in the human record, you could actually get over to the mouse and get the mouse sequences from here as well.

[ Class Participant Speaking ].

Yeah.

Where do you want that? I thought we were there.

Homology is right there.

Right, on the right hand side.

If you click on that, it will give you sequences from [INAUDIBLE] to [INAUDIBLE].

All right so then what did you guys do?

[ Class Participant Speaking ].

So you were in the human one to begin with?

Yeah.

That's okay, so talk about how you would have done it for human, that's fine. You were here in the RefSeq part of it?

Yeah.

Okay. And so you came in here to the mRNA, to the NM record because what you were looking for were mRNA?

Yeah.

Okay.

[ Class Participant Speaking ].

So there's all kinds of, I mean I think this actually illustrates pretty well how many different starting points you can go from and what your approach can be, and it's also one of those things for saying, send me your question in advance and you try five different ways and figure out the easiest one that makes sense and that's the one you go and show your user how to do. There's a lot of people kind of doing this discovery themselves and the more they click through and click to different things the more they throw up their hands and go, well which way is the best way? That question that you all have said a couple of times, like well which way is the best way, is a question that you can help your users with if you've spent time going through all of these things. Eventually you'll figure out which are the most efficient ways for what you're doing.

[ Class Participant Speaking ].

Probably, I think for this to go into EntrezGene and either use Homologene or going over from the reference sequence that you went to was fine. Going back to this issue of ESTs and that you don't want any, if you are in RefSeq or in Entrez nucleotides in general, remember one of the things you can do is to exclude, that's weird. Sorry. Oh, we're in core nucleotide, right. Remember we said that they made it into the three components and one of the components is the ESTs and one of them is the GSSs and one of them is core nucleotide? We were already in core nucleotide which meant by default there were no ESTs, but had you been not in core nucleotide but in nucleotide in general, well it still keeps trying to switch me to it. It's like "Stop it!" It's going to keep trying to force you into it, but if you had been in the whole package, there would have been a limit that said "Exclude ESTs" and back when this question was written there was no core nucleotide component and the only way to do it was to exclude the ESTs, so another example of how times have changed for the better. Now you don't have to worry about that.

[ Class Participant Speaking ].

Well, no. RefSeq is actually better to go into. Remember that EntrezGene and RefSeq are basically the same underlying records, and only if you don't have what you're looking for in EntrezGene or RefSeq do you really want to then go out to the bigger pool. When you're searching for something that's very known, you want kind of the smallest pool possible to pick it out of. You know, it's the only thing about searching systematic reviews versus searching the whole world. You'd rather search in the narrow pool. It's that kind of mentality, and only if you don't find it in the small pool do you kind of go look at the bigger ocean or whatever.

Okay, so Question 4. I'd like to retrieve all protein sequences for the family Canidae. How can I do that? You guys have that one, right? So we're going to choose taxonomy from the list of Entrez databases. So we're clicking "Go" to be clear that we're moving over to that database and now we're on the Entrez taxonomy page. So we'll type the family name here, and hopefully we're going to spell it right. Yeah, we can, okay. {LAUGHTER}. At the end of the day the spelling is usually one of those things that goes! Okay?

[ Class Participant Speaking ].

All right, so we, I should say what we did. I just did what she told me which is to go to the link menu and choose the protein and these are the hard one-to-one links, or one-to-many links, so protein, this should retrieve all of the protein records that have this in the family name in like the organism hierarchy.

Another way you can look at this is if you actually go into the complete version of the record. Can you guys actually do this as well?

[ Class Participant Speaking ].

Extra. And usually if you go into the complete family name, over on the right hand side, they have this lovely table of how many records you have available, so if somebody asks you for the whole group and you're like well, do you really mean that you want 40,000 records? This kind of gives them an idea for how many records they might be retrieving without you having to go out and do each one of the searches directly, but I agree, the way you did it is actually the most efficient, so there we go. Any questions about this?

Great work!

Okay. [ Class Participant Speaking ].

He was just saying it's impressive to show researchers this table for the organism. So Question 5. It's probably not a question you're ever going to be asked by a patron, but if you're one of those people who likes to teach statistics about how big things are either to make people impressed or surprised this is a very good question. So it's: How many reviewed RefSeq records do we currently have for human proteins? How many provisional human RefSeq proteins are there? Okay?

[ Class Participant Speaking ].

So going to the NCBI homepage. A lot of you are going to RefSeq through the alphabetical list and going to RefSeq. Remember that if you actually know whether you want proteins or nucleotides, you can go into protein or into nucleotides and then you'll get that little tab for what's from RefSeq already. That is available to you too. So if you're like how do I get to RefSeq? I don't see RefSeq listed? I have to get to RefSeq. Don't panic. Just go by your data type and then focus in on RefSeq. But this is fine, yeah. Go ahead.

[ Class Participant Speaking ].

Okay. So the site contents here have release statistics. Okay, so we're going to bring down this text file which will actually display here, okay? So we have the complete directory.

[ Class Participant Speaking ].

That's a good question.

So here is the table and then the proteins table that I can't highlight in that part, but you can see there, so one of the things about the release statistics is they don't release, you know, these databases are constantly being updated but they don't release the release statistics every day the same way, so this set of release statistics was released on March 9, so the likelihood that they've released some additional RefSeq records between March 9 and today I think is highly probable, and so, do you have anything else you want to say about this part of the question before you move on to the second part?

[ Class Participant Speaking ].

Right. You said you made another approach, okay. So what we're going to do is go ahead and do the second approach they used and then we're going to backtrack from their second approach to using what you'll tell me to answer that question a different way is why don't you guys --

[ Class Participant Speaking ].

Okay, well that's okay.

So we'll pull-down all of the databases and go to protein. Okay.

[ Class Participant Speaking ].

So you wanted to retrieve all of the records and see the taxonomy tree and pick the humans from that? Okay, okay.

[ Class Participant Speaking ].

You typed in, sorry. Yeah, I was like what's that SP field she's talking about? I'm like I'm going to learn something today because I didn't know there was an SP field, so we typed in all and then SP.

[ Class Participant Speaking ].

I don't believe so although I've never used this so I can't accept it or refute that. It makes sense.

[ Class Participant Speaking ].

We'll choose homo sapiens here, okay.

[ Class Participant Speaking ].

There's a RefSeq tab here with almost 38 thousand records. So let's help each other out here. If we were trying to figure out which of these RefSeq records were in which status code, where do we think we would be able to see which ones had that status? Do you remember?

[ Class Participant Speaking ].

So let's go into preview index. Nothing is really seamless when it comes to this kind of thing, but preview is a good place to start. So where in preview index did you go after that? So we changed it from all fields to filter, okay?

[ Class Participant Speaking ].

So talk to us about how you build your provisional --

[ Class Participant Speaking ].

So we typed in RefSeq and hit index, okay. So you went to the whole RefSeq set you were saying and added that to your search?

Yeah.

Okay, so -- let me make it go. And then what?

[ Class Participant Speaking ].

Yeah. Oh, I see what you're saying. Well, I have the human up there too and I have, let me take out the all, so I have human and I have RefSeq, okay, which would be what I would kind of want. All right, which goes back to the same number that Arpita had given us, so we still have to get at this issue of how do you describe the status code for RefSeq? Does anybody remember? We kind of talked about this a little bit, dancing around it a couple of times. Again, for like the provisional or whatever.
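The combination being built on screen here (human plus RefSeq) can be sketched as a query string. This is only an illustration: the class used the Preview/Index form, the helper name `build_entrez_query` is hypothetical, and the field tags `Organism` and `Filter` are assumed to be the ones chosen from the pull-down.

```python
def build_entrez_query(*terms):
    """Join (value, field) pairs into one Entrez-style query using AND.

    Hypothetical helper for illustration; it only builds a string and
    does not contact NCBI.
    """
    parts = []
    for value, field in terms:
        # Quote multi-word values so they are treated as a phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{value}[{field}]")
    return " AND ".join(parts)


# Human records that are also in the RefSeq set (assumed field tags).
query = build_entrez_query(("Homo sapiens", "Organism"), ("refseq", "Filter"))
print(query)
```

Adding a third pair for a status term would narrow the set further in the same way, which is the "combine the two" idea discussed in this part of the class.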

[ Class Participant Speaking ].

Okay. So we actually did provisional filter and then it went up here, I think it went there, so that's the number of these that were provisional. So that we move on through this and get through the rest of the questions, let me just say that there is a status, I'm trying to remember which, now I'm kind of like, which field was it in? RefSeq status codes, and there's a line in here so you can actually get the number for each status code, but the way that Brian did it is actually a very good way to do it. It's a fine way to do it as well. I'm just going to, well, potentially if you had done what Arpita had done, you wouldn't need to do that again, but if you hadn't done her step you could do that for sure, so this is actually fine and you can combine the two this way, and what I'm going to do is we'll move through the rest of these and then I'm going to bring up the rest of this. So that was Question 5.

Question 6?

[ Class Participant Speaking ].

Okay, so Question 6 is how many records are there currently in the Entrez structure database which now you could answer differently knowing Arpita's little trick. {LAUGHTER}.

[ Class Participant Speaking ].

Okay. Where did you guys go?

[ Class Participant Speaking ].

Okay, so they went to the alphabetical list and went to the statistics link and Entrez database statistics. Okay. So here is where you learn about the strategy that she already knew about. You also learn about the filter strategy that you were using, so good, okay. So then what did you do after you read this?

[ Class Participant Speaking ].

Okay, so you went to Structure. Does everybody notice what just happened when I clicked on Structure on the black bar there? This is the Structure group's homepage. It's not Entrez Structure; notice it doesn't look like Entrez Structure. You can still search Entrez Structure within it, but not every black bar link --

[ Class Participant Speaking ].

Okay, so let me click on this right above it and it will take us to the structure database. So you went to preview index and then filter and then what did you do?

[ Class Participant Speaking ].

All.

Okay.

So there's 49,974 structure records right now.

[ Class Participant Speaking ].

Okay, so there's another way to do it. You can actually just limit to that field and type "All". That's fine. Okay, great. So there's lots of ways to do that. So who has got Question 8, because we already did Question 7.
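The same "count everything in the database" question could also be asked programmatically. The sketch below is an assumption layered on top of the class, which only used the web interface: it builds an NCBI E-utilities `esearch` URL with `rettype=count` for the `all[filter]` term, and it only constructs the URL (the helper name `count_url` is hypothetical, and no request is sent).

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(db, term):
    """Return an esearch URL that asks only for the number of matching records."""
    params = {"db": db, "term": term, "rettype": "count"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Every record in the structure database, mirroring the all[filter] trick.
url = count_url("structure", "all[filter]")
print(url)
```

Opening that URL in a browser would return a small XML document containing a count. The number changes as the database is updated, so it will not match the figure read out in class.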

What is the official gene symbol for the human cystic fibrosis gene and by what other symbols has it been known? Tell me what you guys did.

[ Class Participant Speaking ].

Okay, so we went to EntrezGene. And what did you search? Can you actually type out cystic fibrosis?

[ Class Participant Speaking ].

Okay, now what?

Okay, so you clicked on the first record, which is the human record, okay.

[ Class Participant Speaking ].

So the official symbol is here, CFTR. And then the "also known as" has the rest of the symbols, and you could also do something very similar in OMIM with that and find most of the variations, but I find that this will even have more of the variant names and they will be more careful about that, so if you really need every variation, it will be good to come here, great. Okay.

Question 9. I'd like to retrieve the human and mouse sequences for the gene and compare them against each other. What is an easy way to do that? Who has Question 9? All right!

[ Class Participant Speaking ].

So we need to get the sequences first that we're going to compare, so tell me where you guys went to get those.

[ Class Participant Speaking ].

So we search EntrezGene for PSEN1, so the first record is human and the second record is mouse. Great, okay.

[ Class Participant Speaking ].

Okay. So you have the numbers for both of these directly from this format of the record and you'll be able to use those numbers in Blast or whatever tool you're likely using to compare them.

[ Class Participant Speaking ].

Uh-huh?

So you guys actually took these sequences or you just took the numbers from these records at this point?

[ Class Participant Speaking ].

Okay, that's fine, and then where did you guys go after you took the numbers?

[ Class Participant Speaking ].

Okay. I guess I should copy and paste these. No, that's fine. I'll just pop back and forth. So you went to the Blast homepage. And "Align two sequences using Blast". Okay. And were they both the kind of sequences that you wanted to use? You know, what kind of sequences were they? Were they protein sequences?

[ Class Participant Speaking ].

Well, not necessarily. I mean, you're comparing [INAUDIBLE].

[ Class Participant Speaking ].

Yeah.

{LAUGHTER}.

Well, the other question is this. Depending on what you were taking, all that stuff about choosing which Blast program we're going to use still applies depending on what kind of sequence you have, so in this case, I'm trying to think -- it would be nucleotide sequence, so it would be okay that you didn't change your parameters, but if you had gone in and found the PSEN1 gene in Entrez Protein and taken the protein sequence for some reason, you would need to change the program to Blast the two proteins.
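The point about matching the Blast program to the kind of sequence can be sketched as a small lookup. This is a simplified illustration using the standard program names; the function name `choose_blast_program` is hypothetical and the table leaves out tblastx, which compares translated nucleotide sequences against each other.

```python
def choose_blast_program(query_type, subject_type):
    """Pick a standard BLAST program from the query and subject sequence types.

    Both arguments are either "nucleotide" or "protein".
    """
    table = {
        ("nucleotide", "nucleotide"): "blastn",  # DNA/RNA against DNA/RNA
        ("protein", "protein"): "blastp",        # protein against protein
        ("nucleotide", "protein"): "blastx",     # translated query against protein
        ("protein", "nucleotide"): "tblastn",    # protein against translated subject
    }
    key = (query_type, subject_type)
    if key not in table:
        raise ValueError(f"Unknown sequence types: {key}")
    return table[key]


# Two nucleotide records, as in the class example.
print(choose_blast_program("nucleotide", "nucleotide"))
# Two protein records, the case where the program would need changing.
print(choose_blast_program("protein", "protein"))
```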

[ Class Participant Speaking ].

That's true, it would help you that way, so you're saying you just pasted these in. I don't think, have they ever seen the output of this before?

[ Class Participant Speaking ].

This morning? Okay.

Right.

Okay. So then we don't actually have to let it do it? The other thing that you can do is, because the mouse and rat gene is in the records, rather than having to go out and do all of that again, the likelihood that you could actually go out to HomoloGene from the human record --

[ Class Participant Speaking ].

Yeah.

But actually in Map Viewer, you can line up --

The mouse map with the human map?

Yeah. But that might not work well for you either. Actually Blast two sequences is probably the best.

Yeah. I think part of it really goes back to interpreting their question. What do they mean by compare? What are they really looking for me to do when they say compare? And clearly our question can't answer that for you. So there's an easy way to do that. Blast two sequences is an easy way to do that. Okay?

Question 10. How many protein chains combine with each other to make the p53 tumor suppressor protein that we looked at yesterday afternoon in the structure record? What is the number for the protein record for each chain? So we went to structure. Let me take off my limits just in case, I don't know what those were anymore.

[ Class Participant Speaking ].

So we open the 1TUP record by clicking on the PDB number for it, okay?

[ Class Participant Speaking ].

Okay, so we have three chains here.

[ Class Participant Speaking ].

Okay so if you mouse over, sorry, the link to the protein.

[ Class Participant Speaking ].

Or you could go back out to the, go to the protein links, okay. And there's your three chains and their numbers. Great.

Question 11. Who has question 11? Great! So question 11 is what is the chromosomal location of the human Huntington disease gene, both the cytogenetic band and the sequence coordinates, and what is the corresponding location in mouse?

So which would you be recommending? You went directly to Map Viewer and looked for the gene. You also went to EntrezGene. Which would you recommend if you were starting out?

[ Class Participant Speaking ].

Was it easier in EntrezGene? Let's do it the EntrezGene way first.

[ Class Participant Speaking ].

So you searched on -- as long as you don't accidentally choose the [INAUDIBLE] you're good, okay? So let me just show that. On the summary screen you can actually see the first question's answer about the chromosome location. So 4p16.3 is the chromosome band location, right? So let me just go back and point that out for one second. So within the whole chromosome that's listed here, the actual sequence numbers are actually here, although going back to what you were saying, the question of whether that's the locus for the entire, that is specific to the gene, right? But let's do that. So you went into Map Viewer, did you go in from the genomic context or from the side bar?

[ Class Participant Speaking ].

So you clicked on the CHTT in Map Viewer? Okay.

So you went in and clicked the download so you could actually see the numbers?

[ Class Participant Speaking ].

Right. That's fine, yeah. That's perfectly fine. So the numbers are here in the from and the to for the sequence location, okay.

[ Class Participant Speaking ].

Went back to where?

[ Class Participant Speaking ].

You limited the search, you went back and limited your search?

[ Class Participant Speaking ].

Which I will do in a minute. One of the things that you can actually do in here is, in your maps and options, you can have the mouse map and the human map here together, side by side, and it will actually draw that comparison for you so you don't have to go back, but let's do it the way that you did it. We're all about pointing out there's six different ways to do that, so while we were here I wanted to say that you can say, show me the available mouse maps, and you can have that mouse map go next to it. Anyway, if you didn't know that, and believe me, most people really don't bother to get to know Map Viewer well enough to realize that, having other strategies up your sleeve is actually really good. So you guys just went back, you said?

[ Class Participant Speaking ].

Right. So let me say some of that for the recording, before we get too far along. So right. So you looked at the mouse record and you see that it's on a different chromosome and the way that it describes the chromosome location is different, and that it's actually telling you that it's at 20 cM, so it is a different unit of measurement than you're used to seeing, so you're saying because that was unexpected or you weren't sure how to deal with that, you went into the record and actually went to Map Viewer again for the mouse. Do you have the location 5 B2 or 5 at the -- so the 5 here of course is just the chromosome being repeated again, so it's either the B2 position or the 20 centimorgan position depending on which scale you're going to use, so now you're in the mouse map and sorry, so you were saying from here, I'm trying to remember what you said from here that you were doing.

[ Class Participant Speaking ].

You're right. This map, I'm sorry, I've got her on a leash.

{LAUGHTER}.

This map is labelled differently because of the way the maps were drawn differently between mouse and human, so a lot of the work was done in mice, genetically at first, so those maps are labelled in centimorgans because it was actually recombination percentage between genes that was studied, which you can't do in humans. And you did a good job of figuring out how to figure that part out, so actually if you do go up to the top, those arrowheads will allow you to move up and down the chromosome that way as well, so if you just click on the arrowheads.

[ Class Participant Speaking ].

You can get to the equivalent information from the HomoloGene page and get back into this same information.

[ Class Participant Speaking ].

So, going back to answering the question, and having it on the recording, when you were looking at the two mouse values presented, the 5 B2 is actually the cytogenetic banding for the mouse and the centimorgan data is actually from another way of doing the mapping, through recombination and distances between those.

So, we did question 12 already. Question 13. What sequence tagged sites exist in the PSEN1 gene involved in Alzheimer's disease?

[ Class Participant Speaking ].

So we'll go into EntrezGene again.

[ Class Participant Speaking ].

So we search for PSEN1. And the human version is the first result. So you scroll down to the marker section. So these all were coming from STS, so the assumption was made that most of these would be STS markers. So when you go to the right hand side bar and click on the UniSTS database, you get a list of references in more of the Entrez output format you're used to.

[ Class Participant Speaking ].

So in Map Viewer, you'll see a UniSTS map as well. Okay.

Question 14. What genes have been mapped to the 3p21.3 band of the human genome?

[ Class Participant Speaking ].

So do you want me to go to the Map Viewer? Actually, it's also right here, so start from the NCBI homepage?

[ Class Participant Speaking ].

And again we'll follow the Map Viewer link under hotspots on the right. It may not always end up being under hotspots on the right because the hotspots do change as they come up with new tools and resources so if you don't see it you can always pop over to the alphabetical site map and find it that way. Okay, so --

[ Class Participant Speaking ].

So you went to the latest build for humans.

[ Class Participant Speaking ].

So you clicked on chromosome 3 here. So another thing you can also do and you can also search on things right up in the top as well but we'll go this way, so you clicked on 3.

[ Class Participant Speaking ].

So you want the ideogram to be the master map, so you can actually just click on this little arrow here that I'm mousing over that says "Make it the master map", and the other way to do that is to go into maps and options and select it to be the master map. I'm just going to move this one this way for a minute and see if it works.

[ Class Participant Speaking ].

Okay, so we entered the band that we were looking for, the 3p21.3, in the region to be shown.

[ Class Participant Speaking ].

So you're saying all of the genes alongside of this? Is that what you're saying?

Yes.

Okay. So I think there's a couple of things about that. You can see that only some of the genes along the side are actually labelled because it's that kind of level-of-viewing related issue that we talked about before. The other piece of it is that in terms of the way that the banding works, you can see that I had typed in 21.3, if you remember that's what I typed in, but it's offering what was from 21.31-33, so the following digits out from the decimal are really part of what falls within that band, so it's not like call numbers where 33 is really different from 3. 33 is within 3, okay? Now, in terms of how you would actually get this whole set of genes back out of here now that you've found them, I'm just trying to think, did the question ask us to do that?
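The "33 is within 3" point about band names can be sketched in code. This is a simplified illustration, not how Map Viewer works internally: the function name `band_within` is hypothetical, and it assumes band names of the form chromosome, arm (p or q), digits, and an optional decimal part, such as 3p21.31.

```python
import re

# Chromosome (number, X or Y), arm (p or q), band digits, optional sub-band digits.
_BAND = re.compile(r"^(\d+|X|Y)([pq])(\d+)(?:\.(\d+))?$")


def band_within(child, parent):
    """Return True if the band `child` falls inside the band `parent`.

    Sub-bands extend the parent's digits, so 3p21.31 lies within 3p21.3.
    """
    c, p = _BAND.match(child), _BAND.match(parent)
    if c is None or p is None:
        raise ValueError("Unrecognized band name")
    # Chromosome and arm must match exactly.
    if c.group(1, 2) != p.group(1, 2):
        return False
    child_digits = c.group(3) + (c.group(4) or "")
    parent_digits = p.group(3) + (p.group(4) or "")
    return child_digits.startswith(parent_digits)


print(band_within("3p21.31", "3p21.3"))
print(band_within("3p22.1", "3p21.3"))
```

So a request for 3p21.3 reasonably returns everything labelled 3p21.31 through 3p21.33, which is what the display offered here.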

[ Class Participant Speaking ].

If someone asks you to do this, I'd make sure, [INAUDIBLE], so that they can look through, because pulling all of these out, again, would take a lot of thought, so there probably are specific things in there, and once you reach this point and you know that you can just scroll up to see the rest of that .3 region, you might put Genes_seq back over to the master map, and that way you can see the descriptions of the genes and have some idea of what the function is, so that if they're looking for a particular function, if you scroll all the way to the right you can see the descriptions.

[ Class Participant Speaking ].

No, this is not the complete list. You have to zoom in farther so there's a lot of genes in that region.

Okay, we're almost there! Question 15. Where can I download the complete genome sequence for E. coli? Okay, so click back on NCBI's homepage. So we'll approach it selecting Entrez taxonomy. So do you actually spell out the E part of it?

[ Class Participant Speaking ].

Okay. And did you check it off or did you go to it?

[ Class Participant Speaking ].

You copied it and pasted it in the search box?

[ Class Participant Speaking ].

Okay.

[ Class Participant Speaking ].

All right, so there's lots of variants and you might want to see if there's a specific variant they're looking for. If they're looking for all of them, what are they trying to do with this?

[ Class Participant Speaking ].

You guys went to protein, is that what you said?

No.

So we went to the links to go to genome.

[ Class Participant Speaking ].

Right. So you can see that you can get chromosome or you can get plasmids, and again it's kind of a question, what would they like?

[ Class Participant Speaking ].

It really does make a big difference which strain they want, because there's a huge amount of variation in E. coli. So, you really would have to be sure what they want, exactly.

[ Class Participant Speaking ].

Okay, question 16. What clinical information can I find for the MLH1 gene involved in colon cancer? Which labs offer genetic tests for that disease? You guys did too? Well no, actually, as I said there were some that had multiple people doing it and so --

[ Class Participant Speaking ].

Yeah, this is good. Remember two experimen, so we went to gene, and we searched on MLH1. Now, you guys did too. Here, yeah, you guys do this.

{LAUGHTER}.

[ Class Participant Speaking ].

Select the human MLH1 gene, okay.

[ Class Participant Speaking ].

So we jump down to the section about phenotypes and clinical information, down here somewhere? Up here? Where was it? There we go. So find-on-page works well for these records, because they get longer, if the thing you are looking for is not in the contents list.

[ Class Participant Speaking ].

Uh-huh, so all the way at the bottom, for the ones that have clinical significance, they usually have a link to GeneTests for the record.

[ Class Participant Speaking ].

Okay, so for this particular kind of cancer, you are able to find testing labs.

[ Class Participant Speaking ].

Okay, so in the links menu on the right here you went to OMIM, okay.

[ Class Participant Speaking ].

It did eventually go, I assume?

Yeah.

[ Class Participant Speaking ].

Right. So the articles under phenotype that we had been looking at before also have OMIM links and would take you into OMIM as well.

[ Class Participant Speaking ].

So it depends on if you're interested in every one of the diseases that relates to this gene or if you're interested in particular ones, so within the OMIM records did you all actually, so you followed the GeneTests link? Did you do that for any particular one? You notice, by the way, there's a couple of these that are gene records and a couple that are disease records, and in OMIM you have both. You have records for the genes and records for the diseases, so sometimes the test will actually be a test that's more clinically oriented to look at the disease, but most of the tests will have a genetic component and be looking at the gene as well. So you looked at the gene tests for that?

[ Class Participant Speaking ].

Okay, the mismatch repair cancer syndrome one?

[ Class Participant Speaking ].

Okay. You'll see that you have some other ones so it kind of goes back to once you've done some discovery, do they want to follow something in particular or do they want you to kind of gather up everything and overwhelm them with the richness of the information that you find. Whatever your normal tendency might be with that particular person, I guess. Okay, great job! Now, so you all answered a bunch of reference questions. Do you feel like now if people asked you reference questions you might have some approach strategies to use?

[ Class Participant Speaking ].

Yeah? Okay.

So for some information, background information, what kind of information?

Clinical.

So for clinical information you went to the reviewed tab?

There's a lot of clinical information in these areas. I know our genetic counselling people like to use this quite a bit, so even though we approached it from looking at the testing you still have a lot of other information in the records, so that's a good point.

[ Class Participant Speaking ].

No. This is an external database that's housed at the University of Washington but had been funded by a bunch of other federal agencies, but because it has that kind of federal support they feel very comfortable linking back and forth to it.

Okay. Question 17, we actually had you not do, because of a couple of things that we're supposed to do in the 15 minutes before we wrap up. One was to talk about the wealth of all of the other kinds of resources. We said several times during this class that not everything you need is available from NCBI, and I know David has given you concrete examples of that, and this is another case: the most frequently used restriction resources. If you remember the restriction enzymes that I think you were shown during Day 1, that kind of help you figure out where to cut and all that, that's not something NCBI typically offers. So for five minutes we're going to kind of pop into these resources, and if you remember this question, about how do I find restriction enzymes or how do I find the restriction map to say where the enzymes should make their cut, you'll be able to use that as an example when we go into some of the resources.

This module is called worldwide database and software directory and basically, really, the link here takes you to kind of more of a big list, I don't even think the link is on here. I don't think, it just says see the five day course, so that doesn't even work anymore.

Basically, there's a couple of them that I think you showed on the first day when we talked about the different types of molecular biology databases, but over and over you heard us talking about EMBL and EBI, and that's probably one of our biggest in all of this. I think somebody asked in their questions that they sent over about multiple sequence alignments, and remember we talked about multiple sequence alignments in tools? And the tool ClustalW? If I can actually find it? Okay, so EMBL or EBI, this is their biocatalog search page and basically you can search all kinds of, where are we in here? We're in the query form. Let me back up a level.

I want to back up a level to give you a short URL that you can write down. So this is just www.ebi.ac.uk, and then this happens to be the databases link. So a lot of the databases that you saw linked in the side bar of gene, remember we said a lot of those weren't from NCBI, a lot of them come from this particular group, and as you get into it, there's databases and then there's tools, and that's the part you really need to come away with. We're very used to thinking about everything as a database but they really do have these two categories of things. They have things that are databases and they have things that are tools, so you'll find them as separate things.

So as you get into them, you can see they consider Blast a tool, you know? Blast is a tool that searches against different databases, so sequence analysis tools, we talked about Clustal. So once you feel like you have a handle on what NCBI has to offer, then it's easy to kind of go out and start branching out and kind of comparing and all of that and what not.

All of the break stuff is here, but we're, you know, we're scheduled to end at 4:00.

Okay, all right, so at any rate, a lot of different tools, you know, are available here. A lot of the tools also come from other places and you'll find them linked from NCBI, like you'll find BIND and you'll find all of those things, --

[ Class Participant Speaking ].

And then I think we also talked about the Nucleic Acids Research database and web server issues. Those two catalogs of resources will really bring you to most of these kinds of resources, and the molecular biology group from MLA that we talked about, they're trying to figure out better ways to make sure everybody can find the best tools, and frankly, sometimes I think the best way that we've approached it is to actually look at the places that have full time people, like the University of Washington, like Stanford, and go and look at what tools they use, so our colleague Mark Minie at the University of Washington has a bio researcher, if I can, I'm not typing well anymore. BioResearcher Toolkit, and I don't really bother to do much in the way of organizing pages anymore because I just look at the tools that he has. But the thing you have to kind of understand when you look at these is they do start off by what kind of biomolecule the person is working with, so that first question of are you working with proteins or working with nucleotides or are you working with structures is really kind of important towards which path you drive yourself down on some of these websites. There are a lot of tools that also cut across, like Blast cuts across various kinds of biomolecules, but that's not the case for all of them.

Do you want to say anything more about the other databases?

Another one of our colleagues has put a site up that I think is really very nice for finding things. He's at the University of Pittsburgh Health Sciences Library. And his database is actually automatically updated now, so he's figured out a way to go in and search various databases to pull out links and automatically put them in, and he's got a very powerful search tool that he can use in his place and it should be here.

[ Class Participant Speaking ].

So now, he can, if you do a search for transcription factors, --

[ Class Participant Speaking ].

Oh, right. Okay. It will pull out many things but it will also give you facets on the side, so you can say, if I'm interested in promoters, there are tools that will deal with promoters. If I'm interested in prokaryotic promoters or prokaryotic transcription factors, here is that. So it breaks the search down in very useful ways. I highly recommend this site. I use it quite often.

So if you search on online resources collection in Google, and look for Pittsburgh, you'll probably find that. So, the last part about this was supposed to be the, well, what are you going to do when you get home kind of thing, and so does anybody have any questions about what you're going to do when you go home with all of this stuff?

[ Class Participant Speaking ].

Let us show you. We aren't going to go through this because I'll show you the module that's the outreach and communication module for the five day course for those people who have invested five days of their life in it and need to go home and like make some things happen, so let me bring that up and you can answer --

One of the things that you'll find is very difficult is living down a reputation as a librarian. Researchers are very reluctant to come to you and ask you a question that they think they know a lot more about than you do. They don't think that a librarian can really teach them anything about these kinds of resources. So the first step is to try and earn a little bit of that trust in this area. So if you can show them things that they don't know already, it's a big first step, so how many of you get molecular biology questions when you're sitting at a reference desk? Occasionally? Often? Occasionally?

[ Class Participant Speaking ].

So that's the first big step is to get out there and say that we can support this kind of information, so go out and give a short class on EntrezGene. Show people what EntrezGene is and how it can work. People don't know about it. It's an easy tool to teach so if you can start teaching classes and showing people these things.

So if this is the content, there's slides on each of these, starting at the beginning with outreach activities, but so you were talking about starting teaching, right?

Yeah.

So there's a slide on workshop development, and there's some sample, you know, things you should think about when you're trying to schedule your courses, things to think about if you're going to go out to somebody's lab and do consultation requests. We didn't actually link up a lot of the teaching examples here because people will change their websites around. But Pitt offers the courses, Washington offers the courses, and if you're really looking for something basic, I'll show you the link to the handout that we have at Cornell. It's under the requested service and the list of library classes we have. We have the handout for the molecular biology related courses. So there's sample handouts, so if you're just trying to get a sense of what kind of handouts have been used by your other colleagues just go to a couple of places and you'll see them in this. There's a list of places where people are active in doing this and all of them teach workshops and all of their outlines and handouts are online, and you can feel free to ask them if you can reuse their stuff, and some will actually say that you can reuse the stuff, so it's all kind of here, you can just use them as examples.

Also, we've formed a collaboration with MIT Libraries to start putting up captured tutorials on these types of resources, so we're going to have a web page that will be called BITS series. All of these tutorials that we're putting up, we're limiting them to be six minutes or less long, so they are going to show a very specific tool and how to use it. So we're actually within about a week of putting up our web pages to get those out there and they will be free for anybody who wants to use them.

So wait, I don't have the URL yet because we're building the web page now to put them up.

The molecular biology page, I'm sure.

We can do that. We're actually doing a poster at MLA also. So we'll have little cards that we'll hand out with the URL. If you're going to MLA, we'll be there.

Right, but the information is on the [INAUDIBLE]. And the slides, I highly encourage you to [INAUDIBLE]. These are the links to some of the places that have the most fully developed sites that you can use.

It will be off the MIT website. We'll have a poster presentation and I'll make sure it gets on the listserv.

[ Class Participant Speaking ].

Okay? Chris is going to hand out handouts, the evaluation for the course. One of the other things that I wanted to mention, even though I have a PhD in genetics, it was still the same way for me, people see that you're a librarian, they don't think that you can teach them anything about their particular area of expertise. So, you have to go out and try and earn the trust of people. So when I started teaching classes on Blast at Harvard, I would get maybe one or two people coming to the class. But I went ahead and taught even with only one or two people because that's how the word gets out. Somebody is going to go to that class and say, "I learned something there that I didn't know before" and they go back and start using that in the lab and people say, "How can you do that?" And they say "The librarian taught me how to do that" so that gets out, and that gets you more people coming to your classes and actually, now we have been invited into the curriculum to teach several of the courses in the curriculum because we have been doing this.

And the other thing I would say is if you're already kind of blessed to be in the curriculum of the graduate students to teach literature searching or you're in there to teach some other thing, if you tack on, you know, 10 minutes about the databases that are linked from PubMed in there, it will open up their eyes to go, okay, they do know something about the other resources in there, and it's a starter, so even if you can't kind of feel like you can walk in the door and say that I want to teach a whole class on Blast, you might be able to kind of say look, from this PubMed record you can see the sequences and there's a lovely Blast link here that's a quick way to do a Blast search that's already done, and those students trying to do the Blast searching for their regular work might go like, I didn't know this about Blast. They sound like they know something, maybe I'll ask them my Blast related question, because they don't necessarily want to ask their Blast related question in their lab to their colleagues to look like they don't know what they're doing either, so the factual tell us is sometimes a motivating factor.

Right, and when you're out teaching a PubMed class, jump into EntrezGene and show them that here are, you know, the GeneRIFs. Here are literature references into function for that gene. Then they start to see what else is out there.

[ Class Participant Speaking ].

{LAUGHTER}.

Because this is a new area for a lot of librarians to support, it can be difficult to get started, but you have the listservs, you have us, as contacts if you have problems or questions. Feel free to get in touch with us.

You're very welcome! You have a great day! What an interesting class!

[ Relay Event Concluded ]