An Introduction to Molecular Evolutionary Analysis with NCBI Datasets and Python

On July 21, 2022, the NCBI Education Team presented a workshop titled An Introduction to Molecular Evolutionary Analysis with NCBI Datasets and Python. Training materials from this event are available on this page.

As species diverge from a common ancestor, they accumulate differences in their DNA sequences. Differences within a protein-coding region fall into two types: non-synonymous substitutions change the amino acid sequence of the protein, while synonymous substitutions do not. Synonymous substitutions are largely invisible to natural selection and tend to accumulate at a roughly constant rate. Non-synonymous substitutions, by contrast, accumulate faster when their effects are beneficial and are suppressed when their effects are deleterious. By comparing the rates of non-synonymous and synonymous substitutions, we can infer whether natural selection has primarily acted to conserve the protein sequence or to adapt it to a new environment or function.
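To make the distinction concrete, here is a minimal sketch in plain Python that classifies codon differences between two already-aligned, in-frame coding sequences. It is not the workshop's own code: the function name `count_substitutions` and the example sequences are made up for illustration, and the approach is a crude per-codon count rather than a proper rate estimate (it does not normalize by the number of synonymous and non-synonymous sites or correct for multiple substitutions).

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(cds_a, cds_b):
    """Count codons that differ synonymously and non-synonymously.

    Hypothetical helper: assumes both sequences are aligned, in frame,
    and of equal length. Codons with gaps or ambiguous bases are skipped.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")

    synonymous = 0
    non_synonymous = 0
    for start in range(0, len(cds_a), 3):
        codon_a = cds_a[start:start + 3].upper()
        codon_b = cds_b[start:start + 3].upper()
        if codon_a == codon_b:
            continue  # no change at this codon
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: cannot classify
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1  # same amino acid, different codon
        else:
            non_synonymous += 1  # amino acid changed
    return synonymous, non_synonymous


# Made-up example: GCT -> GCC keeps alanine, AAA -> AGA changes lysine to arginine.
print(count_substitutions("ATGGCTAAATTT", "ATGGCCAGATTT"))  # (1, 1)
```

A real analysis would turn these raw counts into per-site rates before comparing them, since most possible single-base changes in a codon are non-synonymous.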

In this workshop you will learn to compare the protein-coding sequences of two species to estimate which proteins show signs of adaptation. Working in a Jupyter notebook with bash and Python, you will use the NCBI Datasets command line interface (CLI) to download sequence data, then perform analysis with a few popular Python packages. The workshop assumes basic familiarity with a scripting language such as Python or R at a level equivalent to a semester course or programming bootcamp.

In this workshop you will learn how to:

  • Search for and download protein ortholog sequences with the NCBI Datasets CLI.
  • Parse the downloaded files with Biopython.
  • Identify synonymous and non-synonymous substitutions and calculate substitution rates.
  • Plot the results with Matplotlib.
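The parsing and per-gene summary steps in the list above can be sketched roughly as follows. This is an illustration using only the Python standard library, not the workshop notebook: the workshop itself uses Biopython for parsing (for example `SeqIO.parse(handle, "fasta")`), and the `GENE|species` header convention, the gene names, and the sequences here are all hypothetical. The summary value is simply the fraction of codons that differ, a stand-in for the substitution rates computed in the workshop.

```python
import io

# Hypothetical FASTA text standing in for downloaded ortholog sequences.
EXAMPLE_FASTA = """\
>GENE1|human
ATGGCTAAATTT
>GENE1|mouse
ATGGCCAGATTT
>GENE2|human
ATGCCCGGG
>GENE2|mouse
ATGCCCGGG
"""


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def codon_difference_fraction(cds_a, cds_b):
    """Return the fraction of aligned codons that differ (hypothetical helper)."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    starts = range(0, len(cds_a), 3)
    differing = sum(cds_a[i:i + 3] != cds_b[i:i + 3] for i in starts)
    return differing / len(starts)


# Group sequences by gene, then compare the two species for each gene.
sequences_by_gene = {}
for header, sequence in read_fasta(io.StringIO(EXAMPLE_FASTA)):
    gene, species = header.split("|")
    sequences_by_gene.setdefault(gene, {})[species] = sequence

fractions = {
    gene: codon_difference_fraction(by_species["human"], by_species["mouse"])
    for gene, by_species in sequences_by_gene.items()
}
print(fractions)  # {'GENE1': 0.5, 'GENE2': 0.0}
```

A per-gene dictionary like `fractions` is a convenient shape for plotting, since its keys and values can be passed directly to a bar or scatter plot.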

NOTE: This workshop is designed for people with a good understanding of molecular biology and some scripting experience.

Webinar Materials

Jupyter Notebook
(Note: It may take a couple of minutes to load this page the first time.)

Last Reviewed: August 23, 2022