Geometric Methods in Structural Computational Biology
By: Lydia Kavraki
Online: <http://cnx.org/content/col10344/1.6>
This selection and arrangement of content as a collection is copyrighted by Lydia Kavraki.
It is licensed under the Creative Commons Attribution License: http://creativecommons.org/licenses/by/2.0/
Collection structure revised: 2007/06/11
For copyright and attribution information for the modules contained in this collection, see the "Attributions" section at the end of the collection.
Structural Computational Biology: Introduction and
Topics in this Module
Proteins and Their Significance to Biology and Medicine
Proteins are the molecular workhorses of all known biological systems. Among other functions,
they are the motors that cause muscle contraction, the catalysts that drive life-sustaining chemical
processes, and the molecules that hold cells together to form tissues and organs.
The following is a list of a few of the diverse biological processes mediated by proteins:
Proteins called enzymes catalyse vital reactions, such as those involved in metabolism, cellular
reproduction, and gene expression.
Regulatory proteins control the location and timing of gene expression.
Cytokines, hormones, and other signalling proteins transmit information between cells.
Immune system proteins recognize and tag foreign material for attack and removal.
Structural proteins prevent cells from collapsing on themselves, as well as forming large
structures such as hair, nails, and the protective, largely impermeable outer layer of skin. They
also provide a framework along which molecules can be transported within cells.
The estimated number of genes in the human genome has changed dramatically over time.
Each gene encodes one or more distinct proteins, so the total number of distinct proteins in the human body is larger than the number of genes, owing to alternative splicing. Of those, only a small fraction have been isolated and studied to the point that their purpose and mechanism
of activity are well understood. If the functions of, and relationships between, all proteins were fully
understood, we would most likely have a much better understanding of how our bodies work and
what goes wrong in diseases such as cancer, amyotrophic lateral sclerosis, Parkinson's, heart
disease and many others. As a result, protein science is a very active field. As the field has
progressed, computer-aided modeling and simulation of proteins have found their place among the
methods available to researchers.
An amino acid is a simple organic molecule consisting of a basic (hydrogen-accepting) amine
group bound to an acidic (hydrogen-donating) carboxyl group via a single intermediate carbon atom.
Figure 1. An α-amino acid
A generic α-amino acid. The "R" group is variable, and is the only difference between the 20 common amino acids. This form is called a zwitterion, because it carries both a positive and a negative charge. The zwitterionic state results from the amine group (NH2) gaining a hydrogen ion from solution, and the acidic group (COOH) losing one.
During the translation of a gene into a protein, the protein is formed by the sequential joining of
amino acids end-to-end to form a long chain-like molecule, or polymer. A polymer of amino
acids is often referred to as a polypeptide. The genome is capable of coding for 20 different
amino acids whose chemical properties depend on the composition of their side chains ("R" in the
above figure). Thus, to a first approximation, a protein is nothing more than a sequence of these
amino acids (or, more properly, amino acid residues, because both the amine and acid groups lose
their acid/base properties when they are part of a polypeptide). This sequence is called the
primary structure of the protein.
Figure 2. A polypeptide
A generic polypeptide chain. The bonds shown in yellow, which connect separate amino acid residues, are called peptide bonds.
The Wikipedia entry on amino acids provides a more detailed background, including the structure, properties, abbreviations, and genetic codes for each of the 20 common amino acids.
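Because a primary structure is simply an ordered sequence of residues, it maps naturally onto a string or list. The following minimal Python sketch converts a chain written with the standard three-letter residue abbreviations into the equivalent one-letter string. The function name `to_one_letter` and the example peptide are illustrative assumptions, not part of this module.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(residues):
    """Return the primary structure as a one-letter string.

    `residues` is an iterable of three-letter codes, ordered from the
    amine (N) end of the chain to the carboxyl (C) end.
    """
    return "".join(THREE_TO_ONE[name.upper()] for name in residues)


# A made-up pentapeptide, used only as an example.
example_chain = ["Met", "Gly", "Ser", "His", "Trp"]
primary_structure = to_one_letter(example_chain)
```

A plain string is enough for sequence-level work; anything involving three-dimensional shape needs the atom-level representations discussed later.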
The primary structure of a protein is easily obtainable from its corresponding gene sequence, as
well as by experimental manipulation. Unfortunately, the primary structure is only indirectly
related to the protein's function. In order to work properly, a protein must fold to form a specific
three-dimensional shape, called its native structure or native conformation. The three-
dimensional structure of a protein is usually understood in a hierarchical manner. Secondary
structure refers to folding in a small part of the protein that forms a characteristic shape. The
most common secondary structure elements are α-helices and β-sheets, one or both of which are
present in almost all natural proteins.
Figure 3. Secondary Structure: α-helix
α-helices, rendered three different ways. Left is a typical cartoon rendering, in which the helix is depicted as a cylinder. Center shows a trace of the backbone of the protein. Right shows a space-filling model of the helix, and is the only rendering that shows all atoms (including those on side chains).
(a) Different parts of the polypeptide strand align with each other to form a β-sheet. This β-sheet is anti-parallel, because adjacent segments of the protein run in opposite directions.
(b) β-sheets are sometimes referred to as β-pleated sheets, because of the regular zig-zag of the strands evident in this representation.
(c) Bond representation. Each segment in this representation represents a bond. Unlike the other two representations, side chains are illustrated. Note the alignment of oxygen atoms (red) toward nitrogen atoms (blue) on adjacent strands. This alignment is due to hydrogen bonding, the primary interaction involved in stabilizing secondary structure.
Figure 4. Secondary Structure: β-sheet
Beta-sheets represented in three different rendering modes: cartoon, ribbon, and bond representations.
Tertiary structure refers to structural elements formed by bringing more distant parts of a chain
together into structural domains. The spatial arrangement of these domains with respect to each
other is also considered part of the tertiary structure. Finally, many proteins consist of more than
one polypeptide folded together, and the spatial relationship between these separate polypeptide
chains is called the quaternary structure. It is important to note that the native conformation of a
protein is a direct consequence of its primary sequence and its chemical environment, which for
most proteins is either aqueous solution with a biological pH (roughly neutral) or the oily interior
of a cell membrane. Nevertheless, no reliable computational method exists to predict the native
structure from the amino acid sequence, and this is a topic of ongoing research. Thus, in order to
find the native structure of a protein, experimental techniques are deployed. The most common
approaches are outlined in the next section.
Experimental Methods for Protein Structure Determination
A structure of a protein is a three-dimensional arrangement of the atoms such that the integrity of
the molecule (its connectivity) is maintained. The goal of a protein structure determination
experiment is to find a set of three-dimensional (x, y, z) coordinates for each atom of the molecule
in some natural state. Of particular interest is the native structure, that is, the structure assumed by
the protein under its biological conditions, as well as structures assumed by the protein when in
the process of interacting with other molecules. Brief sketches of the major structure
determination methods follow:
The most commonly used and usually highest-resolution method of structure determination is x-
ray crystallography. To obtain structures by this method, laboratory biochemists obtain a very
pure, crystalline sample of a protein. X-rays are then passed through the sample, in which they are
diffracted by the electrons of each atom of the protein. The diffraction pattern is recorded, and can
be used to reconstruct the three-dimensional pattern of electron density, and therefore, within
some error, the location of each atom. A high-resolution crystal structure has a resolution on the
order of 1 to 2 Angstroms (Å). One Angstrom (10^-10 meter, or one hundred-millionth of a
centimeter) is roughly the diameter of a hydrogen atom.
Unlike other structure determination methods, with x-ray crystallography, there is no fundamental
limit on the size of the molecule or complex to be studied. However, in order for the method to
work, a pure, crystalline sample of the protein must be obtained. For many proteins, including
many membrane-bound receptors, this is not possible. In addition, a single x-ray diffraction
experiment provides only static information - that is, it provides only information about the native
structure of the protein under the particular experimental conditions used. As we will see later,
proteins are often flexible, dynamic objects when in their natural state in solution, so a single
structure, while useful, may not tell the full story.
Nuclear Magnetic Resonance (NMR) spectroscopy has recently come into its own as a protein
structure determination method. In an NMR experiment, a very strong magnetic field is
transiently applied to a sample of the protein being studied, forcing any magnetic atomic nuclei
into alignment. The signal given off by a nucleus as it returns to an unaligned state is
characteristic of its chemical environment. Information about the atoms within two chemical
bonds of the resonating nucleus can be deduced, and, more importantly, information about which
atoms are spatially near each other can also be found. The latter information leads to a large
system of distance constraints between the atoms of the protein, which can then be solved to find a
three-dimensional structure. Resolution of NMR structures is variable and depends strongly on the
flexibility of the protein. Because NMR is performed on proteins in solution, they are free to
undergo spatial rearrangements, so for flexible parts of the protein, there may be more than
one detectable structure. In fact, NMR structures are generally reported as ensembles of 20-50
distinct structures. This makes NMR the only structure determination technique suited to
elucidating the behavior of intrinsically unstructured proteins, that is, proteins that lack a well-
defined tertiary structure. The reported ensemble may also provide insight into the dynamics of
the protein, that is, the ways in which it tends to move.
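To make the idea of a system of distance constraints concrete, the following Python sketch checks a candidate set of coordinates against a list of constraints. It assumes, for simplicity, that every constraint is an upper bound on the distance between two atoms; real NMR data are richer than this, and the sketch only tests a structure rather than solving for one. All names, coordinates, and bounds are made up for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def violated_restraints(coords, restraints, tolerance=0.0):
    """Return the restraints that a candidate structure fails to satisfy.

    coords: dict mapping an atom label to its (x, y, z) position in Å
    restraints: list of (atom_i, atom_j, upper_bound) tuples, in Å
    tolerance: slack allowed before a restraint counts as violated
    """
    failures = []
    for atom_i, atom_j, upper_bound in restraints:
        d = distance(coords[atom_i], coords[atom_j])
        if d > upper_bound + tolerance:
            failures.append((atom_i, atom_j, upper_bound, d))
    return failures


# Made-up coordinates and bounds, for illustration only.
coords = {"H1": (0.0, 0.0, 0.0), "H2": (3.0, 0.0, 0.0), "H3": (0.0, 8.0, 0.0)}
restraints = [("H1", "H2", 5.0), ("H1", "H3", 5.0)]
failures = violated_restraints(coords, restraints)
```

A structure calculation would search for coordinates that leave this list empty (or nearly so); several distinct solutions of that search correspond to the ensembles mentioned above.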
NMR structure determination is generally limited to proteins smaller than 25-30 kilodaltons
(kDa), because the signals from different atoms start to overlap and become difficult to resolve in
that range. Additionally, the proteins must be soluble in concentrations of 0.2-0.5 mM without
aggregation or precipitation.
Electron diffraction (ED) works on the same principle as x-ray crystallography, but instead of
x-rays, electrons are used to probe the structure. Because of difficulties in obtaining and interpreting
electron diffraction data, it is rarely used for protein structure determination. Nevertheless, ED
structures do exist in the PDB. For more on ED, see this Wikipedia article.
Structure Prediction of Large Complexes
Large macromolecular complexes and molecular machines present a particular challenge in
structure determination. These objects are generally too large to be crystallized and too complex
to solve by NMR, so determining their structure usually requires high-resolution
microscopy combined with computational refinement and analysis. The main techniques used are
cryo-electron microscopy (Cryo-EM) and standard light microscopy.
Protein Structure Repositories
Most of the protein structures determined to date can be found in a large repository called
the RCSB Protein Data Bank (PDB). The PDB is a public-domain repository that contains experimentally determined three-dimensional structures of proteins. The
majority of the proteins in the PDB have been determined by x-ray crystallography, but the
number of proteins determined using NMR methods has been increasing as efficient
computational techniques to derive structures from NMR data have been developed. A few
electron diffraction structures are also available. The PDB was originally established at
Brookhaven National Laboratory in October, 1971, with 7 structures. Currently, the database is
maintained by Rutgers University, the State University of New Jersey, the San Diego
Supercomputer Center at the University of California, San Diego, and the National Institute of
Standards and Technology. The current number of proteins (and/or nucleic acids) in the PDB
database is displayed at the top-right corner of the main PDB page. The imaging method statistics
of these structures (i.e., which methods were used for what fraction of the structures), as well as
other classifications, can be found here. The European Bioinformatics Institute Macromolecular Structure Database group (UK) and the Institute for Protein Research at Osaka University (Japan)
are international contributors to the contents of the PDB.
Visualizing Protein Structures
Numerous tools are available for visualizing the structures stored in the PDB and other
repositories. Most such tools allow a detailed examination of the molecule in a variety of
rendering modes. For example, sometimes it may be useful to have a detailed image of the surface
of the molecule as experienced by a molecule of water. For other purposes, a simple, cartoonish
representation of the major structural features may be sufficient.
A Few Molecular Visualization Programs
Visual Molecular Dynamics (VMD) was originally developed for viewing molecular simulation trajectories. It is a very powerful, full-featured, and customizable molecular viewing
package. Customization is available using Tcl/Tk scripting. Information on Tcl/Tk scripting
can be found at this Tcl/Tk website.
PyMol is an open-source molecular viewer that can be used to generate professional-looking images. PyMol is highly customizable through the Python scripting language.
Protein Explorer is an easy-to-use, web browser-based visualization tool that requires the MDL Chime plugin.
Because Chime only works under Windows and Macintosh OS, the use of Protein Explorer is
restricted to those platforms.
JMol is a Java-based molecular viewer. In applet form, it can be downloaded on-the-fly to view structures from the web. A stand-alone version also exists, which can be used independently of
a web browser.
Chimera is a powerful visualizer and analysis tool that can be comfortably used with very large molecular complexes. It can also produce very high-quality images for use in
presentations and publications.
Visualizing HLA-AW with VMD
What follows will be a very brief introduction to what can be done with VMD. Only the most
basic viewing functionality will be discussed. For a complete description of the capabilities of
VMD and how to use them, please refer to the VMD web site.
In this section, a human leukocyte-associated antigen, HLA-AW (PDB structure ID 2HLA), will
be shown under various rendering methods in VMD. This section is intended to convey, first, a
general idea of the types of visual representations that are available for protein structures, and
second, what information is and is not conveyed by each representation.
VMD allows the user to load and view molecule description files in a wide variety of common
formats, including trajectory files with multiple structures of the same molecule, such as might be
generated by a simulation. Once the molecules are loaded, the way each molecule is rendered may
be controlled using the Graphical Representations menu:
(a) Representations menu. This menu allows the user to control in detail how each molecule is rendered.
(b) Coloring schemes to highlight features of interest.
(c) Rendering methods in VMD. Which one to use depends on the features to be highlighted.
The built-in rendering options of VMD.
Molecules may be displayed by various rendering modes:
Figure 6. HLA-AW. Drawing method: LINES. Coloring method: NAME
In this representation, each line represents a bond between two atoms. The color of each half-bond corresponds to the element of the atom at the corresponding end of the bond (red for oxygen, blue for nitrogen, yellow for sulfur, and teal for carbon). Line
representation gives a clear idea of the molecule's connectivity, but for large molecules it can be difficult to isolate protein substructures.
Figure 7. HLA-AW. Drawing method: VDW. Coloring method: NAME
Here each atom is represented by a sphere whose radius is the Van der Waals radius of the atom. The Van der Waals radius is half the separation of unbonded atoms packed as tightly as possible, and provides a rough notion of a collision radius, although it is not a firm barrier. This representation of the molecule gives a rough sense of its shape, and is sometimes called a space-filling model.
Figure 8. HLA-AW. Drawing method: VDW. Coloring method: CHAIN
This rendering is the same as in the previous figure, except that now the atoms are colored based on which polypeptide chain they belong to. HLA-AW consists of two chains, the alpha chain (blue), which folds into three domains and the smaller β2 microglobulin (red), which is a component of a whole class of HLA proteins. Coloring by chain allows an inspection of how the polypeptide subunits come together to form the whole quaternary structure of the protein. The black balls are water molecules near the surface of the protein that always appear in the same place in crystal structures, and may therefore be considered part of the structure for some applications.
Figure 9. HLA-AW. Drawing method: SURF. Coloring method: CHAIN
The Surf drawing mode renders a surface swept out by a sphere of some set size skimming the protein. Usually, this size is
approximately that of a water molecule, in which case the rendered surface is very similar to the solvent-accessible surface. Note that it is impossible to deduce the connectivity of the atoms from this image or from the space filling image in the previous figure.
Overall shape, rather than connectivity, is the information conveyed by these representations. Hence, both backbone-based and surface-based renderings are necessary to fully understand a protein's structure.
Figure 10. HLA-AW. Drawing method: SURF. Coloring method: CHAIN
Here the protein has been rotated approximately 90 degrees toward the viewer, so that, compared to the previous image, we are looking down from above. The deep groove running from the top left to lower right is the binding pocket of the protein.
Figure 11. HLA-AW. Drawing method: CARTOON. Coloring method: CHAIN
Cartoon rendering places an emphasis on secondary structure. Beta sheets appear as flattened arrows, and alpha helices appear as cylinders. These are common conventions in representing protein secondary structure. By examining this image, we can see that the walls of the binding pocket observed in the previous figure consist of alpha helices, and the floor is an anti-parallel beta sheet. In anti-parallel beta sheets, adjacent strands run in the opposite direction (notice the arrow points alternate in direction). Note that this representation only conveys information about the backbone connectivity of the protein. Side chain atoms are omitted, and therefore the overall shape is only a very coarse approximation.
Figure 12. HLA-AW. Drawing method: SURF. Coloring method: RESTYPE
Alternative coloring methods can provide additional insight into a protein's structure and function. Here each atom is colored based on whether the side chain of the amino acid residue to which it belongs is acidic (red), basic (blue), polar neutral (green), or apolar (gray). Note that residues on the surface of the protein tend to be hydrophilic (attracted to water, in red, blue, and green), whereas residues closer to the core of the protein tend to be hydrophobic (greasy or water repellent, in gray). This is characteristic of proteins that exist in aqueous solution in nature. Their native structure is stabilized by a tendency for the hydrophilic residues to interact with the solvent water molecules, while the hydrophobic residues are driven together away from the solvent. Clusters of hydrophobic residues on the surface often indicate a location that is usually protected from solvent in the natural state, either by interaction with another molecule or by part of the protein itself.
Visualizing HLA-AW with Protein Explorer
Protein Explorer is designed as a user-friendly but fairly full-featured visualizer. It is not as
scriptable or as powerful as some other visualizers such as VMD and PyMol, but it is one of the
quickest and easiest to get started with. It is used through a web browser, either by accessing it
through the Protein Explorer website (via the Quick-Start Protein Explorer link), or as an offline version, downloadable from this page. Both versions require the MDL Chime molecular viewing plugin, which you can download from here (registration required).
As with VMD above, a human leukocyte-associated antigen, HLA-AW (PDB structure ID 2HLA),
will be shown in various renditions.
Upon opening, Protein Explorer will load a default molecule and display it (this feature may be
disabled via a setting under "preferences" in the lower left frame):
Figure 13. Protein Explorer at Startup
The interface contains three areas. The frame on the right contains the rendering window, where the molecule is displayed. The lower left frame contains an input box for text commands and a text box that displays general text output from the program: What commands have been executed, what the program is currently doing, etc. The top left frame generally contains the user interface in the form of buttons and links. Its exact contents vary with use.
Clicking on the "PE Site Map" link pops up a window containing Protein Explorer's top-level options:
Figure 14. Protein Explorer Site Map Window
Each option contains a helpful tooltip which can be seen by hovering the mouse cursor over it. "New Molecule" allows the user to load a molecule either directly from the PDB or from the local filesystem. "Reset Session" returns to the default view and rendering style, which can be a useful shortcut. "Quick Views" opens up a menu from which the user can select how the molecule is rendered.
Once a molecule is loaded, the "Quick Views" menu allows the user to control how it is displayed:
Figure 15. Protein Explorer QuickViews Interface
The "SELECT" pulldown menu allows the user to pick a group of atoms based on their properties, their location, the structural elements in which they are involved, or by directly clicking them. The "DISPLAY" pulldown menu then allows the user to determine the style in which the selected atoms are rendered. Most of the styles available through VMD are also available in Protein Explorer.
The "COLOR" pulldown menu allows the user to determine how the atoms are colored. Options include coloring by secondary structure elements, atom type, subunit (chain), a spectrum from end to end of the protein, and by properties such as charge and polarity.
Figure 16. Protein Explorer: HLA-AW Backbone Rendering
This rendering mode shows the protein backbone (no side chains) through the alpha carbons of each amino acid residue. It gives the user a sense of how the chains fold to form the structure, but not its full shape, since all side chain atoms have been removed. The yellow bars are disulfide bonds, which are covalent bonds that lock distant parts of the chain together to help maintain the structure.
Figure 17. Protein Explorer: HLA-AW Cartoon Style
Cartoon rendering works as for VMD. As in the backbone rendering above, side chains are ignored, and the protein backbone is rendered as a smoothly curving tube. Beta sheets appear as flattened arrows, and alpha helices appear as spiraling ribbons.
Figure 18. Protein Explorer Advanced Explorer Menu
More advanced rendering methods are available through the Advanced Explorer Menu.
Figure 19. Protein Explorer Surfaces Menu
The Surfaces menu allows the user to display the surface of the protein. Several variables are available, including the radius of the probe used to define the surface, as well as several methods of coloring the surface based on chemical and physical properties.
Figure 20. Protein Explorer: HLA-AW Surface Rendering
This rendering style shows the surface of the protein accessible to water. This image is tilted 90 degrees toward the viewer from the previous images.
Figure 21. Protein Explorer: HLA-AW Superimposed Images
By setting the surface to be transparent, it is possible to superimpose another rendering style over it, and see how it fits into the surface. This can convey an idea of how the fold of the chain relates to the overall three-dimensional shape of the protein.
Recommended Reading and Resources:
A detailed introduction to protein structure and function can be found in most introductory
biochemistry textbooks. For example, Lehninger Principles of Biochemistry, 4th Edition, by D.
L. Nelson and M. Cox (sections 2.1, 3.1-3.5, 4.1-4.4, 5.1-5.3).
The Structures of Life at the NIH web site. This site is an introduction to protein structure, structure determination methods, drug design techniques, and other applications of structural biology.
Protein Structure and Function, by Gregory A. Petsko and Dagmar Ringe. This book provides
an overview of the basic biochemistry of structural biology. Topics covered include protein
structure, mechanisms of protein function, regulation of protein function, and case studies of
the kinds of problems that arise in structural biology.
The MIT Biology Hypertextbook. This online textbook provides introductory level coverage of the field of microbiology. It includes cell biology, protein biochemistry, genetics,
metabolism, and molecular biology. New content is typically added over time.
Artificial Intelligence and Molecular Biology. This online book includes chapters on classifying protein structures, predicting protein structure, and analyzing crystallographic and
NMR data to determine protein structure. Of particular interest to readers of the current page
who have a computer science background but need to understand more of the basic underlying
Representing Proteins in Silico and Protein Forward
Topics in this Module
Modeling Proteins on a Computer
In order to construct efficient, maintainable software to deal with and manipulate protein
structures, a suitable way to store these structures has to be adopted. Depending on the ultimate
application, different representations may have advantages and disadvantages from a software
perspective. For example, when designing a simple visualization program, the Cartesian (x,y,z)
coordinates of each atom are useful and simple to render on the screen. However, if the program is
to manipulate bond angles and bond lengths for example, a representation based on the internal
degrees of freedom (see below) may be more appropriate. Some applications may even need to
store more than one representation at a time; for example a simulation program that needs to
compute a protein's Potential Energy, which is a function of both Cartesian and Internal
coordinates, would benefit from keeping both representations at the same time.
The structure of a protein is the set of atoms it contains, and the bonds that join them, that is, its
inherent connectivity. A particular geometric shape of a protein (that is, the spatial arrangement of
the atoms in the molecule) is called its conformation. Thus, a given protein structure can have
many different conformations. Next, we discuss the two most common ways to model protein
structures and conformations for software applications: Cartesian and Dihedral representations.
Cartesian Representation of Protein Conformations
The most essential information for modeling a protein structure is the relative position of each
atom, given as (x,y,z) Cartesian coordinates. Popular imaging methods such as X-Ray
Crystallography, Nuclear Magnetic Resonance (NMR) and Cryogenic Electron Microscopy (Cryo-
EM) are used to experimentally obtain relative atom positions from protein crystals and solutions.
This is precisely the information provided by Protein Databank (PDB) format coordinate files:
Figure 22. First 19 atom coordinate records of PDB entry 2HLA
The third column lists the atom type and the seventh, eighth, and ninth columns contain the x, y, and z coordinates of each atom.
These Cartesian coordinates are given in relation to some reference frame determined by the experimental imaging technique; the choice of frame is not important. The conformation is uniquely specified by the relative positioning of the atoms.
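As a rough illustration of how such coordinate records can be read into a program, the following Python sketch parses ATOM and HETATM lines by their fixed column positions in the PDB format (the fields are fixed-width, so slicing by position is more reliable than splitting on whitespace). The `Atom` type, the function names, and the sample records are assumptions made for this example; the coordinates are invented and are not taken from entry 2HLA.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-width ATOM record (hypothetical helper for this demo)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}"
            f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line):
    """Parse one ATOM/HETATM record using the PDB fixed column positions."""
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain=line[21],                # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


def parse_atoms(lines):
    """Keep only coordinate records and parse each of them."""
    return [parse_atom_line(line) for line in lines
            if line.startswith(("ATOM", "HETATM"))]


# Invented records for illustration; real files contain many more fields.
sample = [
    "REMARK made-up records, not taken from 2HLA",
    make_atom_line(1, "N", "GLY", "A", 1, 50.365, 46.088, 17.493),
    make_atom_line(2, "CA", "GLY", "A", 1, 51.012, 47.402, 17.118),
]
atoms = parse_atoms(sample)
```

Storing the residue name and chain alongside each position keeps enough information to reconstruct connectivity later.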
The coordinates and type of each atom, together with the amino acid type they belong to, are
sufficient information to reconstruct the connectivity (bonding) of a protein, and therefore
sufficient to render an image of the protein. If one wishes to allow the protein to move in a
realistic fashion, however, more information may be necessary.
The Internal Degrees of Freedom of a Protein
The degrees of freedom of a system are a set of parameters that may be varied independently to
define the state of the system. For example, the location of a point in the Cartesian 2D plane may
be defined as a displacement along the x-axis and a displacement along the y-axis, given as a (x,y)
pair. It may also be given as a rotation about the origin by θ degrees and a distance r from the
origin, given as a (r,θ) pair. In either case, a point moving freely in a plane has exactly two
degrees of freedom.
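The equivalence of the two parameterizations can be checked with a short Python sketch. The function names are illustrative; the angle is measured in degrees, counter-clockwise from the positive x-axis.

```python
import math


def cartesian_to_polar(x, y):
    """Convert (x, y) to (r, theta), with theta in degrees."""
    r = math.hypot(x, y)
    theta = math.degrees(math.atan2(y, x))
    return r, theta


def polar_to_cartesian(r, theta):
    """Convert (r, theta in degrees) back to (x, y)."""
    t = math.radians(theta)
    return r * math.cos(t), r * math.sin(t)


# Round trip: two numbers in, two numbers out, same point.
r, theta = cartesian_to_polar(1.0, 1.0)
x, y = polar_to_cartesian(r, theta)
```

Either pair of numbers identifies the same point, which is why the choice between them is a matter of convenience and not of information content.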
As mentioned before, the spatial arrangement of the atoms in a protein constitutes its
conformation. In the PDB coordinate file above, we can see that one obvious way to define a
protein conformation is by giving x, y, and z coordinates for each atom, relative to some arbitrary
origin. These are not independent degrees of freedom, however, because atoms within a molecule
are not allowed to leave the vicinity of their neighboring atoms (if no chemical reaction takes
place). Pairs of atoms bonded to each other, for example, are constrained to remain close, so
moving one atom causes others connected to it to move in a dependent fashion. In kinematics
terminology, this means that the number of true, independent degrees of freedom is
much smaller than the number of input parameters (an (x,y,z) tuple for each atom). The remainder of this
section defines a set of independent degrees of freedom that more readily model how proteins and
other organic molecules can actually move.
Bonds and Bond Length
The atoms in proteins are connected to one another through covalent bonds. Each pair of bonded
atoms has a preferred separation distance called the bond length. The bond length can vary
slightly with a spring-like vibration, and is thus a degree of freedom, but realistic variations in
bond length are so small that most simulations assume it is fixed for any pair of atoms. This is a
very common assumption in the literature and reduces the effective degrees of freedom of a
protein; the remainder of this module makes this assumption.
Although bond lengths will not be allowed to vary in this work, the presence of bonds is still
important because it allows us to represent the connectivity of the protein as an undirected graph
data structure, where the atoms are the nodes and the bonds between them are undirected edges. In
some cases, it is helpful to artificially break any cycles in the graph, and choose an atom from the
interior as an anchor atom. The graph can then be treated as a tree data structure, with the anchor
atom as the root.
Figure 23. A Protein as a Graph Data Structure
A tree-like representation of protein connectivity, for a very small molecule. Cycles are broken by ignoring one bond in each.
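One possible way to realize this construction is sketched below in Python: the bonds are stored as an undirected adjacency map, and a breadth-first traversal from the anchor atom yields a tree, with any bond that would close a cycle set aside. The function names and the bond list (loosely modelled on the heavy atoms of a proline residue, whose side chain forms a ring) are assumptions for illustration; breadth-first search is just one convenient choice of traversal.

```python
from collections import deque


def build_graph(bonds):
    """Undirected graph: each atom maps to the set of atoms bonded to it."""
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def spanning_tree(graph, anchor):
    """Breadth-first traversal from the anchor atom.

    Returns (parent, ignored), where parent maps each atom to its parent
    in the tree (None for the anchor) and ignored is the set of bonds
    left out of the tree in order to break cycles.
    """
    parent = {anchor: None}
    tree_edges = set()
    queue = deque([anchor])
    while queue:
        atom = queue.popleft()
        for neighbor in sorted(graph[atom]):
            if neighbor not in parent:
                parent[neighbor] = atom
                tree_edges.add(frozenset((atom, neighbor)))
                queue.append(neighbor)
    all_edges = {frozenset((a, b)) for a in graph for b in graph[a]}
    return parent, all_edges - tree_edges


# Illustrative bond list containing one five-membered ring.
bonds = [("N", "CA"), ("CA", "CB"), ("CB", "CG"), ("CG", "CD"),
         ("CD", "N"), ("CA", "C"), ("C", "O")]
graph = build_graph(bonds)
parent, ignored = spanning_tree(graph, anchor="CA")
```

With seven atoms and seven bonds there is exactly one cycle, so exactly one bond ends up in `ignored`; following `parent` links from any atom leads back to the anchor.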
Bond length is an independent degree of freedom given two connected atoms. A set of three atoms
bonded in sequence defines another degree of freedom: the angle between the two adjacent bonds.
This is, appropriately, referred to as the bond angle. The bond angle can be calculated as the angle
between the two vectors corresponding to the bonds from the central atom to each of its neighbors.
As a reminder, the angle between two vectors is the inverse cosine of the ratio of the dot product
of the vectors to the product of their lengths. Like bond lengths, bond angles tend to be
characteristic of the atom types involved, and, with few exceptions, vary little. Thus, like bond
lengths, this module considers all bond angles as fixed (again, this is a common assumption).
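The bond angle calculation described above can be written as a short Python sketch. The function name and the test coordinates are illustrative; the clamp guards against floating-point values that fall marginally outside the domain of the inverse cosine.

```python
import math


def bond_angle(a, b, c):
    """Angle in degrees at the central atom b, between bonds b-a and b-c."""
    u = [p - q for p, q in zip(a, b)]  # vector from b to a
    v = [p - q for p, q in zip(c, b)]  # vector from b to c
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(p * p for p in v))
    # Clamp to [-1, 1] so rounding error cannot break acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))


# Two perpendicular bonds meeting at the origin.
angle = bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

The result depends only on the directions of the two bonds, not on their lengths, so the same function works whatever bond lengths are assumed.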
In most organic molecules, including proteins, the most important internal degree of freedom is
rotation about dihedral (torsional) angles. A dihedral angle is defined by four consecutively