Geometric Methods in Structural Computational Biology

By: Lydia Kavraki


This selection and arrangement of content as a collection is copyrighted by Lydia Kavraki.

It is licensed under the Creative Commons Attribution License:

Collection structure revised: 2007/06/11

For copyright and attribution information for the modules contained in this collection, see the " Attributions" section at the end of the collection.



Table of Contents

Structural Computational Biology: Introduction and Background


Proteins and Their Significance to Biology and Medicine

Protein Structure

Experimental Methods for Protein Structure Determination

X-ray Crystallography


Electron Diffraction

Structure Prediction of Large Complexes

Protein Structure Repositories

Visualizing Protein Structures

Visualizing HLA-AW with VMD

Visualizing HLA-AW with Protein Explorer

Representing Proteins in Silico and Protein Forward Kinematics


Modeling Proteins on a Computer

Cartesian Representation of Protein Conformations

The Internal Degrees of Freedom of a Protein

Bonds and Bond Length

Bond Angles

Dihedral Angles

Dihedral Representation of Protein Conformations

Protein Forward Kinematics

Mathematical Background: Matrices and Transformations

Forward Kinematics

A Simple Approach

Denavit-Hartenberg Local Frames


Protein Inverse Kinematics and the Loop Closure Problem


Background Material

Inverse Kinematics and its Relevance to Proteins

Solving Inverse Kinematics

Inverse Kinematics Methods

Classic Inverse Kinematics Methods

Inverse Kinematics Methods with Optimization

Cyclic Coordinate Descent and Its Application to Proteins


Molecular Shapes and Surfaces



Representing Shape


Computing the Alpha-Shape: Delaunay Triangulation

Weighted Alpha Shapes

Calculating Molecular Volume Using α-Shapes



Molecular Distance Measures


Comparing Molecular Conformations


Optimal Alignment for lRMSD Using Rotation Matrices

Optimal Alignment for lRMSD Using Quaternions

Introduction to Quaternions

Quaternions and Three-Dimensional Rotations

Optimal Alignment with Quaternions

Intramolecular Distance and Related Measures


Protein Classification, Local Alignment, and Motifs


Protein Classification

Protein Alignment

Protein Classification

Local Matching: Geometric Hashing, Pose Clustering and Match Augmentation


Protein Function Prediction

Identification of Matches

Match Augmentation

Seed Matching


Filtering Matches

Designing Effective Motifs


Dimensionality Reduction Methods for Molecular Motion



Dimensionality Reduction

Principal Components Analysis

PCA of conformational data

Non-Linear Methods

Isometric Feature Mapping (Isomap)


Robotic Path Planning and Protein Modeling


Proteins as Robotic Manipulators

Robotic Path Planning


The Path Planning Problem

Sampling-Based Path Planning

Sampling Based Planners for Proteins


Motion Planning for Proteins: Biophysics and Applications


Free Energy and Potential Functions

Free Energy

Potential Functions

Terms of energy functions


Bond Angles


Van der Waals Interactions and Steric Clash

Electrostatic Interactions

Other Classes of Interactions


An Example: The CHARMM All-Atom Empirical Potential

Applications of Roadmap Methods

Kinetics of Protein Folding

A PRM-Based Approach

Stochastic Roadmap Simulations

Markovian State Models

Protein-Ligand Docking Pathways and Kinetics


Protein-Ligand Docking, Including Flexible Receptor-Flexible Ligand Docking


Background and Motivation

Components of a Docking Program

Ligand placement algorithm

Scoring function

Explicit force field scoring function

Empirical scoring functions

Knowledge-based scoring functions

Rigid Receptor Docking

Parameterization of the Problem

Examples of rigid-receptor docking programs

Autodock 3.0

Search technique

Scoring function


Search technique

Scoring function


Search technique

Scoring function


Flexible Receptor Docking


Flexibility Representations

Soft Receptors

Selection of Specific Degrees of Freedom

Multiple Receptor Structures

Molecular Simulations

Collective Degrees of Freedom


Chapter 1. Homework assignments

1.1. Assignment 1: Visualization and Ranking of Protein Conformations

Protein Data Bank

Visualizing Protein Conformations

A. Visualizing a Set of Conformations

B. Molecules in Motion

Ranking Conformations

Visualizing Protein Substructures

Structurally Classifying Proteins

For Submission

Appendix: Installing VMD

1.2. Assignment 2: Performing Rotations

"Defining the Connectivity of a Backbone Chain"

Dihedral Rotations

Setup with Matlab

Ranking by Energy

For Submission

1.3. Assignment 3: Inverse Kinematics

Motivation for Inverse Kinematics in Proteins

Inverse Kinematics for a Polypeptide Chain

Setup with Matlab

For Submission



Structural Computational Biology: Introduction and Background


Topics in this Module

Proteins and Their Significance to Biology and Medicine

Protein Structure

Experimental Methods for Protein Structure Determination

Protein Structure Repositories

Visualizing Protein Structures

Proteins and Their Significance to Biology and Medicine

Proteins are the molecular workhorses of all known biological systems. Among other functions,

they are the motors that cause muscle contraction, the catalysts that drive life-sustaining chemical

processes, and the molecules that hold cells together to form tissues and organs.

The following is a list of a few of the diverse biological processes mediated by proteins:

Proteins called enzymes catalyse vital reactions, such as those involved in metabolism, cellular

reproduction, and gene expression.

Regulatory proteins control the location and timing of gene expression.

Cytokines, hormones, and other signalling proteins transmit information between cells.

Immune system proteins recognize and tag foreign material for attack and removal.

Structural proteins prevent cells from collapsing on themselves and also form large

structures such as hair, nails, and the protective, largely impermeable outer layer of skin. They

also provide a framework along which molecules can be transported within cells.

The estimated number of genes in the human genome has changed dramatically since the genome

was first annotated (the latest gene count estimates can be found in this Wikipedia article on the

human genome). Each gene encodes one or more distinct proteins. The total number of distinct proteins in the human body is larger than the number of genes because of alternative splicing. Of those, only a small fraction have been isolated and studied to the point that their purpose and mechanism

of activity are well understood. If the functions of and relationships between all proteins were fully

understood, we would most likely have a much better understanding of how our bodies work and

what goes wrong in diseases such as cancer, amyotrophic lateral sclerosis, Parkinson's, heart

disease and many others. As a result, protein science is a very active field. As the field has

progressed, computer-aided modeling and simulation of proteins have found their place among the

methods available to researchers.

Protein Structure

An amino acid is a simple organic molecule consisting of a basic (hydrogen-accepting) amine

group bound to an acidic (hydrogen-donating) carboxyl group via a single intermediate carbon atom.


Figure 1. An α-amino acid

A generic α-amino acid. The "R" group is variable, and is the only difference between the 20 common amino acids. This form is called a zwitterion, because it has both positively and negatively charged atoms. The zwitterionic state results from the amine group (NH2) gaining a hydrogen ion from solution, and the acidic group (COOH) losing one.

During the translation of a gene into a protein, the protein is formed by the sequential joining of

amino acids end-to-end to form a long chain-like molecule, or polymer. A polymer of amino

acids is often referred to as a polypeptide. The genome is capable of coding for 20 different

amino acids whose chemical properties depend on the composition of their side chains ("R" in the

above figure). Thus, to a first approximation, a protein is nothing more than a sequence of these

amino acids (or, more properly, amino acid residues, because both the amine and acid groups lose

their acid/base properties when they are part of a polypeptide). This sequence is called the

primary structure of the protein.
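As a rough illustration of primary structure as a plain sequence, the sketch below stores a short peptide as three-letter residue names and converts it to the standard one-letter code. The peptide and the function name `to_one_letter` are invented for this example and are not taken from the text.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(residues):
    """Return the primary structure as a one-letter string, in chain order."""
    return "".join(THREE_TO_ONE[name.upper()] for name in residues)


# Hypothetical pentapeptide, used only as an example.
peptide = ["Met", "Ala", "Gly", "Lys", "Trp"]
print(to_one_letter(peptide))
```

A string like this captures only the order of residues; it says nothing yet about how the chain folds in space.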

Figure 2. A polypeptide

A generic polypeptide chain. The bonds shown in yellow, which connect separate amino acid residues, are called peptide bonds.





The Wikipedia entry on amino acids provides a more detailed background, including the structure, properties, abbreviations, and genetic codes for each of the 20 common amino acids.

The primary structure of a protein is easily obtainable from its corresponding gene sequence, as

well as by experimental manipulation. Unfortunately, the primary structure is only indirectly

related to the protein's function. In order to work properly, a protein must fold to form a specific

three-dimensional shape, called its native structure or native conformation. The three-

dimensional structure of a protein is usually understood in a hierarchical manner. Secondary

structure refers to folding in a small part of the protein that forms a characteristic shape. The

most common secondary structure elements are α-helices and β-sheets, one or both of which are

present in almost all natural proteins.

Figure 3. Secondary Structure: α-helix

α-helices, rendered three different ways. Left is a typical cartoon rendering, in which the helix is depicted as a cylinder. Center shows a trace of the backbone of the protein. Right shows a space-filling model of the helix, and is the only rendering that shows all atoms (including those on side chains).

(a) Cartoon representation. Different parts of the polypeptide strand align with each other to form a β-sheet. This β-sheet is anti-parallel, because adjacent segments of the protein run in opposite directions.

(b) Ribbon representation. β-sheets are sometimes referred to as β-pleated sheets, because of the regular zig-zag of the strands evident in this representation.

(c) Bond representation. Each segment in this representation represents a bond. Unlike the other two representations, side chains are illustrated. Note the alignment of oxygen atoms (red) toward nitrogen atoms (blue) on adjacent strands. This alignment is due to hydrogen bonding, the primary interaction involved in stabilizing secondary structure.

Figure 4. Secondary Structure: β-sheet

Beta-sheets represented in three different rendering modes: cartoon, ribbon, and bond representations.

Tertiary structure refers to structural elements formed by bringing more distant parts of a chain

together into structural domains. The spatial arrangement of these domains with respect to each

other is also considered part of the tertiary structure. Finally, many proteins consist of more than

one polypeptide folded together, and the spatial relationship between these separate polypeptide

chains is called the quaternary structure. It is important to note that the native conformation of a

protein is a direct consequence of its primary sequence and its chemical environment, which for

most proteins is either aqueous solution with a biological pH (roughly neutral) or the oily interior

of a cell membrane. Nevertheless, no reliable computational method exists to predict the native

structure from the amino acid sequence, and this is a topic of ongoing research. Thus, in order to

find the native structure of a protein, experimental techniques are deployed. The most common

approaches are outlined in the next section.

Experimental Methods for Protein Structure Determination

A structure of a protein is a three-dimensional arrangement of the atoms such that the integrity of

the molecule (its connectivity) is maintained. The goal of a protein structure determination

experiment is to find a set of three-dimensional (x, y, z) coordinates for each atom of the molecule

in some natural state. Of particular interest is the native structure, that is, the structure assumed by

the protein under its biological conditions, as well as structures assumed by the protein when in

the process of interacting with other molecules. Brief sketches of the major structure

determination methods follow:

X-ray Crystallography

The most commonly used and usually highest-resolution method of structure determination is x-

ray crystallography. To obtain structures by this method, laboratory biochemists obtain a very

pure, crystalline sample of a protein. X-rays are then passed through the sample, in which they are

diffracted by the electrons of each atom of the protein. The diffraction pattern is recorded, and can

be used to reconstruct the three-dimensional pattern of electron density, and therefore, within

some error, the location of each atom. A high-resolution crystal structure has a resolution on the

order of 1 to 2 Angstroms (Å). One Angstrom is roughly the diameter of a hydrogen atom (10^-10 meter,

or one hundred-millionth of a centimeter).

Unlike other structure determination methods, with x-ray crystallography, there is no fundamental

limit on the size of the molecule or complex to be studied. However, in order for the method to

work, a pure, crystalline sample of the protein must be obtained. For many proteins, including

many membrane-bound receptors, this is not possible. In addition, a single x-ray diffraction

experiment provides only static information - that is, it provides only information about the native

structure of the protein under the particular experimental conditions used. As we will see later,

proteins are often flexible, dynamic objects when in their natural state in solution, so a single

structure, while useful, may not tell the full story. More information on X-ray Crystallography is

available at Crystallography 101 and on Wikipedia.


Nuclear Magnetic Resonance (NMR) spectroscopy has recently come into its own as a protein

structure determination method. In an NMR experiment, a very strong magnetic field is

transiently applied to a sample of the protein being studied, forcing any magnetic atomic nuclei

into alignment. The signal given off by a nucleus as it returns to an unaligned state is

characteristic of its chemical environment. Information about the atoms within two chemical

bonds of the resonating nucleus can be deduced, and, more importantly, information about which

atoms are spatially near each other can also be found. The latter information leads to a large

system of distance constraints between the atoms of the protein, which can then be solved to find a

three-dimensional structure. Resolution of NMR structures is variable and depends strongly on the

flexibility of the protein. Because NMR is performed on proteins in solution, they are free to

undergo spatial rearrangements, so for flexible parts of the protein, there may be more than

one detectable structure. In fact, NMR structures are generally reported as ensembles of 20-50

distinct structures. This makes NMR the only structure determination technique suited to

elucidating the behavior of intrinsically unstructured proteins, that is, proteins that lack a well-

defined tertiary structure. The reported ensemble may also provide insight into the dynamics of

the protein, that is, the ways in which it tends to move.
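To make the idea of a system of distance constraints concrete, here is a minimal sketch that checks a candidate structure against a list of restraints. It assumes, as a simplification, that each restraint is an upper bound on the distance between two atoms; the coordinates, bounds, and function names are all invented for illustration. It only tests a structure and does not solve for one.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y, z) points, in Angstroms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def violations(coords, restraints):
    """Return the restraints (i, j, upper) whose atoms lie farther apart than `upper`."""
    return [(i, j, upper) for i, j, upper in restraints
            if distance(coords[i], coords[j]) > upper]


# Toy data: three atoms and two upper-bound restraints (all values invented).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
restraints = [(0, 1, 5.0), (0, 2, 5.0)]
print(violations(coords, restraints))
```

A structure determination program would search for coordinates that leave this list empty, or nearly so; the many structures that satisfy the restraints equally well are one reason ensembles are reported.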

NMR structure determination is generally limited to proteins smaller than 25-30 kilodaltons

(kDa), because the signals from different atoms start to overlap and become difficult to resolve in

that range. Additionally, the proteins must be soluble in concentrations of 0.2-0.5 mM without

aggregation or precipitation. For more information on how NMR is used to find molecular

structures, please see NMR Basics and The World of NMR: Magnets, Radio Waves, and

Detective Work at the National Institutes of Health's The Structures of Life website.

Electron Diffraction

Electron diffraction works under the same principle as x-ray crystallography, but instead of x-

rays, electrons are used to probe the structure. Because of difficulties in obtaining and interpreting

electron diffraction data, it is rarely used for protein structure determination. Nevertheless, ED

structures do exist in the PDB. For more on ED, see this Wikipedia article.

Structure Prediction of Large Complexes

Large macromolecular complexes and molecular machines present a particular challenge in

structure determination. These objects are generally too large to be crystallized and too complex to

solve by NMR, so determining their structure usually requires high-resolution

microscopy combined with computational refinement and analysis. The main techniques used are

cryo-electron microscopy (Cryo-EM) and standard light microscopy.

Protein Structure Repositories

Most of the protein structures discovered to date can be found in a large protein repository called

the RCSB Protein Data Bank (PDB). The PDB is a public domain repository that contains experimentally determined three-dimensional structures of proteins. The

majority of the proteins in the PDB have been determined by x-ray crystallography, but the

number of proteins determined using NMR methods has been increasing as efficient

computational techniques to derive structures from NMR data have been developed. A few

electron diffraction structures are also available. The PDB was originally established at

Brookhaven National Laboratory in October, 1971, with 7 structures. Currently, the database is

maintained by Rutgers University, the State University of New Jersey, the San Diego

Supercomputer Center at the University of California, San Diego, and the National Institute of

Standards and Technology. The current number of proteins (and/or nucleic acids) in the PDB

database is displayed at the top-right corner of the main PDB page. The imaging method statistics

of these structures (i.e., which methods were used for what fraction of the structures), as well as

other classifications, can be found here. The European Bioinformatics Institute Macromolecular Structure Database group (UK) and the Institute for Protein Research at Osaka University (Japan)

are international contributors to the contents of the PDB.
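Structures downloaded from the PDB in its classic text format store one atom per line in fixed columns. The sketch below reads atom names, residue names, chain identifiers, and (x, y, z) coordinates from such lines. The column positions follow the published PDB file format; the function names and the sample coordinates are invented for illustration, and `make_atom_line` exists only to build demo input.

```python
def parse_atoms(pdb_text):
    """Collect atom records from PDB-format text using its fixed column layout."""
    atoms = []
    for line in pdb_text.splitlines():
        if line[:6] in ("ATOM  ", "HETATM"):
            atoms.append({
                "name": line[12:16].strip(),      # columns 13-16: atom name
                "res_name": line[17:20].strip(),  # columns 18-20: residue name
                "chain": line[21],                # column 22: chain identifier
                "res_seq": int(line[22:26]),      # columns 23-26: residue number
                "xyz": (float(line[30:38]),       # columns 31-38: x in Angstroms
                        float(line[38:46]),       # columns 39-46: y
                        float(line[46:54])),      # columns 47-54: z
            })
    return atoms


def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper that formats one ATOM record for the demo below."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Invented coordinates for part of a single glycine residue.
sample = "\n".join([
    make_atom_line(1, " N  ", "GLY", "A", 1, 11.104, 13.207, 2.100),
    make_atom_line(2, " CA ", "GLY", "A", 1, 12.560, 13.329, 2.010),
])
for atom in parse_atoms(sample):
    print(atom["name"], atom["res_name"], atom["chain"], atom["res_seq"], atom["xyz"])
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real files can touch with no space between them.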

Visualizing Protein Structures

Numerous tools are available for visualizing the structures stored in the PDB and other

repositories. Most such tools allow a detailed examination of the molecule in a variety of

rendering modes. For example, sometimes it may be useful to have a detailed image of the surface

of the molecule as experienced by a molecule of water. For other purposes, a simple, cartoonish

representation of the major structural features may be sufficient.

A Few Molecular Visualization Programs

Visual Molecular Dynamics (VMD) was originally developed for viewing molecular simulation trajectories. It is a very powerful, full-featured, and customizable molecular viewing

package. Customization is available using Tcl/Tk scripting. Information on Tcl/Tk scripting

can be found at this Tcl/Tk website.

PyMol is an open-source molecular viewer that can be used to generate professional-looking images. PyMol is highly customizable through the Python scripting language.

Protein Explorer is an easy-to-use, web browser-based visualization tool. Protein Explorer is

built using the MDL Chime browser plugin, which in turn is based on the RasMol viewer.

Because Chime only works under Windows and Macintosh OS, the use of Protein Explorer is

restricted to those platforms.

JMol is a Java-based molecular viewer. In applet form, it can be downloaded on-the-fly to view structures from the web. A stand-alone version also exists, which can be used independently of

a web browser.

Chimera is a powerful visualizer and analysis tool that can be comfortably used with very large molecular complexes. It can also produce very high-quality images for use in

presentations and publications.

Visualizing HLA-AW with VMD

What follows is a very brief introduction to what can be done with VMD. Only the most

basic viewing functionality will be discussed. For a complete description of the capabilities of

VMD and how to use them, please refer to the VMD web site.

In this section, a human leukocyte antigen, HLA-AW (PDB structure ID 2HLA), will

be shown under various rendering methods in VMD. This section is intended to convey, first, a

general idea of the types of visual representations that are available for protein structures, and

second, what information is and is not conveyed by each representation.

VMD allows the user to load and view molecule description files in a wide variety of common

formats, including trajectory files with multiple structures of the same molecule, such as might be

generated by a simulation. Once the molecules are loaded, the way each molecule is rendered may

be controlled using the Graphical Representations menu:

(a) VMD Graphical Representations menu. This menu allows the user to control in detail how each molecule is rendered.

(b) VMD atom coloring methods. Coloring schemes to highlight features of interest.

(c) VMD molecule drawing methods. Rendering methods in VMD. Which one to use depends on the features to be displayed.

Figure 5.

The built-in rendering options of VMD.

Molecules may be displayed by various rendering modes:



Figure 6. HLA-AW. Drawing method: LINES. Coloring method: NAME

In this representation, each line represents a bond between two atoms. The color of each half-bond corresponds to the element of the atom at the corresponding end of the bond (red for oxygen, blue for nitrogen, yellow for sulfur, and teal for carbon). Line

representation gives a clear idea of the molecule's connectivity, but for large molecules it can be difficult to isolate protein substructures.

Figure 7. HLA-AW. Drawing method: VDW. Coloring method: NAME

Here each atom is represented by a sphere whose radius is the Van der Waals radius of the atom. The Van der Waals radius is half the separation of unbonded atoms packed as tightly as possible, and provides a rough notion of a collision radius, although it is not a firm barrier. This representation of the molecule gives a rough sense of its shape, and is sometimes called a space-filling model.
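The notion of a Van der Waals radius as a soft collision radius can be sketched in a few lines. The radii below are commonly tabulated values in Angstroms; the function names and the 0.4 Å tolerance are arbitrary choices made for this illustration, not values from the text. The check is meant for atoms that are not bonded to each other, since bonded atoms always sit closer than the sum of their radii.

```python
import math

# Commonly tabulated Van der Waals radii in Angstroms (Bondi values).
VDW_RADIUS = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def overlap(elem_a, pos_a, elem_b, pos_b):
    """Return how far two atom spheres interpenetrate (0.0 if they do not)."""
    d = math.dist(pos_a, pos_b)
    return max(0.0, VDW_RADIUS[elem_a] + VDW_RADIUS[elem_b] - d)


def is_clash(elem_a, pos_a, elem_b, pos_b, tolerance=0.4):
    """Flag a steric clash only when the overlap exceeds an assumed tolerance."""
    return overlap(elem_a, pos_a, elem_b, pos_b) > tolerance


# A carbon and an oxygen placed 3.0 and 2.5 Angstroms apart (invented positions).
print(round(overlap("C", (0.0, 0.0, 0.0), "O", (3.0, 0.0, 0.0)), 2))
print(is_clash("C", (0.0, 0.0, 0.0), "O", (3.0, 0.0, 0.0)))
print(is_clash("C", (0.0, 0.0, 0.0), "O", (2.5, 0.0, 0.0)))
```

The tolerance reflects the point made above that the radius is not a firm barrier: a small overlap is treated as acceptable, a large one as a collision.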


Figure 8. HLA-AW. Drawing method: VDW. Coloring method: CHAIN

This rendering is the same as in the previous figure, except that now the atoms are colored based on which polypeptide chain they belong to. HLA-AW consists of two chains, the alpha chain (blue), which folds into three domains, and the smaller β2 microglobulin (red), which is a component of a whole class of HLA proteins. Coloring by chain allows an inspection of how the polypeptide subunits come together to form the whole quaternary structure of the protein. The black balls are water molecules near the surface of the protein that always appear in the same place in crystal structures, and may therefore be considered part of the structure for some applications.


Figure 9. HLA-AW. Drawing method: SURF. Coloring method: CHAIN

The Surf drawing mode renders a surface swept out by a sphere of some set size skimming the protein. Usually, this size is

approximately that of a water molecule, in which case the rendered surface is very similar to the solvent-accessible surface. Note that it is impossible to deduce the connectivity of the atoms from this image or from the space-filling image in the previous figure.

Overall shape, rather than connectivity, is the information conveyed by these representations. Hence, both backbone-based and surface-based renderings are necessary to fully understand a protein's structure.


Figure 10. HLA-AW. Drawing method: SURF. Coloring method: CHAIN

Here the protein has been rotated approximately 90 degrees toward the viewer, so that, compared to the previous image, we are looking down from above. The deep groove running from the top left to lower right is the binding pocket of the protein.


Figure 11. HLA-AW. Drawing method: CARTOON. Coloring method: CHAIN

Cartoon rendering places an emphasis on secondary structure. Beta sheets appear as flattened arrows, and alpha helices appear as cylinders. These are common conventions in representing protein secondary structure. By examining this image, we can see that the walls of the binding pocket observed in the previous figure consist of alpha helices, and the floor is an anti-parallel beta sheet. In anti-parallel beta sheets, adjacent strands run in opposite directions (notice that the arrow points alternate in direction). Note that this representation only conveys information about the backbone connectivity of the protein. Side chain atoms are omitted, and therefore the overall shape is only a very coarse approximation.


Figure 12. HLA-AW. Drawing method: SURF. Coloring method: RESTYPE

Alternative coloring methods can provide additional insight into a protein's structure and function. Here each atom is colored based on whether the side chain of the amino acid residue to which it belongs is acidic (red), basic (blue), polar neutral (green), or apolar (gray). Note that residues on the surface of the protein tend to be hydrophilic (attracted to water, in red, blue, and green), whereas residues closer to the core of the protein tend to be hydrophobic (greasy or water-repellent, in gray). This is characteristic of proteins that exist in aqueous solution in nature. Their native structure is stabilized by a tendency for the hydrophilic residues to interact with the solvent water molecules, while the hydrophobic residues are driven together away from the solvent. Clusters of hydrophobic residues on the surface often indicate a location that is usually protected from solvent in the natural state, either by interaction with another molecule or by part of the protein itself.
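Coloring by residue type amounts to a lookup from residue name to a chemical class, and from class to color. The sketch below uses one common textbook grouping of the 20 amino acids together with the colors named in the caption above. The grouping is an assumption: VMD's own assignment may differ for borderline residues such as glycine, cysteine, or histidine. The names `RESIDUE_CLASS`, `CLASS_COLOR`, and `residue_color` are invented for this example.

```python
from collections import Counter

# One common textbook grouping of side chains (assumed, not taken from VMD).
RESIDUE_CLASS = {
    "ASP": "acidic", "GLU": "acidic",
    "LYS": "basic", "ARG": "basic", "HIS": "basic",
    "SER": "polar", "THR": "polar", "ASN": "polar", "GLN": "polar",
    "TYR": "polar", "CYS": "polar",
    "ALA": "apolar", "VAL": "apolar", "LEU": "apolar", "ILE": "apolar",
    "MET": "apolar", "PHE": "apolar", "TRP": "apolar", "PRO": "apolar",
    "GLY": "apolar",
}

# Colors as described in the figure caption.
CLASS_COLOR = {"acidic": "red", "basic": "blue", "polar": "green", "apolar": "gray"}


def residue_color(res_name):
    """Return the display color for a three-letter residue name."""
    return CLASS_COLOR[RESIDUE_CLASS[res_name.upper()]]


# Hypothetical stretch of residues, tallied by class.
segment = ["LEU", "ASP", "LYS", "SER", "VAL", "GLU"]
print([residue_color(r) for r in segment])
print(Counter(RESIDUE_CLASS[r] for r in segment))
```

Tallying classes separately for surface and core residues would be one simple way to check the hydrophilic-outside, hydrophobic-inside tendency described above.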

Visualizing HLA-AW with Protein Explorer

Protein Explorer is designed as a user-friendly but fairly full-featured visualizer. It is not as

scriptable or as powerful as some other visualizers such as VMD and PyMol, but it is one of the

quickest and easiest to get started with. It is used through a web browser, either by accessing it

through the Protein Explorer website (via the Quick-Start Protein Explorer link), or as an offline version, downloadable from this page. Both versions require the MDL Chime molecular viewing plugin, which you can download from here (registration required).

As with VMD above, a human leukocyte antigen, HLA-AW (PDB structure ID 2HLA),

will be shown in various renditions.

Upon opening, Protein Explorer will load a default molecule and display it (this feature may be

disabled via a setting under "preferences" in the lower left frame):


Figure 13. Protein Explorer at Startup

The interface contains three areas. The frame on the right contains the rendering window, where the molecule is displayed. The lower left frame contains an input box for text commands and a text box that displays general text output from the program: What commands have been executed, what the program is currently doing, etc. The top left frame generally contains the user interface in the form of buttons and links. Its exact contents vary with use.

Clicking on the "PE Site Map" link pops up a window containing Protein Explorer's top-level




Figure 14. Protein Explorer Site Map Window

Each option contains a helpful tooltip which can be seen by hovering the mouse cursor over it. "New Molecule" allows the user to load a molecule either directly from the PDB or from the local filesystem. "Reset Session" returns to the default view and rendering style, which can be a useful shortcut. "Quick Views" opens up a menu from which the user can select how the molecule is rendered.

Once a molecule is loaded, the "Quick Views" menu allows the user to control how it is displayed:

Figure 15. Protein Explorer QuickViews Interface


The "SELECT" pulldown menu allows the user to pick a group of atoms based on their properties, their location, the structural elements in which they are involved, or by directly clicking them. The "DISPLAY" pulldown menu then allows the user to determine the style in which the selected atoms are rendered. Most of the styles available through VMD are also available in Protein Explorer.

The "COLOR" pulldown menu allows the user to determine how the atoms are colored. Options include coloring by secondary structure elements, atom type, subunit (chain), a spectrum from end to end of the protein, and by properties such as charge and polarity.

Figure 16. Protein Explorer: HLA-AW Backbone Rendering

This rendering mode shows the protein backbone (no side chains) through the alpha carbons of each amino acid residue. It gives the user a sense of how the chains fold to form the structure, but not its full shape, since all side chain atoms have been removed. The yellow bars are disulfide bonds, which are covalent bonds that lock distant parts of the chain together to help maintain the structure.



Figure 17. Protein Explorer: HLA-AW Cartoon Style

Cartoon rendering works as for VMD. As in the backbone rendering above, side chains are ignored, and the protein backbone is rendered as a smoothly curving tube. Beta sheets appear as flattened arrows, and alpha helices appear as spiraling ribbons.

Figure 18. Protein Explorer Advanced Explorer Menu

More advanced rendering methods are available through the Advanced Explorer Menu.



Figure 19. Protein Explorer Surfaces Menu

The Surfaces menu allows the user to display the surface of the protein. Several variables are available, including the radius of the probe used to define the surface, as well as several methods of coloring the surface based on chemical and physical properties.

Figure 20. Protein Explorer: HLA-AW Surface Rendering

This rendering style shows the surface of the protein accessible to water. This image is tilted 90 degrees toward the viewer from the previous images.


Figure 21. Protein Explorer: HLA-AW Superimposed Images

By setting the surface to be transparent, it is possible to superimpose another rendering style over it, and see how it fits into the surface. This can convey an idea of how the fold of the chain relates to the overall three-dimensional shape of the protein.

Recommended Reading and Resources:

A detailed introduction to protein structure and function can be found in most introductory

biochemistry textbooks. For example, Lehninger Principles of Biochemistry, 4th Edition, by D.

L. Nelson and M. Cox (sections 2.1, 3.1-3.5, 4.1-4.4, 5.1-5.3).

The Structures of Life at the NIH web site. This site is an introduction to protein structure, structure determination methods, drug design techniques, and other applications of structural

biology.
Protein Structure and Function, by Gregory A. Petsko and Dagmar Ringe. This book provides

an overview of the basic biochemistry of structural biology. Topics covered include protein

structure, mechanisms of protein function, regulation of protein function, and case studies of

the kinds of problems that arise in structural biology.

The MIT Biology Hypertextbook. This online textbook provides introductory level coverage of the field of microbiology. It includes cell biology, protein biochemistry, genetics, metabolism, and molecular biology. New content is typically added over time.

Artificial Intelligence and Molecular Biology. This online book includes chapters on classifying protein structures, predicting protein structure, and analyzing crystallographic and NMR data to determine protein structure. Of particular interest to readers of the current page who have a computer science background but need to understand more of the basic underlying biology is Chapter 1: Molecular Biology for Computer Scientists.

Representing Proteins in Silico and Protein Forward Kinematics


Topics in this Module

Modeling Proteins on a Computer

Cartesian Representation of Protein Conformations

The Internal Degrees of Freedom of a Protein

Dihedral Representation of Protein Conformations

Protein Forward Kinematics

Mathematical Background: Matrices and Transformations

Forward Kinematics

A simple approach

Denavit-Hartenberg Local Frames

Modeling Proteins on a Computer

In order to construct efficient, maintainable software that manipulates protein structures, a suitable way to store these structures has to be adopted. Depending on the application, different representations have different advantages and disadvantages from a software perspective. For example, when designing simple visualization software, the Cartesian (x,y,z) coordinates of each atom are useful and simple to render on the screen. However, if the program is to manipulate bond angles and bond lengths, a representation based on the internal degrees of freedom (see below) may be more appropriate. Some applications may even need to store more than one representation at a time; for example, a simulation program that needs to compute a protein's potential energy, which is a function of both Cartesian and internal coordinates, would benefit from keeping both representations at the same time.

The structure of a protein is the set of atoms it contains, and the bonds that join them, that is, its

inherent connectivity. A particular geometric shape of a protein (that is, the spatial arrangement of

the atoms in the molecule) is called its conformation. Thus, a given protein structure can have

many different conformations. Next, we discuss the two most common ways to model protein

structures and conformations for software applications: Cartesian and Dihedral representations.


Cartesian Representation of Protein Conformations

The most essential information for modeling a protein structure is the relative position of each atom, given as (x,y,z) Cartesian coordinates. Experimental methods such as X-ray crystallography, Nuclear Magnetic Resonance (NMR), and Cryogenic Electron Microscopy (Cryo-EM) are used to obtain relative atom positions from protein crystals and solutions. This is precisely the information provided by Protein Data Bank (PDB) format coordinate files:

Figure 22. First 19 atom coordinate records of PDB entry 2HLA

The third column lists the atom type and the seventh, eighth, and ninth columns contain the x, y, and z coordinates of each atom.

These Cartesian coordinates are given relative to a reference frame determined by the experimental technique. The choice of frame is not important, because the conformation is uniquely specified by the relative positioning of the atoms.
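As an illustration of how such a coordinate file can be read into a Cartesian representation, the following minimal Python sketch parses ATOM records using the fixed column positions of the PDB format (atom name in columns 13-16, residue name in columns 18-20, and x, y, z in columns 31-38, 39-46, and 47-54). The names `Atom`, `parse_atom_record`, and `read_atoms` are hypothetical, and the sample record contains made-up coordinates; it is not taken from entry 2HLA.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    name: str       # atom name, e.g. "CA"
    res_name: str   # residue (amino acid) name, e.g. "GLY"
    x: float
    y: float
    z: float


def parse_atom_record(line: str) -> Atom:
    """Parse one ATOM record using the fixed column layout of the PDB format."""
    return Atom(
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


def read_atoms(lines):
    """Collect the atoms from all ATOM records in an iterable of lines."""
    return [parse_atom_record(line) for line in lines if line.startswith("ATOM")]


# Illustrative record with made-up coordinates (not from a real entry).
sample_record = "ATOM      1  N   GLY A   1      11.104  13.207   2.100"
atom = parse_atom_record(sample_record)
print(atom.name, atom.res_name, atom.x, atom.y, atom.z)
```

Slicing by column position rather than splitting on whitespace is a deliberate choice in this sketch: neighboring fields in a PDB record are not always separated by a space.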

The coordinates and type of each atom, together with the amino acid type they belong to, are

sufficient information to reconstruct the connectivity (bonding) of a protein, and therefore

sufficient to render an image of the protein. If one wishes to allow the protein to move in a

realistic fashion, however, more information may be necessary.
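One way to picture how connectivity follows from atom and amino acid types is a lookup table of bonds per residue type. The sketch below is only an illustration under simplifying assumptions: the table `RESIDUE_BONDS` and the function `chain_bonds` are hypothetical names, only heavy atoms of glycine and alanine are listed, hydrogens and the terminal oxygen are omitted, and a single unbranched chain is assumed.

```python
# Intra-residue heavy-atom bonds for two amino acids (hydrogens omitted).
# This table is an illustrative assumption covering only GLY and ALA.
RESIDUE_BONDS = {
    "GLY": [("N", "CA"), ("CA", "C"), ("C", "O")],
    "ALA": [("N", "CA"), ("CA", "C"), ("C", "O"), ("CA", "CB")],
}


def chain_bonds(sequence):
    """Return bonds as pairs of (residue_index, atom_name) for a list of residue names."""
    bonds = []
    for i, res_name in enumerate(sequence):
        # Bonds inside residue i come straight from the template table.
        for a, b in RESIDUE_BONDS[res_name]:
            bonds.append(((i, a), (i, b)))
        # The peptide bond joins the carbonyl C of residue i to the N of residue i+1.
        if i + 1 < len(sequence):
            bonds.append(((i, "C"), (i + 1, "N")))
    return bonds


for bond in chain_bonds(["GLY", "ALA"]):
    print(bond)
```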

The Internal Degrees of Freedom of a Protein

The degrees of freedom of a system are a set of parameters that may be varied independently to

define the state of the system. For example, the location of a point in the Cartesian 2D plane may

be defined as a displacement along the x-axis and a displacement along the y-axis, given as an (x,y)

pair. It may also be given as a rotation about the origin by θ degrees and a distance r from the

origin, given as an (r,θ) pair. In either case, a point moving freely in a plane has exactly two

degrees of freedom.
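The equivalence of the two parameterizations can be made concrete with a short Python sketch. The function names are hypothetical, and θ is measured in degrees counterclockwise from the positive x-axis, which is an assumption of this example. Either pair of numbers determines the same point, so both descriptions use two degrees of freedom.

```python
import math


def cartesian_to_polar(x, y):
    """Convert an (x, y) pair to an (r, theta) pair, with theta in degrees."""
    r = math.hypot(x, y)
    theta = math.degrees(math.atan2(y, x))
    return r, theta


def polar_to_cartesian(r, theta):
    """Convert an (r, theta) pair, with theta in degrees, back to (x, y)."""
    t = math.radians(theta)
    return r * math.cos(t), r * math.sin(t)


r, theta = cartesian_to_polar(1.0, 1.0)
x, y = polar_to_cartesian(r, theta)
print(r, theta, x, y)
```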

As mentioned before, the spatial arrangement of the atoms in a protein constitutes its conformation. In the PDB coordinate file above, we can see that one obvious way to define a protein conformation is by giving x, y, and z coordinates for each atom, relative to some arbitrary origin. These are not independent degrees of freedom, however, because atoms within a molecule are not allowed to leave the vicinity of their neighboring atoms (if no chemical reaction takes place). Pairs of atoms bonded to each other, for example, are constrained to remain close, so moving one atom causes others connected to it to move in a dependent fashion. In kinematics terminology, this means that the true (effective, or independent) number of degrees of freedom is much smaller than the number of input parameters, which is one (x,y,z) tuple per atom. The remainder of this section defines a set of independent degrees of freedom that more readily model how proteins and other organic molecules can actually move.

Bonds and Bond Length

The atoms in proteins are connected to one another through covalent bonds. Each pair of bonded

atoms has a preferred separation distance called the bond length. The bond length can vary

slightly with a spring-like vibration, and is thus a degree of freedom, but realistic variations in

bond length are so small that most simulations assume it is fixed for any pair of atoms. This is a

very common assumption in the literature and reduces the effective degrees of freedom of a

protein; the remainder of this module makes this assumption.
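Given Cartesian coordinates, a bond length is simply the Euclidean distance between the two bonded atoms. The following Python sketch computes it; the function names and the coordinates are hypothetical, and distances come out in the same units as the input (ångströms for PDB files).

```python
import math


def bond_length(a, b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)


def bond_lengths(coords, bonds):
    """Map each bonded pair of atom names to its length.

    coords: dict from atom name to (x, y, z); bonds: iterable of name pairs.
    """
    return {(a, b): bond_length(coords[a], coords[b]) for a, b in bonds}


# Made-up coordinates for three atoms, for illustration only.
coords = {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (1.5, 1.5, 0.0)}
print(bond_lengths(coords, [("N", "CA"), ("CA", "C")]))
```

Under the fixed bond length assumption, these values stay the same in every conformation of the molecule.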

Although bond lengths will not be allowed to vary in this work, the presence of bonds is still

important because it allows us to represent the connectivity of the protein as an undirected graph

data structure, where the atoms are the nodes and the bonds between them are undirected edges. In

some cases, it is helpful to artificially break any cycles in the graph, and choose an atom from the

interior as an anchor atom. The graph can then be treated as a tree data structure, with the anchor

atom as the root.

Figure 23. A Protein as a Graph Data Structure

A tree-like representation of protein connectivity, for a very small molecule. Cycles are broken by ignoring one bond in each.
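The conversion from a bond graph to a tree can be sketched as a breadth-first traversal from the anchor atom: every bond that first reaches a new atom becomes a tree edge, and any bond that would close a cycle is set aside. This is one possible approach, not necessarily the one used to produce the figure; the function name `build_tree` is hypothetical, and the example uses the five ring atoms of a proline residue.

```python
from collections import deque


def build_tree(bonds, anchor):
    """Turn an undirected bond graph into a tree rooted at the anchor atom.

    Returns (parent, broken): parent maps each reachable atom to its parent
    in the tree (None for the anchor), and broken lists the bonds that were
    ignored because they would have closed a cycle.
    """
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    parent = {anchor: None}
    tree_edges = set()
    queue = deque([anchor])
    while queue:
        atom = queue.popleft()
        # Sorting only makes the traversal order (and the broken bond) reproducible.
        for nxt in sorted(neighbors.get(atom, ())):
            if nxt not in parent:
                parent[nxt] = atom
                tree_edges.add(frozenset((atom, nxt)))
                queue.append(nxt)

    broken = [(a, b) for a, b in bonds if frozenset((a, b)) not in tree_edges]
    return parent, broken


# The five-membered ring of proline contains one cycle.
ring = [("N", "CA"), ("CA", "CB"), ("CB", "CG"), ("CG", "CD"), ("CD", "N")]
parent, broken = build_tree(ring, anchor="CA")
print(parent)
print(broken)
```

Which bond ends up ignored depends on the anchor and on the traversal order; any choice that removes one bond per cycle yields a valid tree.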

Bond Angles

Bond length is an independent degree of freedom given two connected atoms. A set of three atoms

bonded in sequence defines another degree of freedom: the angle between the two adjacent bonds.

This is, appropriately, referred to as the bond angle. The bond angle can be calculated as the angle

between the two vectors corresponding to the bonds from the central atom to each of its neighbors.

As a reminder, the angle between two vectors is the inverse cosine of the ratio of the dot product


of the vectors to the product of their lengths. Like bond lengths, bond angles tend to be

characteristic of the atom types involved, and, with few exceptions, vary little. Thus, like bond

lengths, this module considers all bond angles as fixed (again, this is a common assumption).
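The bond angle computation described above can be written as a short Python sketch. The function name `bond_angle` is hypothetical, atoms are given as (x, y, z) tuples, and the result is returned in degrees. The cosine is clamped to the interval [-1, 1] to guard against floating-point round-off before taking the inverse cosine.

```python
import math


def bond_angle(a, b, c):
    """Angle in degrees at the central atom b, formed by the bonds b-a and b-c."""
    u = [ai - bi for ai, bi in zip(a, b)]  # vector from b to a
    v = [ci - bi for ci, bi in zip(c, b)]  # vector from b to c
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norms = math.hypot(*u) * math.hypot(*v)
    cos_angle = max(-1.0, min(1.0, dot / norms))
    return math.degrees(math.acos(cos_angle))


# Made-up coordinates: two bonds at a right angle.
print(bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```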

Dihedral Angles

In most organic molecules, including proteins, the most important internal degree of freedom is

rotation about dihedral (torsional) angles. A dihedral angle is defined by four consecutively