Geometric Methods in Structural Computational Biology by Lydia Kavraki - HTML preview

PLEASE NOTE: This is an HTML preview only and some elements such as links or page numbers may be incorrect.
Download the book in PDF, ePub, Kindle for a complete version.

quaternions. Journal of the Optical Society of America, 4:629-642.

E. A. Coutsias and C. Seok and K. A. Dill. (2004). Using quaternions to calculate RMSD.

Journal of Computational Chemistry, 25, 1849-1857.

Wallin, S., J. Farwer and U. Bastolla. (2003). Testing similarity measures with continuous

and discrete protein models . Proteins, 50:144-157.

References

1. Kabsch, W. (1976). A Solution for the Best Rotation to Relate Two Sets of Vectors. Acta

Crystallographica, 32, 922-923.

2. Kabsch, W. (1978). A Discussion of the Solution for the Best Rotation to Relate Two Sets of

Vectors. Acta Crystallographica, 34, 827-828.

3. Horn, Berthold K. P. (1986). Closed-form solution of absolute orientation using unit

quaternions. Journal of the Optical Society of America, 4, 629-642.

4. Coutsias, E. A., C. Seok and K. A. Dill. (1978). Using quaternions to calculate RMSD. Journal

4. Coutsias, E. A., C. Seok and K. A. Dill. (1978). Using quaternions to calculate RMSD. Journal

of Computational Chemistry, 25, 1849-1857.

5. Golub, G. H. and Loadn, C. F. V. (1996). Matrix Computations. (third). Johns Hopkins

University Press.

6. Wallin, S., J. Farwer and U. Bastolla. (2003). Testing similarity measures with continuous and

discrete protein models. Proteins, 50, 144-157.

7. Schwarzer, F. and Lotan, I. (2003). Approximation of protein structure for fast similarity

measures. ACM. Proceedings of the seventh annual international conference on research in

computational molecular biology.

Protein Classification, Local Alignment, and Motifs

Topics in this Module

Applications of Molecular Distance Measures

Protein Classification

Protein Alignment

Local Matching: Geometric Hashing, Pose Clustering, and Match Augmentation

In a previous module, the topic of comparing and quantifying the distance between different conformations of a given molecule was explored. Structure-based comparison is also of interest

for distinct proteins, which lack the atom-by-atom correspondence necessary for RMSD

calculations. In this case, an alignment is performed either based on amino acid sequence or on

three-dimensional structure, and the subset of atoms successfully aligned are used as the basis for

calculating conformational distance. Computing distances among entire proteins by doing a global

alignment of their structures is useful for protein classification.

Protein Classification

Protein classification is motivated by the notion of "descriptive biology". When faced with

tremendous amounts of highly complex data, such as with the set of all proteins, one way to

understand the data is by classification: the act of associating or grouping proteins into classes

using certain criteria. One such criterion is protein sequence identity, where sequential similarity

led to the development of phylogenetic trees and multiple sequence analyses. The same is done in

protein structure classification, where the effort is to identify groups of similar proteins, with the

hope that this will yield information about their biochemical function and biological purpose.

Proteins are classified by simultaneously applying a number of criteria, including sequence

homology (evolutionary relatedness), function, folding motifs, structural features, and so on. The

resulting hierarchies and clusters of protein structures provide a notion of the distance between

two proteins and their structures. A couple of popular classification schemes are linked below.

Note that a fair amount of manual annotation and classification was necessary to build these

systems.

Protein Alignment

The core computational problem of protein classification, using sequence or structure, is the

problem of comparing two proteins. For structural classification, one method for comparison is

structural alignment, which identifies an ideal superimposition between two protein structures,

in order to compare them.

SSAP, Dali, Foldminer, Lock, and Geometric Hashing [link] are algorithms which have been

designed in part to align whole protein structures. Despite differences in algorithmic approach, all

of these algorithms essentially evolved from the need to assign the best possible correlation

between points in one structure and points in another. The problem of finding the optimal

alignment is polynomial in the number of atoms in biological data, where we are assured that

atoms cannot fall within a certain distance to each other (Van der Waals forces enforce this), but

without this constraint the problem is exponential.

Protein alignment has been used for the classification and comparison of proteins in many

existing algorithms. These include:

Dali is a structural comparison algorithm based on pairwise distance matrices. Dali uses patterns of residue contacts, similar to contact maps described above in the intramolecular

distances section, in order to align structures. The alignments are found using a randomized

(Monte Carlo) search.

FoldMiner and LOCK 2. FoldMiner finds protein structures similar to an input structure by performing alignment the query structures secondary structure elements with proteins in its

database using the LOCK 2 algorithm. LOCK 2 uses a combination of geometric hashing [link]

and dynamic programming to optimize the alignments of secondary structure elements of

different proteins. Once a set of alignments to similar structures are found, motifs consisting of

similar secondary structure arrangements are constructed and used to refine the similarity

search.

Sequential Structure Alignment Program (SSAP) Given two protein structures, SSAP returns a structural alignment.

Protein Classification

Once alignment algorithms have been implemented, it is possible to explore different

classifications of proteins. Naturally, it would be intuitive to classify proteins solely according to

their structure, but much richer data is available as well. Current classifications of proteins

integrate sequence and structure information in order to maximize their utility. These include the

following:

SCOP (Structural Classification of Protein) is a database of all proteins whose structures have been determined, organized by family (evolutionary relationship), superfamily (structural

and functional similarity), and fold (similar secondary structure, with similar arrangement and

topological connections). SCOP was constructed largely by manual inspection and annotation.

CATH Protein Structure Classification is another database of protein structures, organized

index-82_1.jpg

by Class (secondary structure content), Architecture (orientation of secondary structures),

Topology (overall shape and connectivity), and Homologous Superfamily (evolutionary

relationship inferred based on sequence and structure similarity). CATH uses SSAP (the

Sequential Structural Alignment Program, a secondary structure element-based method) for

structural comparison.

The biological purpose of designing Global Stuructural Alignment algorithms was the

identification, classification, and ultimately prediction, of protein function, under the hypothesis

that protein structure dictates protein function. Simultaneously, it was realized that small changes

to proteins could lead to massive changes in function, or nothing at all. This suggested that global

alignment and global structure comparison (as well as global sequence alignment and global

sequence comparison) should not be the only tools used for function prediction.

In particular, it was realized that active sites, clusterings of amino acids on the surface of proteins

and a tiny minority of the entire protein, were often strongly related to protein function. In a

continuation of the effort to predict protein function through structural comparison, algorithms

were developed to compare functionally relevant substructures of proteins. We refer to these

algorithms collectively as local structure alignment algorithms.

Local Matching: Geometric Hashing, Pose Clustering and Match

Augmentation

Algorithms for local structure alignment address the similar computational problem of selecting a

correspondence between a motif, a tiny substructure of a protein, often between 3 and 20 amino

acids, and a target, a full protein structure. Once a correspondence has been established, the

"distance" of the motif to the identified part of the target is measured using lRMSD.

Figure 67.

Some algorithms for local structure alignment are based on pattern matching algorithms. Pattern

matching algorithms seek target substructures called matches which have maximal geometric

similarity to the motif. An excellent example of this type of algorithm is Geometric Hashing

[link], a very flexible framework for geometric pattern matching under noisy constraints.

Geometric Hashing [link], has been adapted to alignment by atom position [link], by backbone C-alpha [link], multiple structural alignment [link], and alignment of hinge-bending and flexible protein models [link]. Other algorithms for substructural alignment include JESS [link], PINTS

[link], webFEATURE [link], and pvSOAR [link]. The description below concentrates on the work developed in [link].

Motifs

A motif S is a set of m points s 1, ..., s m in three dimensions, whose coordinates are taken from

backbone and side-chain atoms. Each motif point s i in the motif has an associated rank p( s i) , a measure of the functional significance of the motif point. Each s i also has a set of alternate amino

acid labels l( s i) in {GLY, ALA, ...}, which represent residues this amino acid has mutated to

during evolution. Labels permit our motifs to simultaneously represent many homologous active

sites with slight mutations, not just a single active site. In this paper, we obtain labels and ranks

using Evolutionary Trace (ET) [link], [link].

Other motifs have been designed with other approaches. Motifs have been composed of points on

the Connolly surface [link] representing electrostatic potentials [link], of hinge-bending sets of points in space [link], of sets of "pseudo-centers" representing protein-ligand interactions [link], or of points taken from atom coordinates with evolutionary data [link], [link], to name a few.

Depending on how motif points are defined, they have different labels associated with them and

these labels need to be taken into account when comparing motifs.

Protein Function Prediction

Local structure comparison algorithms mainly target the biological problem of Protein Function

Prediction. Ideally, a function prediction pipeline should address the following subproblems:

The design of effective motifs

The efficient identification of matches

The effective filtering of matches

Suppose for a moment that we have hand-designed motifs. In the next section, we concentrate on

an efficient algorithm for local structure comparison.

index-84_1.jpg

Identification of Matches

Many such algorithms exist, but differ fundamentally in that they are optimized for comparing

different types of motifs. There are algorithms for comparing graph-based motifs [link],

algorithms for finding catalytic sites [link], and the seminal Geometric Hashing framework [link]

which can search for many types of motifs, including motifs based on atom position [link], points on Connolly face centers [link], catalytic triads [link], and flexible protein models [link]. Match Augmentation (MA) [link] is described below.

Match Augmentation

Rank information prioritizes motif data and MA was designed in a prioritized fashion, where

correspondences with higher ranked points are identified first. MA is composed of two parts: Seed

Matching and Augmentation. The purpose of Seed Matching is to identify a match for the seed S'

= { s 1, s 2, s 3}, the three highest ranked motif points. The k lowest lRMSD seed matches are passed to Augmentation to be iteratively expanded into matches for the remaining motif points, in

descending rank order. Augmentation outputs the match with smallest lRMSD.

Seed Matching

Seed Matching identifies all sets of 3 target points T' = { t A, t B, t C} which fulfill our matching criteria with the highest ranked 3 motif points, S'= { s 1, s 2, s 3 }. In this stage, we represent the target as a geometric graph with colored edges. There are exactly three unordered pairs of points in S',

and we name them red, blue and green. In the target, if any pair of target points t i, t j fulfills our

first two criteria with either red, blue or green, we draw a corresponding red blue or green edge

between t i, t j in the target. Once we have processed all pairs of target points, we find all three-

colored triangles in T. These are the Seed Matches, a set of three-point correlations to S' that we

sort by lRMSD and pass to Augmentation.

Augmentation

Augmentation is an application of depth first search that begins with the list of seed matches.

Assuming that there are more than four motif points, we must find correspondences for the

unmatched motif points within the target. Interpret the list of seed matches as a stack of partially

complete matches. Pop off the first match, and considering the lRMSD alignment of this match,

plot the position p of the next unmatched motif point s i relative to the aligned orientation of the

motif. In the spherical region V around p, identify all target points t i, compatible with s i, inside V.

Now compute the lRMSD alignment of all correlated points, include the new correlation ( s i, t i). If

the new alignment satisfies our first two criteria and there are no more unmatched motif points,

put this match into a heap which maintains the match with smallest lRMSD. If there are more

unmatched motif points, put this partial match back onto the stack. Continue to test correlations in

this manner, until V contains no more target points that satisfy our criteria. Then, return to the

stack, and begin again by popping off the first match on the stack, repeating this process until the

stack is empty.

Filtering Matches

Structural similarity is important to functional annotation only if a strong correlation exists

between identifiably significant structural similarity and functional similarity. However, the

existence of a match alone does not guarantee functional similarity. lRMSD can be a

differentiating factor. If matches of homologous proteins represent statistically significant

structural similarity over what is expected by random chance, we could differentiate on lRMSD,

as long as we can evaluate the statistical significance of the lRMSD of a match.

BLAST [link] first calculated the statistical significance of sequence matches with a

combinatorial model of the space of similar sequences. Determining the statistical significance of

structural matches has also been attempted. Modeling was applied for the PINTS database [link]

to estimate the probability of a structural match given a particular LRMSD. An artificial

distribution was parameterized by motif size and amino acid composition in order to fit a given

data set, and the p-value is calculated relative to that distribution. Another approach was taken in

the algorithm JESS [link], using comparative analysis to generate a significance score relative to a specific population of known motifs. Both methods have some disadvantages. The artificial

models of PINTS are not parameterized by the geometry of motifs, and, all else equal, produce

identical distributions for motifs of different geometry. JESS, on the other hand, is dependent on a

set of known motifs; should this set change, all significance scores would have to be revised.

Local structural alignment methods operate on the assumption that local structural and chemical

index-86_1.jpg

similarity implies functional similarity. A statistical model has been developed that can be used to

identify the degree of similarity sufficient to follow this implication. Given a match m with

lRMSD r between motif S and target T, exactly one of two hypotheses must hold:

H 0: S and T are structurally dissimilar

H A: S and T are structurally similar

The proposed statistical model implements this measurement by computing a motif profile. Motif

profiles are frequency distributions (see Figure below) of match lRMSDs between S and the entire

Protein Data Bank (PDB) [site], which is essentially a large set of functionally unrelated proteins.

A motif profile is essentially a histogram, where the vertical axis measures the number of matches

at each specific lRMSD, which is measured on the horizontal axis. Motif profiles provide very

complete information about matches typical of H 0. If we suspect that a match m has lRMSD r

indicative of functional similarity, we can use the motif profile to determine the probability p of

observing another match m' with smaller lRMSD by computing the volume under the curve to the

left of r, relative to the entire volume. The probability p, often referred to as the p-value, is the

measure of statistical significance. With a standard of statistical significance alpha, if p is less

than alpha, then we say that the probability of observing a match m' with lRMSD r' less than r is

so low that we reject the null hypothesis in favor of the alternative hypothesis. This process means

that if a match m with lRMSD r has a p-value exactly equal to alpha, then this lRMSD is the

highest lRMSD for which our statistical model predicts that m identifies structural and chemical

similarity sufficient to imply functional similarity. Matches with this property are considered to

be statistically significant.

Figure 68.

The use of motif profiles.

Designing Effective Motifs

Given a motif representing a specific biochemical function, the task of any algorithm seeking to

predict protein functions is to identify other proteins with similar biological functions. This is

immediately affected by how well the motif represents the biological function - protein structures

are flexible, dynamic objects, and any computational representation of these objects is likely to be

inaccurate in some manner. In particular, since it is the computational representation that is

compared, the fidelity of representation is essential for the effectiveness of the structural

comparison approach.

An effective motif must simultaneously fulfill two criteria:

The motif must be sensitive

The motif must be specific

Motifs must maintain geometric and chemical similarity, in respect to the characteristics

compared by the comparison algorithm, to functional homologs. This ensures that algorithms

searching for the motif identify proteins with similar function.

Sensitivity is measured as the proportion of acceptable matches to functional homologs relative to

the total number of functional homologs. Specific motifs must maintain geometric and chemical

dissimilarity, in respect to the characteristics compared by the comparison algorithm, to

functionally unrelated proteins. This ensures that algorithms searching for the motif accidentally

identify less matches to functionally unrelated proteins. Specificity is measured as the proportion

of rejected matches to functionally unrelated proteins, relative to all functionally unrelated

proteins. The ideally effective motif is 100% sensitive and 100% specific.

Under most conditions, effective motifs have been designed by hand. However, a few recently

studies exist which examine the relationship between elements of the motif and the sensitivity and

specificity of the motif. These studies resulted in the design of MultiBind [link], an algorithm which identifies conserved binding patterns among functionally homologous active sites, in order

to generate motifs which represent only conserved binding patterns. Another study of motif design

produced Geometric Sieving [link], an algorithm for maximizing the Geometric Uniqueness of motifs from functionally unrelated proteins.

Putting together effective motif design, an efficient matching algorithm, and a statistical scoring

of the results can lead to an automated functional annotation pipeline for proteins.

Recommended Reading:

Chen, B.Y. (2005). Algorithms for Structural Comparison and Statistical Analysis of 3D

Protein Motifs. Proceedings of the Pacific Symposium on Biocomputing 2005, pp. 334-345.

Chen, B.Y. (2006). Geometric Sieving: Automated Distributed Optimization of 3D Motifs

for Protein Function Prediction. Proceedings of the Tenth Annual Conference on Research in

Computational Molecular Biology (RECOMB 2006), pp. 500-512.

Yao, H., Kristensen, D., Mihalek, I., Sowa, M., Shaw, Ch., Kimmel, M., Kavraki, L. E., and

Yao, H., Kristensen, D., Mihalek, I., Sowa, M., Shaw, Ch., Kimmel, M., Kavraki, L. E., and

Lichtarge, O. (2003). An Accurate, Sensitive, and Scalable Method to Identify Functional

Sites in Protein Structures. Journal of Molecular Biology, 325:255-261.

David M. Kristensen, Brian Y. Chen, Viacheslav Y. Fofanov, R. Matthew Ward, A. Martin