

Foresight Update 6

page 3

A publication of the Foresight Institute



 

Protein Engineering: An introduction to a newly recognized field

Adapted from: Protein Engineering Literature Scan #3
by James B. Lewis

With links added to 1997 WWW sources of information

Protein engineering is a new field; separate meetings, journals, and books devoted to the topic have only appeared during the last two to three years. One of the earliest uses of the term was in an article contained in a special issue of Science (February 11, 1983) that was devoted to the new field of biotechnology: "Protein Engineering" by Kevin M. Ulmer (Science 219:666-671). This article is an overview that describes how several advances in different fields have made it possible to attempt to modify many properties of proteins by combining information on three-dimensional structure and classical protein chemistry with new methods of genetic engineering and molecular graphics. Ulmer concludes that this developing technology will be used to further academic understanding of how proteins work, and to produce altered proteins for improved commercial products. He also envisions

"...paving the way for designing novel enzymes from first principles. Protein engineering thus represents the first major step toward a more general capability for molecular engineering which would allow us to structure matter atom by atom."
--Kevin M. Ulmer

The basic set of ideas works something like this. You use computer graphics to display the three-dimensional structure that has been experimentally determined for the protein you are studying or, failing that, for a protein sufficiently closely related to serve as a model. You then combine calculation and guesswork to decide what modifications in structure might bring about a desired change in the properties of the protein.

You then produce this new protein in one of two ways. If it is a very small protein, often called a peptide, you can synthesize it chemically using the Merrifield solid-phase technique. This has the additional advantage that you can use chemical building blocks in addition to the ones used in biological systems. If the protein is not very small, you can use the new techniques of biotechnology to make a gene that will encode the altered protein. One part of this technology employs solid-phase oligodeoxynucleotide synthesis, a variant on the Merrifield technique, to synthesize a small piece of DNA encoding the altered portion of the protein. This small piece is then used to mutate the natural gene into the gene for the desired protein using the gene-splicing methods of recombinant DNA. The altered gene is introduced into a vector-host system of the type used in biotechnology to produce abundant quantities of proteins that are not "well-expressed" (produced) in nature. This protein can then be purified and studied to determine how well your predictions worked. This cycle of "guess-experiment-guess again" was not possible prior to the advent of these techniques.
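To make the gene-alteration step concrete, here is a toy Python sketch (not from the article) of swapping a new codon into the DNA of a natural gene and reading off a short surrounding oligonucleotide that encodes the altered region. The gene, position, and codon are invented for illustration; real oligonucleotide design must also consider factors such as melting temperature and restriction sites.

def mutate_codon(gene, residue_index, new_codon, flank=9):
    """Return (mutated_gene, mutagenic_oligo) for a 0-based residue index."""
    start = residue_index * 3                      # first base of the target codon
    mutated = gene[:start] + new_codon + gene[start + 3:]
    oligo = mutated[max(0, start - flank):start + 3 + flank]
    return mutated, oligo

gene = "ATGGAATTCGGCAAATGGTAA"                     # toy gene encoding M-E-F-G-K-W
mutant, oligo = mutate_codon(gene, 2, "TAC")       # change residue 3 from Phe to Tyr
print(mutant)   # ATGGAATACGGCAAATGGTAA
print(oligo)    # the short piece of DNA one would synthesize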



A major part of the intellectual effort of protein engineering is devoted to solving the "protein folding" problem. Enzymology, a branch of classical biochemistry, leads to the idea that a protein's function follows from its three-dimensional, or "tertiary," structure, and further, that its three-dimensional structure follows from the linear sequence of amino acid residues that comprises its "primary" structure. The linear sequence can be deduced from the DNA sequence of the gene that encodes the protein. The relationship between DNA and protein sequences is the genetic code, which was cracked during the late 1950's and early 1960's. Trying to understand how the primary structure determines the tertiary structure of the protein is, however, very much an unsolved problem at this time. It has often been referred to as "the second half of the genetic code." It is unclear just how much of the folding problem will have to be solved to permit the design of novel proteins. Eric Drexler (in Engines of Creation and a 1981 PNAS article) has pointed out that natural proteins may embody obscure sequence-structure relationships for evolutionary reasons, and that it may thus be possible to develop a much simpler sequence-structure code for designed proteins. Whatever the size of the problem, solving some version of the protein folding problem will be a key step to using protein engineering to design first generation assemblers.
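To make the "first half" of the genetic code concrete, here is a toy Python sketch (not from the article) that reads a DNA coding sequence three bases at a time and deduces the protein's primary structure. Only a handful of codons from the standard table are included, and the sequence is invented for illustration; the "second half"--predicting how the resulting chain of residues folds--has no such simple lookup table.

GENETIC_CODE = {
    "ATG": "M",  # methionine (start)
    "GAA": "E",  # glutamate
    "TTC": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop
}

def translate(dna):
    """Translate a coding sequence into one-letter amino acid codes."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = GENETIC_CODE[dna[i:i + 3]]
        if residue == "*":          # a stop codon ends the chain
            break
        protein.append(residue)
    return "".join(protein)

print(translate("ATGGAATTCGGCAAATGGTAA"))   # -> MEFGKW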


Webmaster's Note: An excellent introduction to protein structure is available on the Internet. The following links are to the primary site at Birkbeck College in England, and to the American mirror at Brookhaven National Laboratory:

A number of more recent overviews of protein engineering are available. The Preface by Dale L. Oxender and C. Fred Fox, and the Introduction by Carl O. Pabo to the book Protein Engineering (Oxender and Fox, ed., Alan R. Liss, Inc., New York) are brief statements of the origins of the field. A somewhat more technical introduction is afforded by the article "Protein Engineering" by R. J. Leatherbarrow and A. R. Fersht (1986) in the inaugural issue of Protein Engineering 1:7-16. The latter article considers various techniques used to produce desired mutations in the genes encoding proteins, discusses several proteins that are being intensively studied using these techniques, and summarizes the results of some of these studies. An overview specifically targeted to using chemical synthesis for small proteins instead of genetic engineering techniques is "Protein engineering by Chemical Means?" by R. E. Offord (1987), Protein Engineering 1:151-157.

To appreciate what is involved in protein engineering requires an acquaintance with a number of fields. These include classical biochemistry, especially of proteins; protein structure determination, including new computer-graphic methods to represent protein structure and to calculate the effects of different perturbations of the sequence upon the structure; solid-phase methods for the chemical synthesis of proteins; and the new methods of genetic engineering, including both basic molecular biology and the techniques of biotechnology. It is the goal of this primer to provide a summary of some of this material and a guide to further in-depth study.



 

Modeling Molecules

"Dynamic market emerging for molecular modeling" by Mark Ratner. January 1989. BioTechnology 7:43-44,47.


The following is a summary of the above article:

Hardware and software to model, analyze, and simulate novel molecular structures are still at a very formative stage. The modeling process begins with the coordinates of the 3-dimensional structure (solved by X-ray crystallography), obtained for proteins from the Brookhaven National Laboratory's Protein Data Bank (PDB). Only about 300 protein structures have been determined, so in many cases modeling must be attempted using a known structure whose sequence is similar to that of the unknown structure.
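To show what those coordinate files contain, here is a simplified Python reader (not any vendor's software) that pulls atom names and x, y, z positions out of the fixed-column ATOM records of a PDB-format file. The file name used in the example, "1mbn.pdb", is just a placeholder for a locally saved PDB entry.

def read_atom_coordinates(pdb_path):
    """Return a list of (atom_name, residue_name, residue_number, x, y, z)."""
    atoms = []
    with open(pdb_path) as handle:
        for line in handle:
            if line.startswith(("ATOM", "HETATM")):
                atoms.append((
                    line[12:16].strip(),    # atom name, e.g. "CA"
                    line[17:20].strip(),    # residue name, e.g. "GLY"
                    int(line[22:26]),       # residue sequence number
                    float(line[30:38]),     # x coordinate (Angstroms)
                    float(line[38:46]),     # y coordinate
                    float(line[46:54]),     # z coordinate
                ))
    return atoms

# Example usage, assuming a PDB file has been downloaded locally:
# atoms = read_atom_coordinates("1mbn.pdb")
# print(len(atoms), "atoms; first record:", atoms[0])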

The molecular graphics program then converts the 3-dimensional coordinates into a picture of the molecule, which can be manipulated on the computer monitor to see specific bonds and other features of the structure, just as a physical model could be handled to observe various features from all angles.


Webmaster's Note: One molecular graphics visualization tool available over the Internet is RasMol, developed by Roger Sayle. RasMol is available for UNIX, VMS, Macintosh, and Microsoft Windows (OS/2 and Windows NT), and can be obtained by ftp. Excellent sources of information about RasMol are:

The next step is the use of molecular mechanics programs (based on classical Newtonian mechanics) that calculate the forces among the various atoms and minimize the overall energy of the conformation in order to calculate the preferred actual structure of the protein. Advanced programs use molecular dynamics to refine the structural calculations. The most effective programs need a supercomputer for these calculations: consideration of the nonbonded atomic interactions can require 125 million floating-point operations for a single energy calculation. Ab initio calculations to solve the (quantum mechanical) Schrödinger equation can only deal with 10-20 atoms per molecule because of computational limitations.
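The flavor of such molecular mechanics calculations can be conveyed with a toy Python sketch: a single Lennard-Jones nonbonded term evaluated over all atom pairs, plus a crude steepest-descent minimization that nudges each atom along the force acting on it. Real force fields add bonded, angle, torsion, and electrostatic terms and handle thousands of atoms; the parameters and coordinates below are illustrative only.

import numpy as np

EPSILON, SIGMA = 0.2, 3.4   # illustrative well depth and contact distance

def lj_energy_and_forces(coords):
    """Total Lennard-Jones energy and per-atom forces for an N x 3 array."""
    n = len(coords)
    energy = 0.0
    forces = np.zeros_like(coords)
    for i in range(n):
        for j in range(i + 1, n):
            rij = coords[i] - coords[j]
            r = np.linalg.norm(rij)
            sr6 = (SIGMA / r) ** 6
            energy += 4.0 * EPSILON * (sr6 ** 2 - sr6)
            dE_dr = 4.0 * EPSILON * (6.0 * sr6 - 12.0 * sr6 ** 2) / r
            f = -dE_dr * rij / r        # force on atom i; atom j feels the opposite
            forces[i] += f
            forces[j] -= f
    return energy, forces

def minimize(coords, step=0.01, iterations=200):
    """Crude steepest descent: repeatedly move each atom along its force."""
    for _ in range(iterations):
        _, forces = lj_energy_and_forces(coords)
        coords = coords + step * forces
    return coords, lj_energy_and_forces(coords)[0]

atoms = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.5, 0.0]])
relaxed, final_energy = minimize(atoms)
print("energy after minimization:", final_energy)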

The market for molecular modeling packages appears to be in flux, with too many packages and too few users at the moment. Software suppliers presently include Polygen (Waltham, MA), Biosym Technologies (San Diego, CA), and Tripos Associates (St. Louis, MO). The Tripos package will soon include techniques for modeling by homology--comparing structural motifs that occur frequently in nature. This "knowledge-based" approach includes an analysis of the vast protein sequence (not 3-D structure) database to find a useful set of related proteins whose structures can be compared to model the unknown structure. Manufacturers of hardware include Silicon Graphics (Mountain View, CA) and Evans & Sutherland (Salt Lake City, UT).

Molecular modeling may "provide the forum for chemists, physicists, computer scientists, genetic engineers, and protein purifiers to come together."



 

Molecular Dynamics

In a sense, molecular dynamics is the most fundamental aspect of the study of proteins (or any other molecules) from the perspective of nanotechnology. This area deals with how each of the constituent atoms of a molecule, large or small, moves, and thus provides a time-evolving structural basis for considering the properties of the molecule. If we wish to make molecular machines, we have to understand how the parts move so that we can make the machines function appropriately. I (JBL, 7/17/88) have little knowledge of the subject, so I give here a few references as places to get started.


Webmaster's Note: One excellent introduction to molecular dynamics has been provided on the WWW by Biosym/MSI at:
http://lmb.niehs.nih.gov/LMB/docs/biosym/950/discover/General/Dynamics/Intro_Dyn.html
This information is part of the comprehensive online documentation of their Discover program, located at:
http://lmb.niehs.nih.gov/LMB/docs/biosym/950/discover/Disco_Home.html
A database of known motions in proteins is available at:
http://hyper.stanford.edu/~mbg/ftp/ProtMotDB/ProtMotDB.all.html


"The Dynamics of Proteins" by Martin Karplus and J. Andrew McCammon. April 1986. Scientific American 254:42-51.

The following is a summary of the above article:

"The molecules essential to life are never at rest; they would be unable to function if they were rigid. The internal motions that underlie their workings are best explored in computer simulations." This introduction begins by pointing out the limitations of trying to understand in detail how proteins function by knowing only the static structure of the crystal, determined by X-ray crystallography (or occasionally in solution by NMR), which represents only the time-averaged structure of the protein.

Better understanding of how proteins function is provided by theoretical studies, based on experimental structural information, that lead to computer simulations of how the protein actually moves. "The most direct approach to protein dynamics is to treat each atom in the protein as a particle responding to forces in the way prescribed by Newtonian physics, in accord with Newton's equations of motion." Remember that average-sized proteins contain 5000 or more atoms. Chemical bonds can be treated like springs, and many weaker forces between non-bonded atoms must also be considered, so that the force on each atom depends upon the positions of all the other atoms in the protein. The X-ray crystal structure gives the information needed to begin the simulation of atomic movements. However, because the X-ray structure is an average structure, it is a very unrealistic picture of the state of any particular molecule at any particular time, and a complex set of preliminary calculations must be performed to relax it into an "equilibrated" protein.


If the atoms in myoglobin were fixed in the positions found in
the X-ray-crystallographic structure, myoglobin would be useless

This equilibrated structure is used as the starting point for molecular dynamics simulations of how the molecule will behave. These calculations use time steps on the order of a femtosecond. The best simulations follow the protein for as long as a nanosecond (a million such steps), requiring hundreds of hours of supercomputer time. The combination of many small local motions of individual amino acid residues and their constituent atoms can produce more global displacements of different parts of the protein. What sorts of movements are important over what time scales is discussed in general terms. The particular example of myoglobin, the oxygen-binding protein in muscle, is discussed. The striking point is made that "If the atoms in myoglobin were fixed in the positions found in the X-ray-crystallographic structure, myoglobin would be useless: the time required for an oxygen molecule to bind to the heme group or to get out again when needed would be much longer than a whale's lifetime" (or the lifetime of the universe, for that matter). Simulations showed instead how fluctuations in the positions of specific atoms allowed the oxygen to diffuse through the structure in a reasonable amount of time.
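The simulation loop itself can be sketched in a few lines of Python. The velocity Verlet integrator below advances Newton's equations one femtosecond at a time; the function passed in as "forces" stands for any force-field routine (for example, a Lennard-Jones term like the sketch given earlier), and the masses, units, and step counts are illustrative rather than taken from the article.

import numpy as np

DT = 1.0   # one femtosecond, in whatever unit system the force field uses

def velocity_verlet(positions, velocities, masses, forces, n_steps):
    """Advance positions and velocities n_steps with the velocity Verlet scheme."""
    f = forces(positions)
    for _ in range(n_steps):
        velocities = velocities + 0.5 * DT * f / masses[:, None]   # half kick
        positions = positions + DT * velocities                    # drift
        f = forces(positions)                                      # new forces
        velocities = velocities + 0.5 * DT * f / masses[:, None]   # half kick
    return positions, velocities

# A nanosecond of simulated time is a million such femtosecond steps, e.g.
# (x0, v0, m, and my_force_field are placeholders for real numpy inputs):
# positions, velocities = velocity_verlet(x0, v0, m, my_force_field, 1000000)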

Also discussed is how critical parts of enzymatic reactions usually occur over millisecond time scales, a million times as long as can be handled with present computers. Specialized approximations can sometimes be used and are discussed for a few cases. These illustrate "the important role of small, high frequency fluctuations in facilitating some larger and more collective motions of proteins." Karplus predicts that eventually these techniques will lead to the ability to calculate the rates of enzymatic reactions and the binding of small molecules to larger ones, thus providing better ways to modify proteins for industrial purposes.

"Molecular dynamics simulations of proteins" by Martin Karplus. October 1987. Physics Today pp. 68-72.

The following is a summary of the above article:

This review is a bit more technical and focuses more on the interplay of calculation and experiment in providing meaningful results. For example, the role of NMR in studying internal motions of proteins is discussed. Conversely, the application of molecular dynamics methods to NMR data is quite useful in deriving three-dimensional protein structures from the data. This process is referred to as "restrained dynamics." The take-home lesson is the same as for the above review, with the list of expected future practical developments expanded to include the design of inhibitors to cure diseases.

A real understanding of molecular dynamics, of course, cannot be gained from brief review articles. A good textbook is probably Dynamics of Proteins and Nucleic Acids by J. Andrew McCammon and Stephen C. Harvey, Cambridge University Press, New York, 1987; xii, 234 pp., illus., $39.50.

I say probably because I haven't seen it yet (let alone read it), but I saw two very favorable reviews: one (titled "Good Vibrations") by B. Robson in BioEssays, Volume 8, No. 2, p. 93 (February 1988)--admittedly a periodical from the same publisher as the book--and the other (titled "Biomolecular Processes," in the more prosaic Science fashion) by R. M. Levy in Science, 8 July 1988, 241:234-235.

Both reviews agree that it is a very well-organized book and an excellent place to begin to try to understand the field. Despite starting from basics, the book is said to provide the background needed to read the current literature of the field. The book is about the time-dependent motions of these vital molecules, ranging from small-amplitude atomic vibrations that occur in 0.1 picoseconds to large-scale allosteric transitions that take milliseconds to several seconds. The theoretical and computational methods are clearly described, with most emphasis on the nanosecond scale, since computational limitations make detailed calculations on longer scales impractical; the slower processes are discussed in general terms. I take it from what the reviewers say that molecular dynamics approaches can now attempt to predict the three-dimensional structure of small peptides. But since a large protein takes on the order of a second to fold, and current simulations are limited to roughly the nanosecond scale, we have a factor of a billion to go before the three-dimensional structures of large proteins can be predicted this way.



 

Knowledge-Based Structure Prediction

"Knowledge-based prediction of protein structures and the design of novel molecules" by T. L. Blundell, B. L. Sibanda, M. J. E. Sternberg, J. M. Thornton. 1987. Nature 326:347-352.

Abstract: "Prediction of the tertiary structures of proteins may be carried out using a knowledge-based approach. This depends on identification of analogies in secondary structures, motifs, domains or ligand interactions between a protein to be modeled and those of known three-dimensional structures. Such techniques are of value in prediction of receptor structures to aid in the design of drugs, herbicides or pesticides, antigens in vaccine design, and novel molecules in protein engineering."

The following is a summary of the above article:

After discussing the expected utility of structural knowledge in applications from drug design to biological microchips, and noting that sequence information has accumulated much more rapidly than 3-D structural information, the paper discusses the various steps involved in predicting 3-D structure from sequence:

Sequence Alignment

The first step is to compare the sequence of the protein whose 3-D structure is to be predicted with the known sequences of other proteins available in the sequence database. Several algorithms are available to do this. If the new sequence is >25% similar to a sequence in the database, the match is easily distinguished above the background of randomized sequences. It is stated that an alignment score >6 standard deviations above random alignment will give reliable prediction of the secondary structures of most residues.
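A toy Python version of this alignment step is sketched below: a Needleman-Wunsch global alignment score (simple identity scoring with a gap penalty rather than a real substitution matrix), expressed as the number of standard deviations by which the true score exceeds the scores obtained against shuffled sequences. The scoring values and the two protein-like sequences are invented for illustration.

import random
import statistics

MATCH, MISMATCH, GAP = 2, -1, -2   # illustrative scoring scheme

def global_alignment_score(a, b):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            curr[j] = max(diag, prev[j] + GAP, curr[j - 1] + GAP)
        prev = curr
    return prev[-1]

def alignment_z_score(query, subject, n_shuffles=200):
    """Standard deviations by which the real score exceeds shuffled scores."""
    real = global_alignment_score(query, subject)
    shuffled = []
    for _ in range(n_shuffles):
        s = list(subject)
        random.shuffle(s)
        shuffled.append(global_alignment_score(query, "".join(s)))
    return (real - statistics.mean(shuffled)) / statistics.stdev(shuffled)

print(alignment_z_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                        "MKTAYIAKQRNISFVKSHFARQLEERLGLVEVQ"))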

The Tertiary Structures of Homologous Proteins

Comparisons of homologous proteins whose 3-D structures are known show that structure is conserved in evolution more strongly than is the primary protein sequence. Changes are often concentrated in the surface loops of the protein. This observation provides the rationale for using the known structure of a homologous protein to predict the unknown structure.

Modeling by Homology

The aligned sequences are used to predict where one should create insertions, deletions, and replacements in the known structure. This is done using computer graphics; a widely used program is called FRODO. Initial models are then refined by energy minimization programs on the computer to avoid steric clashes. References are given to research that has used this approach.
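The steric clashes that such refinement must relieve are easy to state in code. The toy Python sketch below flags any pair of non-neighboring atoms closer than a hard-sphere cutoff; the cutoff and coordinates are invented for illustration, and real refinement programs use full energy functions rather than a single distance test.

import numpy as np

CLASH_CUTOFF = 2.4   # Angstroms; illustrative hard-sphere limit

def steric_clashes(coords, min_separation=2):
    """Return index pairs of atoms closer than the cutoff.

    coords is an N x 3 array; atoms closer than min_separation in the list
    are treated as bonded neighbors and ignored.
    """
    clashes = []
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if np.linalg.norm(coords[i] - coords[j]) < CLASH_CUTOFF:
                clashes.append((i, j))
    return clashes

# Example: the last atom has been modeled almost on top of the first one.
model = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [0.5, 0.3, 0.0]])
print(steric_clashes(model))   # -> [(0, 3)]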

Modeling Using Multiple Structures

Since only about 100 out of the 300 3-D structures in the Brookhaven databank are nonhomologous, there is often more than one structure available to use as a basis for modeling. Several approaches for simultaneously using different model structures to predict the unknown structure are discussed.

Insertions & Deletions in Loop Regions

Loops are the most difficult regions to construct because the majority of significant differences occur in these regions. Databases and examples for loop construction are discussed in some detail. Particular attention is given to beta-hairpin loops (loops between two adjacent antiparallel beta strands). Ab initio calculations using molecular dynamics are recommended when no structure sufficiently similar for use in modeling can be found.

Energy Minimization and Molecular Dynamics

"Where the proteins have sequence homology of 50% or more, the models predicted by the methods described here will be probably correct to better than 1 Angstrom although individual side chains may be more in error." Some improvement in accuracy can be had by using such energy minimization programs as AMBER or CHARMM. Since energy minimization as it is now done finds only a local minimum, it is only expected to be useful if the errors in the starting structure are less than an Angstrom [Note: This seems a quite stringent requirement to meet-JBL].

How Correct are the Models?

Several cases where modeled proteins have been subsequently studied by X-ray are discussed, with the results shown to have been mixed. It is suggested that the distributions of the hydrophobic side-chains and the nature of the solvent-accessible surfaces are the most sensitive indicators of the reasonableness of the model. [It should be noted that this whole modeling procedure, although useful in some situations, is still very inexact and requires a great deal of experience and knowledge to interpret.-JBL]

Future Developments

Two challenges are discussed: (1) To extend the method to cases where there is no obvious sequence homology, but there is reason to suspect that the structure is a member of a known family of structural motifs, and (2) To design novel molecules.

"Knowledge-based prediction of protein structure" by J. M. Thornton of Birkbeck College, England; from the Miami meeting.

Dr. Thornton notes that the delicate balance between properly folded and alternative structures of a protein has so far been impossible to predict from energy minimization, so people have tried to use empirical predictions based on the 300 or so protein structures that have been experimentally determined. These have been of limited value. Even simple predictions of secondary structure only, rather than complete tertiary structures, are only about 60% accurate. By considering in more detail the characteristics of a particular type of secondary structure (beta-beta hairpin turns), certain sequence features associated with specific varieties of this structure were identified that improved prediction a bit (to over 70%). This is progress, but the empirical approach to protein sequence-structure relationships has a long way to go before we can use it to help design first generation assemblers. A general review of this process of knowledge-based prediction of protein structure, i.e., modeling the structure of an unknown protein based on the known structure of a protein of similar sequence, was published last year: "Knowledge-based prediction of protein structures and the design of novel molecules" by T. L. Blundell, B. L. Sibanda, M. J. E. Sternberg, J. M. Thornton. 1987. Nature 326:347-352. [NOTE: This paper is abstracted above.]

"Protein structure: the shape of things to come?"--A "News and Views" editorial by Janet M. Thornton in the 1 September 1988 issue of Nature 335:10-11.

The following is a summary of the above article:

She observes that attempts to predict structure from sequence have been shifting from calculations using energy functions to the "more pragmatic" structure recognition by pattern matching, as exemplified by a paper by Rooman and Wodak in the same issue (see below). "The good news is that short sequence patterns which reliably define secondary structure do exist. The bad news is that the prediction accuracy ... is still only about 60 per cent." Apparently the main problem is the relative scarcity of structural data. The 60% accuracy is especially discouraging since the original attempts by Chou and Fasman and by Garnier, both in 1978, achieved this level of accuracy using only the 20 protein structures that were known then (vs. >300 today), and used only the helix- or sheet-forming properties of individual residues rather than those of short sequences. Results are quoted from several years ago showing that only 20% of identical pentapeptides in unrelated proteins of known structure adopt the same secondary structure. The Rooman and Wodak paper performs an automated and systematic search of the structure database and identifies some peptides that are very predictive, although most peptide sequences are not.

The reason that most patterns are not predictive is apparently that most occur only a few times in the database, so that patterns cannot be adequately recognized. It is suggested that a sequence pattern should occur about 15 times for accurate prediction, while most 3-residue sequences occur fewer than 3 times in the current database. Rooman and Wodak speculate that a database of 1500 structures will be needed for adequate prediction of secondary structure, which, optimistically, could take 20 years to produce. Thornton suggests that prediction might be improved by (1) incorporating recent sequence interpretation techniques designed to recognize very distantly related proteins, so that the known structure of one could be used to model the other, and (2) using what is known in some cases about elements of super-secondary structure--motifs of clustered secondary structure elements associated with particular classes of proteins. She also mentions a recent article in which neural networks were trained to recognize secondary structure from sequences, achieving predictions that were 64% accurate.
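The data-sparsity problem Rooman and Wodak describe can be illustrated with a toy Python sketch: collect every pentapeptide from a miniature "database" of sequences with known secondary structure (H = helix, E = sheet, C = coil), record the structure seen at its central residue, and count how many patterns occur only once. The two database entries are invented solely so the code runs; a real study would draw on the few hundred solved structures available at the time.

from collections import defaultdict, Counter

database = [
    ("MKTAYIAKQRQISFVKSHFSRQ", "CCHHHHHHHHHCCEEEEECCCC"),
    ("GSHMKTAYIAKQRNISFVK",    "CCCCCHHHHHHHHCCEEEE"),
]

def collect_patterns(db):
    """Map each pentapeptide to the secondary structures seen at its center."""
    patterns = defaultdict(Counter)
    for sequence, structure in db:
        for i in range(len(sequence) - 4):
            patterns[sequence[i:i + 5]][structure[i + 2]] += 1
    return patterns

patterns = collect_patterns(database)
occurrence_counts = Counter(sum(c.values()) for c in patterns.values())
print("pentapeptides seen only once:", occurrence_counts[1], "of", len(patterns))

def predict_center(pentapeptide):
    """Majority vote over observed structures; None if the pattern is unseen."""
    seen = patterns.get(pentapeptide)
    return seen.most_common(1)[0][0] if seen else None

print(predict_center("TAYIA"))   # central residue of a helical stretch -> H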


Webmaster's Note: For more current information on Dr. Thornton's work, see:
http://www.biochem.ucl.ac.uk/bsm/biocomp/index.html

Jim Lewis is a molecular biologist at Oncogen in Seattle. He is also the leader of the PATH HyperCard Project, a project of the Seattle Nanotechnology Study Group, which is working on a HyperCard stack on nanotechnology. The full text of Dr. Lewis's summary from which this adaptation was made is available from the Foresight Institute; send a stamped, self-addressed envelope with 65 cents postage.




From Foresight Update 6, originally published 1 August 1989.


Foresight thanks Dave Kilbridge for converting Update 6 to html for this web page.



 
