- 1.1 Introduction
- 1.2 Techniques in Biomolecular Simulations
- 1.2.1 Molecular Mechanics and Force Fields
- 1.2.2 Basic Simulation Techniques
- 1.2.3 Basic Data Analysis
- 1.2.4 Software
- 1.2.5 Examples
- 1.3 Protein Structure Prediction
- 1.3.1 Sequence Alignment and Secondary Structure Prediction
- 1.3.2 Comparative Modelling Approaches
- 1.3.3 Function Prediction
- 1.3.4 Analysing the Quality of the Modelled Structure
- 1.3.5 Software and Web Based Servers
- 1.4 Computer-based Drug Design
- 1.4.1 Pre-requisites for SBDD—Sampling Algorithms and Scoring Functions
- 1.4.2 Structure Based Drug Design (SBDD)
- 1.4.3 Ligand Based Drug Design (LBDD)
- 1.4.4 Pharmacophores
- 1.4.5 Compound Optimisation
- 1.4.6 Software and Web Based Servers
Chapter 1: Computational Chemistry and Molecular Modelling Basics
Published: 25 Oct 2017
Special Collection: 2017 ebook collection. Series: Chemical Biology
S. Genheden, A. Reymer, P. Saenz-Méndez, and L. A. Eriksson, in Computational Tools for Chemical Biology, ed. S. Martín-Santamaría, The Royal Society of Chemistry, 2017, ch. 1, pp. 1-38.
Computational modelling has gained an increasingly important role in biochemical and biomolecular sciences over the past decades. This is related to significant developments in terms of methodology and software, as well as the amazing technological advances in computational hardware, and fruitful connections across different disciplines. Today, we readily screen virtual libraries of several million compounds searching for potential new inhibitors, run simulations of large biomolecular complexes on micro- or even millisecond timescales, or predict protein structures with similar accuracy to high-resolution X-ray crystallography. In this introductory chapter, the basics of biomolecular modelling are outlined, to help set the foundation for the subsequent, more specialised chapters. In order for the chapter to be ‘readable’ to interested researchers and PhD students in the biochemical and biomolecular fields, our aim has been to avoid weighing down the text with too much detailed mathematics, while at the same time providing a sufficient level of theory to give an understanding of what is implied when talking about molecular dynamics simulations, docking or homology modelling.
1.1 Introduction
The use of computers for predicting the structures and properties of biomolecules has closely paralleled computer development since the 1950s, and has been one of the core areas of theoretical or computational chemistry for the past 30 years. Initially, the focus was on force-field based methodologies for studying the structures, dynamics and interactions of biomolecules as such, and the development of accurate models for the main biological solvent, water. With the emergence of accurate quantum chemical techniques suitable for studying (from a quantum chemistry perspective) large systems, density functional theory entered the stage in the 1990s as the key approach for investigating enzymatic mechanisms or properties and reactions of small, but biologically relevant, molecules. The combined use of these tools, in so-called QM/MM and lately QM/MM-MD techniques, enables precise descriptions of biological phenomena and reactions.
With the exponential increase in data to be analysed, obtained through the introduction of automated whole genome and protein sequencing techniques, the field of bioinformatics rapidly emerged in the early 2000s from the pioneering laborious mapping and comparison of protein and gene sequences in molecular biology, via an intense phase, which to a large extent can be viewed as ‘database mining’ and the development of efficient computer based algorithms, into a science of its own, which today has reached a high level of maturity and sophistication. Tools in bioinformatics are nowadays used with great success in structural biology, computational chemistry, genetics, molecular biology, the pharmaceutical industry, pharmacology and more. The aspects of bioinformatics included herein focus on protein structure determination (often referred to as homology modelling), and the tools of database screening and prediction used in drug design.
In this chapter, a brief outline of simulation techniques is given, focusing on the interface between biology and medicinal chemistry; that is, molecular mechanics/molecular dynamics to explore the evolution of a system, homology modelling to determine protein structures, and the use of bioinformatics tools such as molecular docking and pharmacophores in drug design. The aim is to provide a brief introduction to a vast and rapidly growing field. In subsequent chapters, more specialised applications are presented that build upon the foundations given herein. The chapter is in no way intended to be an exhaustive coverage of the entire area of biomolecular simulations, and we have deliberately avoided the inclusion of quantum chemical methods.
The interested reader wishing to dig deeper into the basics of computational modelling is referred to any of the many excellent textbooks available.1–11
1.2 Techniques in Biomolecular Simulations
1.2.1 Molecular Mechanics and Force Fields
The palette of computational chemistry methods has become increasingly versatile. At one end, quantum chemistry, where the molecular orbitals and the electrons occupying them are described explicitly, allows us to calculate any physical or chemical property that directly depends on the electron distribution; at the other end, coarse-grained molecular dynamics simulations, where groups of atoms are described as beads interacting according to the laws of Newtonian mechanics, provide valuable insights into the complexity of biological processes at a larger, cellular-level scale. For comparison, a feasible size of a system treated by quantum chemistry calculations, even today, does not exceed a few hundred atoms, whereas the empirical methods, e.g. molecular mechanics (MM), can easily handle several hundred thousand atoms, and in the case of a coarse-grained approach several million atoms. Thus, the latter class of methods has become popular among researchers dealing with bio-macromolecular systems, which exist and function in aqueous solutions or lipid environments. The surrounding environment can take up to 90% of all atoms in a model system, and its presence is crucial for the correct representation of living matter. The giant leap in system size is possible due to the relative simplicity of the MM potential energy functional. The potential energy is calculated by adding up the energy terms that describe interactions between bonded atoms (bonds, angles and torsions) and terms that describe the non-bonded interactions, such as van der Waals and electrostatic interactions (eqn (1.1)).
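In its standard form (the exact functional form varies slightly between force field families), this potential can be written as

$$U = \sum_{\text{bonds}} k_l (l - l_0)^2 + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2 + \sum_{\text{torsions}} \frac{V_n}{2}\bigl[1 + \cos(n\omega - \gamma)\bigr] + \sum_{i<j}\left\{\varepsilon_{ij}\left[\left(\frac{r_{0ij}}{r_{ij}}\right)^{12} - 2\left(\frac{r_{0ij}}{r_{ij}}\right)^{6}\right] + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}\right\} \qquad (1.1)$$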
The bonded terms represent the stretching of bonds (l), bending of valence angles (θ) and rotation of torsional angles (ω); cf. Figure 1.1. Three force constants, kl, kθ and Vn, characterise the energetic cost, relative to the equilibrium value, of changing a bond length (l0), an angle (θ0) or the rotation around a torsion angle. The torsion term represents a periodic rotation of a dihedral angle with periodicity n and phase γ. The non-bonded energy is the sum of repulsion, attraction and electrostatics between non-bonded atoms. The parameter εij is related to the well depth of the Lennard-Jones (LJ) potential, r0ij is the distance at which the LJ potential has its minimum, qi is the partial atomic charge, ε0 is the vacuum permittivity, and rij is the distance between atom i and atom j. The LJ and Coulomb potentials describe the short-range non-bonded interactions. The evaluation of the long-range electrostatic interactions can be difficult, and these were often ignored beyond a specific cut-off distance, introducing approximations in the calculation. With the introduction of the Ewald summation and particle mesh Ewald (PME) methods, long-range electrostatic calculations have become significantly more accurate.12,13
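As an illustration of the non-bonded terms just described, the following minimal Python sketch evaluates the Lennard-Jones and Coulomb contributions for a single atom pair; all parameter values are placeholders rather than values taken from any actual force field.

```python
# Coulomb constant 1/(4*pi*eps0), approximately 138.935 kJ mol^-1 nm e^-2
COULOMB_CONST = 138.935

def lj_energy(r, epsilon, r0):
    """Lennard-Jones energy for one atom pair, written in terms of the well
    depth epsilon and the minimum-energy distance r0 (cf. eqn (1.1))."""
    ratio6 = (r0 / r) ** 6
    return epsilon * (ratio6 ** 2 - 2.0 * ratio6)

def coulomb_energy(r, qi, qj):
    """Coulomb interaction between partial charges qi and qj (in units of e)
    separated by a distance r (in nm), in vacuum."""
    return COULOMB_CONST * qi * qj / r

# Placeholder parameters, not from any real force field
r = 0.35  # nm
print(lj_energy(r, epsilon=0.65, r0=0.34) + coulomb_energy(r, qi=-0.5, qj=0.25))
```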
The simplicity of the potential energy functional form means, on the one hand, fast and easy calculations, and on the other hand that the accuracy of the empirical methods is highly dependent on the set of empirically derived parameters describing atoms and their interactions. These parameters are either derived from ab initio or semi-empirical quantum chemistry calculations on small model systems, or obtained by fitting to experimental data, e.g. X-ray and electron diffraction, NMR and IR spectroscopy. The potential energy functional form together with the empirically derived parameters is referred to as a force field.
There are a number of empirical force field families available, having different degrees of complexity, and oriented towards treating different kinds of systems. The most popular ones designed for biological macromolecules are AMBER,14,15 CHARMM,16 and GROMOS.17 Other force fields, such as OPLS18 and COMPASS19 were originally developed to simulate condensed matter; GAFF20 is a force field developed to simulate organic compounds together with bio-macromolecules; and GLYCAM21 is a force field specifically developed for carbohydrates. Both GAFF and GLYCAM are compatible with AMBER. These force fields vary slightly as to the functional form of the potential energy, mainly in the non-bonded terms, as well as in the values of specific atomic parameters. For more details the reader is referred to a recent review on current advances in empirical force fields for biomolecules.22 For coarse-grained systems, the most commonly used force field is MARTINI,23,24 which has been parameterised for lipids, proteins, carbohydrates and nucleic acids. Recently, a tool was developed to parameterise small molecules automatically.25 The MARTINI model is based on a four-to-one mapping, implying that about four heavy atoms are coarse-grained to a single bead. The beads interact predominantly through Lennard-Jones potentials, together with harmonic bonds and angles.23 Other coarse-grained models commonly used are GROMOS26 and Elba.27
1.2.2 Basic Simulation Techniques
To explore the energy landscape described by the molecular mechanics force field, i.e. to sample molecular conformations, a simulation is required. This is also the route to relate the microscopic movements and positions of the atoms to the macroscopic or thermodynamic quantities that can be measured experimentally.28 There are two major simulation methods to sample biomolecular systems: molecular dynamics (MD) and Monte Carlo (MC) (Figure 1.2).
1.2.2.1 Molecular Dynamics
Molecular dynamics is based on Newton's second law of motion, which relates the force, F, acting upon an atom to its acceleration, a, i.e. the second derivative of the position, q, with respect to time, t (eqn (1.2))
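which in standard notation reads

$$F = ma = m\,\frac{\mathrm{d}^2 q}{\mathrm{d}t^2} \qquad (1.2)$$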
where m is the mass of the atom. In a molecular simulation, time is discretised and the position after a small, finite time, Δt can be computed using a simple Taylor expansion (eqn (1.3))
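which, to second order, reads

$$q(t + \Delta t) = q(t) + \frac{\mathrm{d}q(t)}{\mathrm{d}t}\,\Delta t + \frac{1}{2}\,\frac{\mathrm{d}^2 q(t)}{\mathrm{d}t^2}\,\Delta t^2 + \ldots \qquad (1.3)$$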
and hence it is easy to see that the position q(t), velocity dq(t)/dt and acceleration d2q(t)/dt2 are sufficient for propagation of the molecular system. The acceleration can be computed from eqn (1.2), and the force F is obtained as the negative gradient of the potential energy of the system.29
An MD simulation is set up by assigning initial velocities and positions to all atoms in the system. The velocities are usually randomly assigned, whereas the positions are typically taken from e.g. a crystal structure or idealised geometries. Thereafter, the force acting on each atom is calculated, giving the direction of movement. The atoms are moved in this direction, giving new forces on each atom, and the procedure is then repeated for a number of steps. There are several numerical recipes describing how this integration of the equations of motion is done in practice, e.g. leapfrog, Verlet or velocity-Verlet. They chiefly differ in their numerical stability and in whether, in addition to propagating the positions, they also propagate the velocities.29
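To make the integration loop concrete, the following is a minimal Python sketch of a single velocity-Verlet step for one particle in one dimension; the harmonic force is only a stand-in for the force-field gradient of a real system.

```python
def velocity_verlet_step(q, v, force, mass, dt):
    """Advance position q and velocity v by one time step dt using the
    velocity-Verlet scheme (positions and velocities propagated together)."""
    a = force(q) / mass                  # acceleration at the current position
    q_new = q + v * dt + 0.5 * a * dt**2
    a_new = force(q_new) / mass          # acceleration at the new position
    v_new = v + 0.5 * (a + a_new) * dt
    return q_new, v_new

# Toy example: harmonic oscillator as a stand-in for a force-field force
k = 100.0                                # force constant (arbitrary units)
force = lambda q: -k * q
q, v = 1.0, 0.0
for _ in range(1000):
    q, v = velocity_verlet_step(q, v, force, mass=1.0, dt=0.001)
```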
A major limitation to an efficient sampling with MD is the discrete time step, Δt. It is desirable to choose a longer time step, which would give longer simulations with less computational resources. However, Δt is limited by the fastest motion in the simulated system. For an atomistic system, the fastest motion is the bond vibration between a hydrogen and a carbon atom, which limits Δt to about 1 fs. Therefore, these bonds are typically constrained in the simulations, allowing a 2 fs time step. An alternative is to increase the mass of the hydrogen atoms, effectively slowing down the bond vibration.30 In coarse-grained simulations, a much larger time step is possible, typically between 10 and 40 fs, depending on the model.23,27
A simulation obeying Newton's second law of motion can be shown to sample a thermodynamic ensemble with a constant number of particles, volume and total energy (kinetic+potential). However, experiments are usually performed at constant temperature and either constant pressure or volume. To sample such an ensemble, the equations of motion have to be modified. In the case of constant temperature, a thermostat is required and there are many such algorithms. Common approaches include (1) modifying the velocities (e.g. weak-coupling), (2) introducing fictitious particles in an extended system (e.g. Nosé–Hoover), or (3) introducing friction (e.g. Langevin dynamics). An extensive discussion of different thermostats is however beyond the scope of this introductory chapter and interested readers are referred to a review on the subject.31 Similarly to temperature, constant pressure can be introduced by a barostat that modifies the volume of the simulated system. Common approaches include (1) scaling the box dimensions (e.g. weak-coupling), (2) introducing fictitious particles (e.g. Parrinello–Rahman), or (3) introducing a piston. Some thermostats and barostats are better suited for systems far from equilibrium, whereas others are better for production simulations.
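As a concrete illustration of the weak-coupling idea in point (1), the following Python sketch rescales velocities towards a target temperature using the Berendsen-type scaling factor; the numbers in the example are placeholders.

```python
import math

def weak_coupling_scale(velocities, current_T, target_T, dt, tau):
    """Scale velocities towards the target temperature with a coupling
    time constant tau (weak-coupling, Berendsen-style thermostat)."""
    lam = math.sqrt(1.0 + (dt / tau) * (target_T / current_T - 1.0))
    return [lam * v for v in velocities]

# Placeholder example: instantaneous temperature of 310 K coupled towards 300 K
v_scaled = weak_coupling_scale([1.2, -0.7, 0.4], current_T=310.0,
                               target_T=300.0, dt=0.002, tau=0.1)
```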
1.2.2.2 Monte Carlo Simulations
The other major sampling method, Monte Carlo (MC), is a statistical technique where new conformations are generated by a random walk in phase space, assigning random displacements to the internal degrees of freedom, i.e. bonds, angles and torsions. Naturally, all conformations are not equally likely and therefore the sampling is biased such that conformations are generated with a probability prescribed by the thermodynamic ensemble of interest. By far the most common way to accomplish this is by performing a Metropolis–Hastings test.32,33 In a Metropolis MC simulation, a new conformation is accepted with the probability, p (eqn (1.4))
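which in its standard Metropolis form reads

$$p = \min\left[1,\; \exp\!\left(-\frac{\Delta U}{kT}\right)\right] \qquad (1.4)$$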
where ΔU is the energy difference between the new and old conformation, k is the Boltzmann constant and T the absolute temperature. In practice, the energy of the new and old conformation is compared and if the new conformation is lower in energy it is retained for the next step. If it is higher in energy, the Boltzmann factor exp(−ΔU/kT) is compared to a uniform random number between 0 and 1, and if the Boltzmann factor is higher than the random number the new conformation is kept.
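The acceptance test described above takes only a few lines of Python; this is a generic sketch, not code taken from any particular MC package.

```python
import math
import random

def metropolis_accept(delta_U, kT):
    """Return True if a trial conformation with energy change delta_U
    should be accepted at thermal energy kT (Metropolis criterion)."""
    if delta_U <= 0.0:
        return True                      # downhill moves are always accepted
    return math.exp(-delta_U / kT) > random.random()
```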
An MC simulation consists of a number of moves, each of which is a recipe for how to sample specific degrees of freedom. This can be a simple translation move where the centre of mass of a molecule is displaced, a rotation about a torsion angle, or a complicated, concerted move of several protein backbone atoms. A move is selected randomly, followed by a Metropolis test of the new conformation, and the procedure is repeated for a number of steps.
The Metropolis test illustrated above gives a canonical ensemble, i.e. constant number of particles, volume and temperature. However, it can be modified to allow for a volume change such that constant pressure is simulated. Furthermore, entire molecules, e.g. water, can be inserted and removed during the simulation, leading to a grand canonical ensemble.34 Thus, MC simulations are more versatile than MD simulations, but are heavily dependent on the construction of efficient moves. In addition, since MC only depends on the positions of the atoms, dynamic information is lacking, and MC cannot be used to e.g. estimate transport properties or diffusion constants.
1.2.2.3 Boundary Conditions
An important aspect of both MD and MC is the choice of boundary conditions.28 Typically, a molecule is solvated by a finite water shell, or inserted in a lipid bilayer, leaving some of the atoms facing vacuum. This is not a good physical description of a biological system, and a well-used solution is to extend the system periodically in all three directions to represent a pseudo-infinite system, effectively removing the vacuum. Periodic boundary conditions can be used with various geometries: a cubic box, a rhombic dodecahedron, or a truncated octahedron. The latter scheme is common in simulations of biological macromolecules solvated in water, since it requires the least number of solvent molecules in the system and thus speeds up the computation. Although periodic boundary conditions are the most common choice, there are other solutions in use, e.g. spherical boundaries with addition of restraints.35
1.2.2.4 Enhanced Sampling Techniques
As mentioned earlier, efficient sampling is one of the major limitations of both MD and MC. MD trajectories might not reach all relevant conformations, for example short-lived transient states connected with a biological function. This problem can be addressed by employing enhanced sampling algorithms, such as metadynamics, steered MD or replica exchange MD. Often, the aim of such enhanced samplings is to build a more complete energy surface and/or to obtain free energy profiles or potential of mean force (PMF) data. Some of the more advanced features are covered in subsequent chapters of this book; we also refer the reader to recent reviews on enhanced sampling techniques.36,37
1.2.3 Basic Data Analysis
After a simulation has been completed, it needs to be analysed to extract relevant information about the system of interest. This can be quite challenging and depends very much on the type of system simulated. Here, a few common strategies for analysis will be outlined.
1.2.3.1 Proteins
It is common to assess the equilibration of a protein simulation by computing the root mean squared deviation (RMSD; eqn (1.5)) of the backbone atoms compared to the starting conformation,
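which in its standard form reads

$$\mathrm{RMSD}(t) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl|\mathbf{r}_i(t) - \mathbf{r}_i(0)\bigr|^2} \qquad (1.5)$$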
where N is the number of atoms and ri(t) the position of atom i at time t. Prior to the analysis, the protein needs to be fitted onto the starting structure to remove the overall translation and rotation. Although this is a straightforward analysis and gives an indication of local equilibration, it is a far too simple method to assess the global convergence of the simulation. It is also possible to compute a pair-wise RMSD between each snapshot in a simulation. This can for instance be used to evaluate how efficient the sampling has been, or whether the simulation has become stuck in a local energy well (Figure 1.3). Whereas the RMSD provides an overall estimate for the entire protein, an approach to assess the degree of movement of individual residues is to compute the root mean squared fluctuation (RMSF; eqn (1.6)), which is simply the standard deviation of the position of an atom,
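written in standard form as

$$\mathrm{RMSF}_i = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\bigl|\mathbf{r}_i(t) - \bar{\mathbf{r}}_i\bigr|^2} \qquad (1.6)$$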
where T is the total time of the simulation (or the number of snapshots) and r̄ is the average position. The RMSF can be related to the B-factor used in crystallography through B = (8π2/3)·RMSF2. The analysis can be done on a per-residue basis, where all the atoms of a residue are included in the average, and can for instance be used to assess the movement of sidechains. Alternatively, one can include only Cα atoms in the analysis to assess the backbone movement.
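A minimal numpy sketch of both analyses, assuming the trajectory has already been fitted onto the reference structure and is stored as a frames × atoms × 3 coordinate array (the array and function names are illustrative only):

```python
import numpy as np

def rmsd_per_frame(traj, ref):
    """RMSD of each frame relative to a reference structure (cf. eqn (1.5)).
    traj: (n_frames, n_atoms, 3) fitted coordinates; ref: (n_atoms, 3)."""
    diff = traj - ref
    return np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))

def rmsf_per_atom(traj):
    """RMSF of each atom around its average position (cf. eqn (1.6))."""
    mean_pos = traj.mean(axis=0)
    diff = traj - mean_pos
    return np.sqrt((diff ** 2).sum(axis=2).mean(axis=0))
```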
To assess the overall compactness of a protein, the radius of gyration can be computed from eqn (1.7):
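$$R_\mathrm{g} = \sqrt{\frac{1}{M}\sum_{i} m_i\,\bigl|\mathbf{r}_i - \mathbf{R}_\mathrm{c}\bigr|^2} \qquad (1.7)$$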
where M is the total mass of the protein, mi the mass of atom i and Rc the centre of mass. This is a simple analysis to determine if the protein is compact or extended. In order to obtain a more specific analysis of the protein structure, one can analyse the secondary structure. It is possible to classify each amino acid to determine if it is part of a helix, a beta sheet or a loop. This is useful to monitor during the simulation in order to detect large conformational changes, i.e. loss of secondary structure. More complicated motions can be investigated with a principal component analysis (PCA). This is a statistical technique that reduces the dimensionality of the problem; instead of looking at all 3N coordinates in a simulation, PCA reduces this to a few principal components that describe the major movements. The principal components are obtained as the eigenvectors of a covariance matrix describing the covariance of the positions of selected atoms. For a protein, typically the Cα atoms are analysed. The principal components can be projected onto the simulated system and visualised, enabling straightforward inspection of the major motions. This could for instance be a breathing movement of two protein domains or the outward movement of a loop region upon binding of a small molecule.
1.2.3.2 Nucleic Acids
The conformational space of DNA or RNA is quite diverse and dynamic, reflecting their ability to change depending on the physicochemical properties of the surrounding environment, local sequence motif, and interactions with other molecules. Thus, apart from RMSD and analysis of the inter and intramolecular networks of contacts, assessment of nucleic acids simulations is directed towards capturing such conformational interplay. The geometry of DNA, and to some extent RNA, can be described in terms of helical parameters (pitch and diameter of the helix), groove parameters (depth and width), furanose ring conformation, six torsional angles of the backbone, rotational (tip and inclination) and translational base pair parameters, six intra-base parameters (buckle, propeller, opening, shear, stretch, and stagger), and six inter-base parameters (tilt, roll and twist, shift, slide, and rise). The most popular programs that analyse nucleic acids simulations in terms of the mentioned degrees of freedom include Curves+ and Canal,38 and 3DNA.39 As nucleic acids exist and function as salts, the behaviour of the surrounding counterions is an integral part of the analysis, and could now be done with, e.g., the program Canion.40
1.2.3.3 Membranes
To assess the equilibration of a membrane simulation, several simple geometric properties are typically calculated such as the area and volume per lipid as well as the thickness of the membrane.41 The area and volume can be calculated straightforwardly from the box dimensions and by assuming the water density. There are several kinds of thicknesses that can be computed, but a simple one is to measure the distance between the peaks of the density of the phosphate atoms (or equivalent). It is also common to calculate an order parameter of the fatty acyl chain (eqn (1.8))
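which in its standard form is

$$S_{\mathrm{CD}} = \left\langle \frac{3\cos^2\theta - 1}{2} \right\rangle \qquad (1.8)$$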
where θ is the angle between the bilayer normal and a carbon–deuterium bond in the acyl chain, and the brackets indicate an average over the simulation. For coarse-grained simulations, the bond between two neighboring atoms replaces the carbon–deuterium bond. It is therefore not correct to compare order parameters from atomistic and coarse-grained simulations. In both cases, the order parameter gives information on the phase of the membrane, i.e. if it is in the fluid or liquid-ordered phase.
1.2.3.4 Small Molecules
When performing simulations of small molecules (either as solvated entities, in a larger ‘bulk’ system, or as part of e.g. a biomolecular complex), an interesting analysis to perform is clustering. This provides information on the different kinds of conformations the molecule attains during a simulation, and can be used both to assess if the simulation is stuck in a local conformation and to investigate the probability of different conformations. There is a plethora of different clustering methods available and it is outside the scope of this text to discuss them at any length; the interested reader is referred to the literature.42
1.2.4 Software
Over the last two decades, with the development of new simulation algorithms and new technologies in hardware platform design, molecular simulations have dramatically increased in size, length and system complexity. The appearance of a variety of standardised molecular modelling software packages, including GROMACS,43 Amber,44 CHARMM,45 GROMOS,46 and NAMD,47 has transformed the field of computational chemistry by commoditising molecular simulations and making them accessible to a broader group of researchers. All these packages have complementary strengths and profiles, with GROMACS and NAMD being two of the most popular. Considering GROMACS and NAMD purely as MD engines, there is no dramatic difference in their performance; both work with a variety of force fields, and have GPU acceleration implemented. However, small differences should be mentioned, such as the possibility to perform QM/MM simulations in GROMACS, or NAMD's extensibility through user-written scripts. Both packages are distributed free of charge with source code. Moreover, for NAMD there are downloadable binaries for a variety of platforms. This can be useful for a beginner in computational chemistry, as compilation of MD software might not always be straightforward. Both GROMACS and NAMD are parallel molecular dynamics engines, designed for high-performance simulations of large biomolecular systems, with GROMACS being better suited for simulations of smaller systems on medium-size supercomputers. To achieve the best performance for a particular system on a particular supercomputer we recommend initial benchmarking. For both GROMACS and NAMD a wide variety of tutorials are available. External software packages, like PLUMED,48 can provide additional functionality, such as the enhanced molecular dynamics techniques mentioned above.
Various pieces of software are used for visualisation and analysis of molecular dynamics trajectories. Among the most popular and freely accessible tools are the molecular modelling programs VMD49 and UCSF Chimera.50 VMD (visual molecular dynamics) is a specialised molecular visualisation program for displaying, animating, and analysing molecular dynamics trajectories, extensively used with any MD software. UCSF Chimera, on the other hand, is a highly extensible program for interactive visualisation and analysis of molecular structures and related data, including density maps, supramolecular assemblies, sequence alignments, docking results, trajectories, and conformational ensembles. Both programs can be used to create professional illustrations. VMD and UCSF Chimera can also perform basic structural analysis, but for more extensive assessment of trajectories, such as clustering or modifications of system topologies, AmberTools and, in particular, cpptraj51 are recommended.
1.2.5 Examples
Molecular simulations have been used in numerous applications, to obtain the structure and dynamics of many biomolecules in order to elucidate biochemical functions, processes and pathways. In principle, any molecular system can be simulated, and two typical examples are shown in Figure 1.4. Although many of the simulations have studied individual macromolecules solvated in water or a model membrane, there has also been some effort in looking at larger assemblies. Already a decade ago, Schulten and co-workers used a massively parallel supercomputer to study the complete satellite tobacco mosaic virus, described with an all-atom force field.52 They were able to accumulate 50 ns of simulation time of a system containing roughly 1 million atoms and could conclude that the virus capsid became unstable upon removal of the core RNA molecules. Using a coarse-grained model, Sansom and co-workers simulated the influenza A virion, a significantly larger system.53 By using a CG model, they could simulate at the microsecond timescale and in addition investigate alterations to the membrane envelope and sensitivity to temperature.
Long time scales, such as micro- or milliseconds, are generally not accessible when using an all-atom force field. The exception is if special-purpose hardware is used, as in the work from Shaw and co-workers. They reported the first continuous millisecond simulation of a protein described with an all-atom force field when they studied the folded structure of BPTI.54 Several distinct conformational states were found, separated by large kinetic barriers. Shaw and co-workers have also used simulation techniques to find flaws in current protein force fields, which could not previously be detected due to the typically short simulations. An alternative to running long, continuous trajectories that has become popular lately is to assemble many short simulations using Markov state models.55 They have for instance been used to study protein folding and how this immensely complicated process is affected by solvent effects and electrostatics.56
The above-mentioned examples represent extremes, both in terms of system size and length of the simulations. Molecular simulations of much more modest dimensions are routinely used to gain insight into biological processes and to complement wet-lab experiments. A very fruitful application of simulations is found in the field of enzymology.57 Here the substrate, a few active site residues and coordinating waters and ions are described with quantum mechanics, whereas the rest of the system is described with a molecular mechanics force field. Such simulations have been used in numerous applications, for instance to elucidate enzymatic mechanisms, understand the nature of catalytic power and design enzymes.57,58 Another common application of molecular simulation is the estimation of binding free energies of small molecules, e.g. drugs, to their targets. Jorgensen and co-workers routinely use such simulations to aid in their drug design pipeline.59 This is done by systematically introducing small chemical groups such as a hydroxyl or methyl group in a lead compound and evaluating their contribution with alchemical free energy simulations.
The above illustrations provide a few examples of what is possible with molecular simulations, with many more provided in the subsequent chapters of this book.
1.3 Protein Structure Prediction
Protein structure prediction is often listed among activities within the bioinformatics area, and essentially covers approaches enabling us to go from primary sequence (be it nucleic or amino acids), via secondary and tertiary structure, to quaternary structure and possibly also function of the resulting protein. This follows the central assumption that a protein's primary sequence and the inherent properties of the amino acid side chains dictate the final folded three-dimensional structure. Besides the above predictions, which are generally obtained through knowledge-based potentials or algorithms, or by comparing to already existing structures of systems with similar amino acid sequence, analysis of the quality of the resulting model is an essential part of protein structure prediction. We will in this section go through the different steps involved in structure prediction, including tools for analysis and some of the available software and web based solutions.
1.3.1 Sequence Alignment and Secondary Structure Prediction
Assuming we know the primary structure; that is, the amino acid sequence of our protein (or, if we have the DNA sequence, have translated this to the corresponding amino acids, and assuming there are no post-translational modifications that will alter the sequence), the next step is normally to search a database of protein structures for homologous sequences to which we can compare our query through sequence alignment. Normally, the coordinate repository of X-ray, NMR and EM structures of the RCSB Protein Data Bank (www.rcsb.org), originally established at Brookhaven National Laboratory, or any of its sister sites (PDBe, PDBj, BMRB), or a related database including refined protein structures, is used. To date the Protein Data Bank contains over 110 000 solved protein structures, with an annual growth of around 8500–9000 new entries, as well as structures of DNA, RNA and protein–nucleic acid complexes.
The search against databases to identify homologous sequences is normally performed using BLAST (Basic Local Alignment Search Tool)60 or FASTA.61 Nowadays BLAST, which comes in several variants depending on the type of algorithm, sequence and database, is the more common. A BLAST search systematically compares three-letter segments of the query sequence, referred to as words, to the database of templates step by step in a heuristic approach. For example, a sequence AHKRV is searched as the words AHK, HKR, KRV; this initial search is referred to as ‘seeding’. Comparison of words from the query sequence with words in the database of known sequences is made both based on identity (each residue is matched perfectly) and on similarity (similar function/property/size, but not identical), and a total score is calculated using a scoring matrix such as BLOSUM62 (BLOcks SUbstitution Matrix).62 For example, according to BLOSUM62, an arginine matched by another arginine is given the value +5, arginine vs. lysine is +2, and arginine vs. cysteine is −2. Other scoring matrices and approaches also exist, such as identity scoring matrices, minimal mutation distance matrices and point accepted mutation (PAM) matrices.
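The word-based comparison can be illustrated with a small Python sketch; only the few BLOSUM62 entries quoted above are included, so the matrix here is a placeholder fragment rather than the full 20 × 20 matrix.

```python
# Fragment of the (symmetric) BLOSUM62 substitution matrix; only the values
# quoted in the text are filled in, the full matrix covers all 20 amino acids.
BLOSUM62_FRAGMENT = {
    ("R", "R"): 5,
    ("R", "K"): 2,
    ("R", "C"): -2,   # value as quoted in the text
}

def words(sequence, length=3):
    """Split a sequence into overlapping words (the 'seeding' step)."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]

def word_score(word_a, word_b, matrix):
    """Score two equal-length words by summing per-residue substitution values;
    pairs missing from this placeholder fragment default to 0."""
    return sum(matrix.get((a, b), matrix.get((b, a), 0))
               for a, b in zip(word_a, word_b))

print(words("AHKRV"))   # ['AHK', 'HKR', 'KRV']
```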
After the 3-letter word search is done, the word length is extended to nearest and next-nearest neighbours, and possible alternative alignments are assessed and scored. An example of the latter is shown in Figure 1.5A, assuming a query sequence AHRKCCVGA to be matched against the template sequence AGRKKCVGGA, where different parts given as gaps or insertions provide different scores and result in shorter segments or additional loops of the modelled protein.
If we have several templates to compare against, we must, in addition to the possible alignments as above, also consider which of those alternative yet slightly mismatching sequences fits the best; e.g. assuming we again have our query sequence AHRKCCVGA, is AHRKSVCVGGA or AHRACKVCVGA a better template (cf. Figure 1.5B)? In the case of multiple sequences to which we compare our query, this is referred to as multiple sequence alignment, for which the most common methods are the iterative PSI-BLAST63,64 and CLUSTALW.65 Sequence alignment approaches are also commonly used to explore the similarity of a certain protein between different species, to identify conserved residues and motifs, and the like.
Once the best-scoring template(s) have been determined, we next compute the most likely secondary structure elements of the query sequence, again by stepwise comparing 3- and 5-letter segments and their likelihood of forming any of the common motifs. Each amino acid is scored (often using a scale from 0 to 9) based on its probability of attaining a certain structural motif in its local environment, and the prediction is compared to the template structure(s) that have been selected from the (PSI-)BLAST search. Secondary structure predictions are commonly displayed or mapped graphically in some way, e.g. a red bar for helix, a yellow arrow for β-sheet (cf. Figure 1.6A as an example). The aim is to determine which parts of our query sequence are locally more likely to form α-helices, which sections have a propensity for formation of β-sheets, and where loops or coil regions will be. This, again, is used to match the similarity against the obtained templates, and assist in the folding predictions. In addition, predictions are also frequently made based on the properties of the amino acid sidechains, i.e. which segments or parts of the sequence are more hydrophobic or more hydrophilic. The rationale for this in accurate structure determination becomes obvious if we e.g. compare a globular protein present in the cytosol (hydrophilic surface and hydrophobic interior) with a membrane-spanning ion channel (hydrophobic residues on the outside, interacting with the membrane lipids, and hydrophilic residues in the interior, lining the pore).
1.3.2 Comparative Modelling Approaches
Having determined suitable templates for our query sequence, and most likely secondary structure elements, i.e. the primary fold of the sequence, we next organise or pack our query structure according to the templates in order to generate a tertiary structure model. This is referred to as comparative protein modelling, and relies on the fact (as far as we can conclude from the currently available determined protein structures) that, although the number of possible proteins is essentially infinite, the number of folds is limited to approximately 2000 different ‘types’. That is, provided the sequence similarity (or ∼identity) between two proteins is sufficient, the two structures will in all likelihood have essentially the same backbone topology in the aligned regions.
Normally one separates the field of comparative protein modelling into homology modelling and threading/fold recognition. In homology modelling, we assume that if two sequences are so closely related that they can be satisfactorily aligned, they will also attain the same three-dimensional structure. This approach views the problem from the standpoint that folds are more evolutionarily conserved than the actual sequences. Clearly, the more identical the sequences, the better the model: if two sequences share 70% identity, the accuracy of the modelled structure is claimed to be similar to that of a crystal structure with resolution (Cα RMSD) in the range of 1–2 Å; at 25% identity, the structure corresponds to a resolution of 2–4 Å.
If there are no apparent homologous proteins identified, secondary structure prediction is necessary and is compared to the template database, after which the sequence is ‘threaded’ according to the fold of the best-matching recognised template(s). This is also referred to as 3D–1D fold recognition, as it links a primary sequence to a three-dimensional structure. 3D–1D alignment is sometimes also included as an intermediate step in the above homology modelling.
There are a number of variants of the above, such as fragment assembly and segment matching, which in essence mean that smaller parts of the sequence are modelled separately and then combined into a full protein structure.
In the unfortunate event that no reasonable templates can be identified, the ‘last resort’ is referred to as ab initio structure prediction. In this case the secondary structure elements are assembled stochastically, normally using a Monte Carlo type of algorithm, combined with refinement and (possibly) shorter simulations in order to generate a large number of potential three-dimensional models which are assessed and either discarded or improved further in successive iterations. One must, however, be very cautious when it comes to the interpretation of protein structures generated entirely without prior knowledge or templates, as the uncertainty in predicting the appropriate spatial arrangement between secondary structure elements is very high.
Regions of high flexibility, such as loops or the C- and N-termini, are normally poorly resolved or missing in X-ray crystal structures, and loop modelling is hence one of the approaches by which homology modelling can be used in order to improve a protein structure. Care should, however, be taken if the modelled loop is longer than ten amino acids.
Once a model is available, the side chains need to be optimised (packed) properly. This is done by successively evaluating the energetics of different rotameric states of the sidechains, either by actual energy calculations or using rotamer libraries, and determining the lowest-energy (i.e. most stable) overall configuration. Finally, the model is generally subjected to some form of energy relaxation or minimisation. In Figure 1.6B, an example of a homology model is shown, superposed onto its template. The overall Cα RMSD is 1.75 Å over 433 residues, with 31.5% sequence identity between query and template. As seen, the agreement for the ordered secondary structure regions is very high, whereas the main deviations are noted in the loops and termini.
Several programs also have the capability to generate models from different templates, and merge the best-matching local segments thereof to construct a hybrid, multiple-template model. In general, one main template is in those cases used for the core structure, while smaller fragments that are less accurately determined, such as loops or stretches where the sequence similarity to the ‘core template’ is particularly low, are replaced using the other templates.
1.3.3 Function Prediction
Predicting the explicit role or function of a protein, or different parts thereof, is more difficult, and is, compared to the above protein structure determination, still at a very early stage of development. In essence, protein function prediction relies heavily on identifying homologous regions, either by sequence motifs or by 3D structure alignment, to identify possible domains and, by analogy with the identified templates, their specific roles. However, herein lies also the aspect of paralogs, proteins that have evolved from a common ancestor into structurally very conserved entities but where the functional role is entirely different, which makes the task even more difficult. It lies beyond the scope of the current introductory chapter to also cover such aspects, although we do mention some servers and other tools in Section 1.3.5 below.
1.3.4 Analysing the Quality of the Modelled Structure
Once a structure has been modelled, it is crucial to also assess its quality. Some quality checks are already embedded in the routines employed in the model development (such as BLAST E-value, BLOSUM score, side chain dihedrals and packing score), but it is recommended that a thorough assessment is made using some of the many servers and programs available. Some of the key tools are included herein.
The first assessment to be made is to produce a Ramachandran plot of the obtained structure, which displays the values of the ϕ and ψ angles of the protein backbone, and which provides a picture of the stereochemical quality of the amino acids. The RAMPAGE server (http://mordred.bioc.cam.ac.uk/~rapper/rampage.php) is one of the main tools for this. A large number of outliers (i.e. values of the backbone torsional angles that do not fall into the allowed or generously allowed regions) indicates that the model has significant problems.
Secondly, the folding reliability can be evaluated using the Verify3D server (http://services.mbi.ucla.edu/Verify_3D). This assessment evaluates the likelihood that a particular residue in a particular sequence context takes part in the predicted 3D fold, and provides an estimate of correct vs. incorrect folding for each amino acid. Scores below zero indicate serious folding problems, and 80% of the residues of a protein should attain values ≥0.2 in the 3D–1D profile for the model to have acceptable folding reliability.
The absolute quality of a model can be obtained using the QMEAN Z-score server (http://swissmodel.expasy.org/qmean/cgi/index.cgi), which calculates the quality of the model by combining six different structural descriptors, such as secondary structure elements, solvent accessibility and torsional angles. The normalised mean value is then compared against the corresponding values of a non-redundant set of high-resolution experimental structures of similar size (±10%) solved through X-ray crystallography.
Many other quality and property checks can also be performed besides the basic tools listed above. Examples are the determination of druggable sites and their relative scores, related to the docking simulations discussed in Section 1.4, using e.g. ProteinsPlus, http://proteinsplus.zbh.uni-hamburg.de (which recently replaced DoGSiteScorer), and the assessment of metal binding site interactions, if present, through the server CheckMyMetal (CMM), http://csgid.org/csgid/metal_sites.
1.3.5 Software and Web Based Servers
There are several programs available for protein structure determination, both as standalone codes to be installed on the user's own computer or cluster, and as web based servers. The current compilation is not intended to be exhaustive, but is meant to provide a sample of the different options available; most (but not all) of the programs listed are free for academic users. Each program has its pros and cons, and the reader is advised to read up on the different codes and approaches first, and preferably to test more than one code in order to build up experience of what works best for his or her particular needs. Lastly, a slight word of caution: although there is a plethora of web based programs available, one must always remember that submitting your computation to someone else's computer (server) means you have no control over the results, including aspects pertaining to safety/security.
I-TASSER—server and downloadable (http://zhanglab.ccmb.med.umich.edu/I-TASSER).66
LOMETS—meta-server combining nine different programs (http://zhanglab.ccmb.med.umich.edu/LOMETS).67
MODELLER—a standalone program and server; several graphical interface programs are also available that use MODELLER (http://salilab.org/modeller).68
MOE—a standalone program, with license fee (http://www.chemcomp.com).69
PHYRE2—server (http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index).70
PRIME—part of the Schrödinger package; standalone, with license fee (http://www.schrodinger.com/Prime).
RAPTORX—server and downloadable (http://raptorx.uchicago.edu).71
ROBETTA—server, and as part of the downloadable Rosetta3 package (https://web.archive.org/web/20150819163428/http://www.robetta.org).72
SWISS-MODEL—server (http://swissmodel.expasy.org).73
YASARA—standalone, minor license fee (http://www.yasara.org).74
For protein function or domain prediction, fewer programs are available as yet. However, we refer the interested reader to any of the following servers:
1.4 Computer-based Drug Design
Computer-based drug design (CBDD) or computer-aided drug design (CADD) refers to the application of different computational methodologies and algorithms for developing bioactive compounds. It is currently an independent discipline within computational chemistry, mainly because it focuses on predicting/designing the next potential bioactive molecule to synthesise and test. It is well known that drug research and development is not only time-consuming, but also very expensive. It has been estimated that developing a new drug from idea to market takes ca. 14 years, with an associated cost from 800 million to 2 billion dollars.78,79 In fact, the overall cost is increasing every year, mainly for specialised drugs for smaller patient populations.80 This emphasises the benefits of applying computational tools in the early stages of drug discovery, thereby reducing the cost, the required time and the inherent risks such as late-stage failure.81 The well-known ‘fail fast, fail early’ pharmaceutical mantra is the goal.82,83 While high-throughput screening (HTS) of large compound libraries is still the major source for discovering new hits in drug design, CBDD is currently playing a key role in the search for novel bioactive compounds, both in the pharmaceutical industry and in academia.84 A comparison between the two techniques has been reported in the screening for novel inhibitors of Protein Tyrosine Phosphatase-1B (PTP1B), an enzyme implicated in diabetes. The HTS of 400 000 compounds resulted in 85 hits actually inhibiting the enzyme (0.021%). On the other hand, 365 high-scoring molecules were obtained from the virtual screening, 127 of which inhibited PTP1B (34.8%). These results clearly showed that CBDD increased the hit rate over random (HTS) screening.85 Thus, the application of computational tools allows for covering a larger part of chemical space, while at the same time the number of compounds that must be synthesised is drastically reduced.
CBDD can be classified into two main classes: structure-based drug design (SBDD) and ligand-based drug design (LBDD).86,87 SBDD is based on knowledge of the 3D structure of the target protein, using virtual screening techniques to search for molecules with complementarity towards the selected target. For SBDD, molecular docking, virtual screening and molecular dynamics are the most important underlying methodologies.88 LBDD does not require knowledge of the protein structure, instead using the information provided by known active and inactive compounds to find potential hits by similarity searches or quantitative structure–activity relationship (QSAR) studies.87 The latter is usually the selected methodology when there is no structural information available for the target system.
1.4.1 Pre-requisites for SBDD—Sampling Algorithms and Scoring Functions
Molecular docking is a methodology that attempts to predict the conformation of ligands within the receptor binding site.89 The identification of the ‘best pose’, i.e. the ligand internal conformation and orientation towards the receptor, involves searching the ligand conformational space (sampling or posing) and ranking of the predicted binding conformations (scoring).
1.4.1.1 Sampling Algorithms
Docking most frequently deals with ligand flexibility and in some cases with protein flexibility. Virtual screening employing a large ligand library (virtual high-throughput screening, VHTS) is a very time- and resource-consuming procedure. Therefore, usually the ligand and receptor are both treated as rigid bodies (rigid ligand and rigid receptor docking) in the initial screening. Even though the search space is restricted, large libraries can be rapidly explored and filtered. More often, molecular docking employing smaller libraries treats the ligand as flexible, while the receptor is kept fixed (flexible ligand and rigid receptor docking). Finally, incorporating receptor flexibility is the most accurate and costly methodology, and it is usually employed for refining previous docking rounds (flexible ligand and flexible receptor docking).90,91 Treatment of ligand flexibility includes systematic, stochastic, and simulation methods.92–94
1.4.1.1.1 Systematic Methods
These methods account for ligand flexibility by exploring the conformational space of the molecule. After searching the ligand's degrees of freedom, the method converges to the most likely binding mode. As with all ‘down-hill’ methods, a systematic search can converge to a local minimum rather than the global one, a problem that can be overcome by performing several searches starting from different initial ligand conformations.95 When exploring all possible degrees of freedom in a ligand (exhaustive search), the number of possible combinations is usually prohibitive, leading to the so-called problem of combinatorial explosion. An alternative to exhaustive search is to employ incremental construction algorithms.93 Docking programs such as DOCK,96 FlexX97 and Glide98 apply an incremental construction search method, in which the ligand is first divided into fragments. One fragment is selected as anchor (usually the largest fragment) and docked in the binding site. The remaining fragments are incrementally added until the entire ligand is built.90
1.4.1.1.2 Stochastic Methods
Stochastic methods perform a random search of the conformational space by making random changes to the ligand or a population of ligands. Such changes include translational, rotational and internal modification of the ligand's coordinates. This strategy allows for finding the global minimum, covering also a larger conformational space. MC, genetic algorithms and tabu search are typical algorithms belonging to this class.92,94
MC methods make random modifications to the ligand structure and the resulting conformation is tested according to the Metropolis criterion, which accepts conformations with a lower energy, and higher-energy states when the Boltzmann factor is greater than a random number between 0 and 1 (see also Section 1.2.2). Programs such as Autodock99 and MOE69 employ MC methods for sampling.
Genetic algorithms (GA) have their roots in Darwin's theory of evolution and natural selection. All structural parameters are encoded in genes, and a particular pose is referred to as a ‘chromosome’. The random search algorithm then generates several chromosomic mutations (i.e. several poses), which are in turn evaluated in terms of energy. The best adapted chromosome will be the one with the lowest energy, and is thus selected to be used in the next generation. The next generation is populated with poses having increasingly favourable structural characteristics. After several generations (several conformational search cycles), the energy minimum conformation is reached.100 Programs implementing genetic algorithms are Autodock101 and Gold.102
Tabu search (TS) is a heuristic method originally proposed by Glover in 1986.103 The algorithm proceeds stepwise from a conformation, generating a number of moves from the current solution. The moves are scored, ranked using the energy function and examined. The method keeps a list of the previously visited solutions, and a move is considered ‘tabu’ if it generates a solution that is not sufficiently different from the previous ones. The algorithm calculates the root mean square deviation (RMSD) between the current move and all previously recorded solutions, and moves that fall below a cut-off RMSD to any recorded solution (i.e. that are not sufficiently different) are rejected. The tabu search continues for a user-defined number of iterations.104 Examples of programs implementing tabu search are MOE69 and PRO_LEADS.105
1.4.1.1.3 Simulation Methods (MD)
MD simulations (cf. Section 1.2.2) are also used in the context of molecular docking, allowing the flexibility of both the ligand and the receptor to be represented. However, MD on its own is not the best method for simulating ligand–target binding, mostly because of its intrinsic difficulty in crossing high-energy barriers, which leads to poor sampling. Considering that MD simulations are efficient at exploring the local energy surface, the best approach is to use a systematic or random search in order to find the most likely conformation for the ligand, followed by MD simulations.90 For techniques dealing with receptor flexibility in particular, see Section 1.4.2.
1.4.1.2 Scoring Functions
Molecular docking programs predict binding conformations employing sampling algorithms, and the evaluation of these poses, to estimate the energy of the ligand–target interaction, is crucial. To this end, scoring functions are employed, aiming to rank the complexes and discriminate correct poses from incorrect ones. Therefore, the design and proper use of scoring functions is of utmost importance in SBDD. Scoring functions can be classified into four types: force field-based, empirical, knowledge-based and consensus scoring functions.90,92–94,106–109
1.4.1.2.1 Force Field-based Scoring Functions
Force field-based scoring functions estimate the binding energy using classical molecular mechanics formulations, calculating the sum of the non-bonded interactions (i.e. electrostatic and van der Waals terms). The electrostatic terms are calculated by a Coulombic formulation using a distance-dependent dielectric function to account for charge–charge interactions, whereas the van der Waals terms are usually described by a Lennard-Jones potential function (see Section 1.2.1). The parameters of the Lennard-Jones term can modify the ‘hardness’ of the potential, which in turn changes the preferred distance between the receptor and ligand atoms. The most important limitations of force field-based scoring functions include the introduction of cut-off distances for the treatment of non-bonded interactions, which reduce the accuracy in calculating long-range effects involved in binding. In addition, force field scoring functions do not estimate entropic contributions and solvation energies.110 The results of scoring with force field-based functions can be refined through calculation of linear interaction energies (LIE),111,112 inclusion of generalised Born solvation (MM/GBSA)113 and free-energy perturbation methods (FEP).114–117 Programs such as Gold,102 Dock96 and AutoDock101 employ force field-based scoring functions.
1.4.1.2.2 Empirical Scoring Functions
Empirical scoring functions, originally proposed by Böhm,118,119 fit parameters to reproduce experimental data, such as binding energies. The binding energy is decomposed into several weighted terms, such as hydrogen bonds, hydrophobic contacts, desolvation energy, ionic interactions and binding entropy. The coefficients of each term are calculated from a regression analysis using experimental information from a training set of ligand–protein complexes with known binding affinities. Although empirical scoring functions are simple to evaluate, the major drawback is their dependence on the training set employed and thus the transferability of the weighted parameters.120–124 ChemScore120 and FlexX97 are examples of programs using empirical scoring functions.
1.4.1.2.3 Knowledge-based Scoring Functions
These functions are designed to reproduce experimentally determined complex structures. They are based on the assumption that more favourable interatomic contacts occur with higher frequency, and the algorithms model the observed frequency distributions as pairwise atom-type potentials. The score is calculated as a sum over the individual pair interactions. The functions are computationally simple, allowing large compound databases to be screened, but as they rely on the training set used to derive the parameters their applicability is limited.125–128 Knowledge-based potentials have been implemented in the potential of mean force (PMF)126,129–131 and DrugScore132 scoring functions.
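The underlying idea can be illustrated with the inverse Boltzmann relation, which converts an observed pair-distance distribution into a pairwise potential, as sketched below. The observed and reference distance samples are assumed inputs, and the actual definitions used in PMF and DrugScore differ in detail.

```python
import numpy as np

def pmf_from_distances(distances, reference, bins=50, r_max=10.0, kT=0.593):
    """Knowledge-based pair potential via the inverse Boltzmann relation:
    w(r) = -kT * ln[ g_obs(r) / g_ref(r) ].
    `distances` are observed distances for one atom-type pair collected from
    experimental complexes; `reference` is a background distance distribution.
    kT is in kcal/mol at ~298 K."""
    edges = np.linspace(0.0, r_max, bins + 1)
    g_obs, _ = np.histogram(distances, bins=edges, density=True)
    g_ref, _ = np.histogram(reference, bins=edges, density=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        w = -kT * np.log(np.where(g_ref > 0, g_obs / g_ref, np.nan))
    return edges, w    # bin edges and the pairwise potential per bin

# The total score of a pose is then the sum of w(r) over all
# receptor-ligand atom pairs, using the potential for each atom-type pair.
```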
1.4.1.2.4 Consensus Scoring Functions
A more recent strategy is the introduction of consensus scoring functions. This scheme combines different scoring functions, aiming to improve on the single scores and to increase the ligand enrichment (i.e. the percentage of strong binders among the high-scoring ligands). However, when the employed scoring functions are significantly correlated, the method may magnify the calculation errors rather than attenuate them.133–135 CScore implements consensus scoring by combining the Gold, DOCK, PMF, ChemScore and FlexX scoring functions.136
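A simple rank-based consensus is sketched below: each scoring function contributes a rank rather than a raw score, and the average rank is used to order the ligands. The numbers are illustrative, and real schemes such as CScore may combine the individual scores differently.

```python
import numpy as np

def consensus_rank(score_table):
    """Rank-by-rank consensus: each column holds one scoring function's
    scores (lower = better) for the same set of ligands/poses; the
    consensus is the average rank across the functions."""
    score_table = np.asarray(score_table, dtype=float)
    # argsort applied twice converts scores to ranks (0 = best) per column
    ranks = np.argsort(np.argsort(score_table, axis=0), axis=0)
    return ranks.mean(axis=1)

# Example: 4 ligands scored by 3 different functions; lowest consensus rank wins
scores = [[-9.5, -45.2, -7.1],
          [-8.0, -50.1, -6.0],
          [-9.9, -41.0, -7.5],
          [-7.2, -38.7, -5.5]]
print(consensus_rank(scores))
```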
1.4.2 Structure Based Drug Design (SBDD)
SBDD makes use of high-throughput virtual screening (HTVS) techniques to search for bioactive compounds, identifying hits out of thousands of molecules by detecting complementarity between the ligand and the biological target. Selected compounds are ranked employing different scoring functions, and eventually the selected hits are evaluated experimentally to assay the biological activity against the chosen target.137,138 SBDD consists of the following key steps: (1) preparation of the target receptor, (2) compound database selection and preparation, and (3) molecular docking (i.e. determination of a favourable binding pose for each compound and ranking of the docked structures).
The first step involves the preparation of the target receptor, which is of utmost importance because the representation of the active site affects the quality of ligand posing and scoring. Experimentally determined structures of many receptors are available, mostly through X-ray crystallography and NMR spectroscopy. When the receptor structure is not available experimentally, it is possible to create a model starting from the sequence using homology modelling (see Section 1.3). Once the receptor model has been selected, it has to be prepared for molecular docking studies, usually by adding hydrogen atoms, removing water molecules unless they mediate important interactions, calculating partial charges and assigning tautomerisation states.139 The initial selection of the structure is critical in the sense that small conformational changes arising from ligand binding strongly influence the results, e.g. when using holo-proteins, apo-proteins or homology models as targets.140
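As an example of this preparation step, the sketch below uses the PDBFixer/OpenMM toolkit (one possible choice among many, not one prescribed in this chapter) to add missing atoms and hydrogens and to strip heteroatoms from a hypothetical receptor file. Partial-charge assignment and the manual retention of structurally important waters are left to the docking program of choice.

```python
# Minimal receptor-preparation sketch using PDBFixer and OpenMM.
from pdbfixer import PDBFixer
from openmm.app import PDBFile   # in older versions: simtk.openmm.app

fixer = PDBFixer(filename="receptor.pdb")   # hypothetical input structure
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()
fixer.removeHeterogens(keepWater=False)     # drop waters/ligands; keep key waters manually if needed
fixer.addMissingHydrogens(pH=7.0)           # protonation states at the chosen pH

with open("receptor_prepared.pdb", "w") as fh:
    PDBFile.writeFile(fixer.topology, fixer.positions, fh)
```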
The second step is the selection and preparation of the small-molecule database. Several public databases containing millions of compounds and the associated chemical information are freely accessible.141,142 The most common chemical databases used in HTVS are ZINC,143,144 PubChem,145,146 DrugBank,147,148 ChemSpider,149,150 ChemBank,151,152 eMolecules,153 ChEMBL,154,155 ChemDB,156,157 and BindingDB.158,159
Preparing the original libraries involves different steps depending on the software employed to perform the screening, but commonly includes the correct assignment of stereochemistry, partial charges and ionisation states according to the selected pH. In addition, several filters may be applied in order to enrich the library in compounds with the physicochemical properties expected of potential ligands. Usually, the database is filtered according to Lipinski's rule of five to ensure drug-likeness.160–163
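A minimal rule-of-five filter is sketched below using the RDKit toolkit (an assumed choice, not one prescribed by the chapter). Molecules are read from SMILES strings and, as is common practice, at most one violation is tolerated.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_rule_of_five(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    violations = sum([
        Descriptors.MolWt(mol) > 500,        # molecular weight
        Descriptors.MolLogP(mol) > 5,        # calculated logP
        Lipinski.NumHDonors(mol) > 5,        # hydrogen-bond donors
        Lipinski.NumHAcceptors(mol) > 10,    # hydrogen-bond acceptors
    ])
    return violations <= 1                   # at most one violation tolerated

library = ["CCO", "CC(=O)Oc1ccccc1C(=O)O"]   # illustrative SMILES strings
drug_like = [s for s in library if passes_rule_of_five(s)]
```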
The third step involves docking experiments of the prepared small-molecule database into the prepared receptor binding site, and the analysis of the resulting docked conformations (Figure 1.7).
Exploration of ligand flexibility was described in previous sections of this chapter. Biological molecules are intrinsically mobile, and consequently the representation of receptor flexibility is an important aspect of SBDD.165 Incorporating receptor flexibility is a challenge in molecular docking due to the computational cost of modelling the many additional degrees of freedom. Methods for accounting for receptor flexibility include the use of soft potentials (soft-docking), the use of rotamer libraries, the inclusion of side chain flexibility, and ensemble docking.166 Soft-docking decreases the van der Waals repulsion term in the scoring function, allowing partial overlap between the receptor and the ligand. This method is simple but does not introduce genuine flexibility.167 Employing rotamer libraries involves searching within the library for possible conformations of the residue side chains. Even though it is computationally efficient, it is highly dependent on the database used and ignores backbone flexibility.168 Including side chain flexibility consists of sampling selected side chains of the receptor simultaneously with the ligand, using for instance genetic algorithms. The main drawback is that only the selected side chains are accounted for (the others are treated as rigid) and the backbone is not considered flexible. Finally, using an ensemble of protein conformations, obtained from e.g. NMR experiments, allows the ligand to be docked into several receptor structures (i.e. different conformations), considering in each round a flexible ligand and a rigid receptor. This method accounts for receptor flexibility, but only for those protein conformations included in the ensemble.169,170 Even though several pitfalls of SBDD are well known,171,172 it has been successfully employed to identify potent hits in many drug discovery studies.173–181
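As an illustration of ensemble docking, the sketch below loops over a set of receptor conformations (e.g. NMR models or MD snapshots) and docks a single flexible ligand into each rigid conformation using the AutoDock Vina command-line program, chosen here merely as one possible engine. The file names, box centre and box size are placeholders, and the input structures are assumed to be in PDBQT format.

```python
import subprocess

receptor_conformations = ["rec_model1.pdbqt", "rec_model2.pdbqt", "rec_model3.pdbqt"]

for i, receptor in enumerate(receptor_conformations):
    # one rigid-receptor, flexible-ligand docking run per conformation
    subprocess.run([
        "vina",
        "--receptor", receptor,
        "--ligand", "ligand.pdbqt",
        "--center_x", "12.0", "--center_y", "4.5", "--center_z", "-8.0",
        "--size_x", "20", "--size_y", "20", "--size_z", "20",
        "--out", f"poses_model{i + 1}.pdbqt",
    ], check=True)

# The best pose over all runs is then selected from the reported affinities.
```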
1.4.3 Ligand Based Drug Design (LBDD)
LBDD involves the analysis of ligands known to interact with the selected target. Several molecular descriptors are calculated for a set of reference compounds, i.e. compounds known to be active, which in turn are applied as molecular filters. The filtering is thus employed to select compounds sharing characteristics with the reference set. These methods do not require any information about the structure of the receptor. A distinct approach is the construction of a QSAR model that predicts the biological activity from the chemical structure.182,183
Different molecular descriptors can be calculated and the selected set depends on the biological function to be predicted. Molecular descriptors can be 2D (depending only on the topological connectivity) or 3D (depending on the geometry). 2D descriptors include physical properties such as atomic charges, polarisability, log P (logarithm of partition coefficient between n-octanol and water), solubility in water, volume, number of hydrogen bond donor/acceptor atoms, and molecular weight. 3D descriptors include properties such as total energy (and its components), ionisation potential, and HOMO/LUMO energy.94,184–186
After the molecular descriptors have been calculated, fingerprint techniques (similarity searches) may be used to search databases for compounds similar in structure to a query (usually a lead compound).
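A minimal similarity search is sketched below using RDKit Morgan (circular) fingerprints and the Tanimoto coefficient. The query and library SMILES, as well as the 0.7 similarity threshold, are illustrative assumptions.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")                 # e.g. a lead compound
library = {"cmpd1": "OC(=O)c1ccccc1O", "cmpd2": "CCOC(=O)c1ccccc1"}  # illustrative library

fp_query = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)
for name, smi in library.items():
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(fp_query, fp)
    if sim >= 0.7:                     # arbitrary similarity threshold
        print(name, round(sim, 2))
```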
Quantitative Structure–Activity Relationship (QSAR) models describe the mathematical relation between structural features and the target response for a set of compounds.187,188 The method starts from a set of active and inactive ligands, for which mathematical descriptors are computed. The subsequent step consists of generating a model that establishes the relationship between those descriptors and the experimental biological activity of the compounds. Finally, the model is applied to predict the activity of new compounds of interest.94
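A minimal QSAR sketch is given below, fitting a linear model that relates a few RDKit descriptors to (purely illustrative) activity values and then predicting the activity of a new compound. Both the toolkits and the descriptor choice are assumptions made for the example; real QSAR models use far larger training sets and proper validation.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.linear_model import LinearRegression

def descriptor_vector(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumRotatableBonds(mol)]

train_smiles = ["CCO", "CCCCO", "c1ccccc1O", "CC(=O)O"]   # illustrative training set
train_activity = [5.1, 5.8, 6.4, 4.9]                     # e.g. pIC50 values (made up)

X = np.array([descriptor_vector(s) for s in train_smiles])
model = LinearRegression().fit(X, train_activity)

# Predict the activity of a new compound of interest
print(model.predict([descriptor_vector("CCCCCO")]))
```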
1.4.4 Pharmacophores
Pharmacophore models are of fundamental importance in drug design when no structural data is available. IUPAC have defined a pharmacophore as ‘the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response. The pharmacophore can be considered as the largest common denominator shared by a set of active molecules’.189 Thus, a pharmacophore is a collection of structural properties relevant for biological activity, a purely abstract concept, rather than a real molecule. Pharmacophoric descriptors include hydrogen-bond donors and acceptors, hydrophobic, aromatic, acidic, basic and ionisable groups.
A pharmacophore model can be ligand-based, i.e. generated by superposing a set of bioactive molecules and extracting the common chemical features responsible for the biological activity, or structure-based, i.e. generated by determining the main interactions between the target and the active ligands.190 The latter can be obtained by analysing the structure of ligand–receptor complexes (either from crystal structures or from docking experiments), particularly the chemical features of the active site and the interactions with an active compound (Figure 1.8). The pharmacophore model must then fit the selected features (Figure 1.9).191
Generating a ligand-based pharmacophore model is more challenging and involves the following steps: (1) identifying the relevant properties, (2) superposing the molecules according to those properties, and (3) generating the pharmacophore model. The most demanding issue is the development of algorithms for effective molecular superposition, ensuring that a maximum number of chemical features overlap geometrically.192,193 This in turn incorporates the problem of conformational flexibility, which can be addressed by pre-enumeration (multiple conformations of each molecule are stored in a database) or by performing a conformational analysis during the pharmacophore modelling process whenever requested by the alignment algorithm (the so-called on-the-fly method).190,193 Once the ligands have been aligned, a pharmacophore feature map is extracted. A more general property definition increases the number of compounds matching the pharmacophore; this allows new compounds to be identified but also increases the rate of false positives.94,194
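The feature-identification step can be illustrated with RDKit's built-in chemical feature definitions, as sketched below. The molecule is an arbitrary example, and the superposition of multiple ligands, which is the actual bottleneck discussed above, is omitted for brevity.

```python
import os
from rdkit import Chem, RDConfig
from rdkit.Chem import AllChem, ChemicalFeatures

# Load RDKit's default pharmacophore feature definitions
factory = ChemicalFeatures.BuildFeatureFactory(
    os.path.join(RDConfig.RDDataDir, "BaseFeatures.fdef"))

mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # arbitrary example molecule
AllChem.EmbedMolecule(mol, randomSeed=42)                      # generate one 3D conformation

for feat in factory.GetFeaturesForMol(mol):
    # families include Donor, Acceptor, Aromatic, Hydrophobe, PosIonizable, NegIonizable
    print(feat.GetFamily(), feat.GetAtomIds(), feat.GetPos())
```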
Ligand-based pharmacophore modelling has become an essential computational strategy for drug discovery in the absence of structural information about the target. Several programs incorporate pharmacophore construction, such as Catalyst (part of Biovia Discovery Studio),195,196 Phase,197,198 accessible via Schrödinger's graphical interface Maestro,199 and MOE.69,200
1.4.5 Compound Optimisation
The last step of computational drug discovery involves modification of the hits in order to improve the biological activity by changing the chemical structure (the hit-to-lead process). This optimisation involves increasing the drug potency (two- or three-fold), selectivity and pharmacokinetic properties, including absorption, distribution, metabolism, excretion and potential for toxicity (ADMET).201,202
In order to increase the biological potency of detected hits, similarity searching employing pharmacophore models is a valuable tool.203
Focusing on ADMET properties, several filters can be employed to sort compound libraries, such as Lipinski's rule of five mentioned in Section 1.4.2. This set of rules relates to properties thought to be necessary for good oral bioavailability and mainly concerns drugs acting on eukaryotic receptors.204 However, when focusing for instance on antibiotics, the target might be located in the peptidoglycan matrix or on the outer surface of the inner membrane. Permeation through the inner lipid membrane is then not required to kill the pathogen, and Lipinski's rules are simply not obeyed.205 Therefore, different filtering rules may be needed depending on the particular biological target of interest. Moreover, several biologically active compounds violate more than one of Lipinski's rules, such as atorvastatin (Lipitor®) and montelukast (Singulair®),206 showing that automatic filtering might artificially remove potential leads.
Metabolic stability is a desirable drug property: when it is low, the efficacy of the drug diminishes and the risk of generating toxic metabolites increases. Cytochrome P450 enzymes (CYPs) are the major drug-metabolising enzymes, and the likelihood that a compound will be metabolised by, or inhibit, CYPs must be assessed.207,208 In terms of SBDD, off-target predictions (focusing on the available structures of CYPs) should be performed, aiming to determine the affinity of potential hits towards receptors other than the main biological target. Besides accounting for metabolism, off-target prediction may include any relevant human protein whose inhibition would lead to toxic side effects.
1.4.6 Software and Web Based Servers
In previous sections of this chapter, several programs were mentioned together with their sampling algorithms, scoring functions and the type of drug design scheme they support (SBDD or LBDD). Apart from those software packages, several web based servers for molecular docking and virtual screening are available. DOCK Blaster is a web server version of UCSF DOCK that allows subsets of the ZINC database to be screened.209,210 SwissDock allows ligands to be selected via a ZINC ID, a URL, an internal curated database, or an uploaded file.211,212 DockThor is an online receptor–ligand docking facility that allows receptor, ligand and cofactor structures to be uploaded.213,214 The PharmMapper server is an integrated pharmacophore matching platform for potential target identification (off-target binding), which is very useful when predicting the potential toxicity of developed hits.215,216 These are only some examples; many more web-based resources can be found.
Funding from the Swedish Research Council (LAE, AR), the Faculty of Science at the University of Gothenburg (LAE), the Wenner-Gren Foundations (SG) and the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007–2013) under REA grant agreement no. 608743 (PSM) is gratefully acknowledged.