Jihyun
Shim
and
Alexander D.
MacKerell, Jr.
*
HSFII, 20 Penn St., Baltimore, MD 21201, USA. E-mail: amackere@rx.umaryland.edu; Fax: +410 706 5017; Tel: +410 706 7442
First published on 18th April 2011
A significant number of drug discovery efforts are based on natural products or high throughput screens from which compounds showing potential therapeutic effects are identified without knowledge of the target molecule or its 3D structure. In such cases computational ligand-based drug design (LBDD) can accelerate the drug discovery processes. LBDD is a general approach to elucidate the relationship of a compound's structure and physicochemical attributes to its biological activity. The resulting structure–activity relationship (SAR) may then act as the basis for the prediction of compounds with improved biological attributes. LBDD methods range from pharmacophore models identifying essential features of ligands responsible for their activity, quantitative structure–activity relationships (QSAR) yielding quantitative estimates of activities based on physiochemical properties, and to similarity searching, which explores compounds with similar properties as well as various combinations of the above. A number of recent LBDD approaches involve the use of multiple conformations of the ligands being studied. One of the basic components to generate multiple conformations in LBDD is molecular mechanics (MM), which apply an empirical energy function to relate conformation to energies and forces. The collection of conformations for ligands is then combined with functional data using methods ranging from regression analysis to neural networks, from which the SAR is determined. Accordingly, for effective application of LBDD for SAR determinations it is important that the compounds be accurately modelled such that the appropriate range of conformations accessible to the ligands is identified. Such accurate modelling is largely based on use of the appropriate empirical force field for the molecules being investigated and the approaches used to generate the conformations. The present chapter includes a brief overview of currently used SAR methods in LBDD followed by a more detailed presentation of issues and limitations associated with empirical energy functions and conformational sampling methods.
Drugs typically exert their pharmacological effects by specific interactions with their target proteins. Such unique interactions have been understood as “lock-and-key”,13 “induced-fit”,14 “conformational selection” or “population shift” hypotheses,15–19 which are based on the inherent chemical structure of molecules, their dynamic conformational properties and how those two influence the receptor. Therefore, identifying any causation or correlation between structures and activities, referred to as a structure–activity relationship (SAR),20–22 can be of utility for ligand design. LBDD based SAR identifies similarities and/or differences in structural or physicochemical properties of compounds and relates them to activity, including efficacy (eg. activation or stimulation of receptors, Vmax of enzymes), affinity (eg. Ki), selectivity (eg. Ki, isoform1/Ki, isoform 2), pharmacokinetics (ADME),23,24drug-drug interactions, or any biological properties of interest. Various descriptors of the ligands are related to biological activities through various statistical methods, for instance, regression, classification, dimension reduction, variable selection, etc. from which important features of the ligands responsible for activity are identified and used to develop new leads or to optimize known ligands.
Three major categories of LBDD are quantitative structure activity relationship (QSAR),25–27 pharmacophore modeling,28–32 and similarity searching.33–36 Over several decades, statistics, computational algorithms, and descriptors comprising the three categories and their pipelining have led to significant improvements both in efficiency and accuracy. Programs can deal with 100–1000s of molecules to build models or search molecular properties against databases of millions compounds in a short period of time. Overall improvements have been achieved by sophisticated data mining techniques and by more accurate mathematical descriptions of molecules through molecular mechanics (MM)37 and quantum mechanics (QM) methods.38
Recent advances in statistical, algorithmic and chemoinformatics in relation to LBDD have been discussed elsewhere in depth.39–42 This chapter will briefly overview LBDD followed by a detailed presentation of new developments in the areas of conformational sampling and force fields (FF) with respect to LBDD.
Development of SAR models often requires pre-processing of the descriptors prior to model development. Values are often normalized such that coefficients obtained from fitted models represent the significance of the individual descriptors. Importantly, given the large number of possible 1D/2D descriptors it is necessary that the number of descriptors used during model development be limited by selection methodologies.50,51 In simple terms highly correlated descriptors are typically removed from model development. Descriptor selection is then often linked to model development itself. Those descriptors most predictive of a target property are selected by iterative analysis (stepwise multiple linear regression (MLR),52 replacement method53) or by learning algorithms (Genetic algorithm,54 adaptive fuzzy partition algorithm,55 Gaussian processes,56 or Genetic function approximation (GFA)57). Correlated or redundant descriptors may also be eliminated by partial least square (PLS)58–61 or principle component analysis (PCA).62,63
Given a suitable training set (i.e. set of compounds with known biological activities) and descriptors, one applies statistical methods according to the characteristics of the data set. When linearity is present MLR and PLS are good choices. Nonlinearity64,65 occurs when one handles a large number of non-homologous data sets, for example in pharmacokinetic (PK) studies where multiple biological phenomena such as absorption and metabolism can impact the biological data, or when activities are influenced by many factors such as receptor dimerization, existence of receptor isoforms, and conformational changes. Selection (or design) of the experimental data for model development can minimize nonlinearity, thereby reducing possible ambiguities in the developed QSAR but users need to keep in mind the inherent complexity of biological phenomena. Selection of the appropriate statistical methods also depends on whether the goal of the study is interpretation or predictability. Classical methods such as MLR produce explicit physical meaning but predictability is not as good as using modern statistical tools that improve predictability. However, interpretability is often compromised with improved predictability. While the MLR method, which maximizes interpretability, was the most used method in 2008, it was followed by PLS & support vector machine (SVM) approaches that yield improved predictability. Newly developed statistical methods39 include Gene Expression Programming (GEP),66 Project Pursuit Regression (PPR),67 Local Lazy Regression (LLR)68 while recent variations in QSAR approaches are hologram-QSAR,69 auto-QSAR,70 and inverse-QSAR71 among others. Hologram-QSAR69 partitions molecules into smaller fragments and uses size, length, and as well as additional information on those fragments as descriptors. It is usually combined with PLS to derive a QSAR. Auto-QSAR70 is an automated QSAR where the best descriptors, the best statistical methods, and validations are chosen for given set of molecules and updated as the number of molecules in training set increases. In Inverse-QSAR71 after a QSAR model is built, distributions of descriptors yielding optimal activity are estimated and structures are generated or searched that match those distributions.
As a last step in QSAR model development, the models require validation.72,73 Approaches to do this include cross-validation, y-randomization, or external test set. It is generally perceived that leave-one-out or leave-n-out cross-validations do not necessarily indicate predictability directly and external validation using compounds not included in model development is recommended to verify predictability of the developed models. An overview of statistical methods used in QSAR analysis is given in Table S1 of the supporting information.
Method | Description |
---|---|
CoMFA | Probes placed on grid points in the 3D field around a molecule experience an interaction energy with the ligands that defines the molecular shape and electrostatic properties in the surrounding environment. |
CoMSIA | It expands CoMFA by including hydrophobic and hydrogen bonding contributions and calculates how these contributions are similar between molecules. |
GRIND | It eliminates alignment-dependency by using distances between 3D grid points. Highly relevant regions among a set of molecules are selected as nodes and the intensity of molecular interaction field at those nodes are used as descriptors. |
VolSurf | Information on 3D grid voxels (shape, electrostatic, volume) are compressed into 2D numerical descriptors by image analysis tools. |
4D-QSAR | Multiple conformations in a grid box generate the occupancies at grid points, with those occupancies used as the descriptors. |
5D-QSAR | Multiple hypothetical binding pockets are generated around ligands based on a 3D grid and the receptor models are evolved by GA with the most favorable binding pocket model evaluated by relative free energy of ligand binding. |
6D-QSAR | Includes optimization of structures in aqueous solution and calculates solvation energy and charges by semi-empirical QM method, AMSOL.72 Ligands' arrangement in pseudo-binding pocket is determined by MC simulation. |
Of the methods in Table 1, CoMFA and CoMSIA are the most widely used. Their main limitations are their dependency and sensitivity to conformations and alignments of the molecules under study. Different occupancies by different conformations or changes in molecular alignments can cause different interaction fields yielding different QSAR models. To overcome the limitations, alignment-independent 3D-QSAR was developed by transforming 3D-grid data into 2D descriptors such as GRIND (Grid independent descriptors)84,85 and VolSurf.84 The approaches are listed and summarized in Table 1.
Various methods are employed to generate multiple conformations of a ligand. Systematic search approaches115–117 formally perform an exhaustive sampling of conformational space, thereby covering the whole energy surface. However, as the degrees of freedom increases the number of possible conformations becomes enormous and often includes non-physical structures. Systematic search procedures therefore often limit the number of conformations accessed and select conformations within a user-defined energy difference range. For example, MOE supports a maximum of up to 10000 conformers and each one is subjected to trial model buildup. An alternative to systematic sampling is Monte Carlo based approaches118–120 where random changes in structures (i.e. trial moves) are attempted by rotation about a dihedral angle or other geometric change with the new conformations associated with the trial moves accepted or rejected according to the Metropolis criterion.119,120 In the Metropolis method a conformation with an energy, ΔE, lower than that of the previous conformation is accepted while conformations with higher, less favorable energies are kept based on the acceptance probability, p = e−ΔE/kT where k is the Boltzmann constant, T is the temperature and p is compared to a random number. If p is greater than the random number the conformation is accepted, allowing higher energy conformations to be sampled. In this method energy barriers are easily overcome by increasing the effective temperature but random elements still exist during sampling leading to inefficiencies due to similar conformations being accessed. The Poling method121 adds a penalty function (poling function) to the energy during the conformational search that is inversely proportional to the root mean square distance between conformations so sampling of similar conformation is avoided while accessing new conformation is maximized. With the same goal as the Poling method, Tabu search122 keeps a record of previous sampled states thereby maximizing the exploration of previously unsampled conformations. When GA123,124 is applied for conformational sampling, each chromosome is a conformer and contains genes corresponding to structural degrees of freedom in the molecule. Chromosomes undergo mutations and crossover resulting in the sampling of diverse conformations in the descendant conformations. Molecular dynamics (MD) simulations sample conformations deterministically according to Newton's equations of motion and overcome trapping in local minima due to the inclusion of kinetic energy in the system, as described below. All of the methods mentioned in this paragraph have advantages and disadvantages. For small molecules systematic search algorithms in combination with an accurate force field, as discussed below, can assure that all relevant conformations are taken into account. With larger molecules, exhaustive sampling of all accessible conformations is not feasible, MC or MD methods allow for extensive sampling of accessible conformations, though care must be taken to assure that all the relevant conformations are being sampled. One outstanding advantage of both MD and MC methods is that a variety of methods that allow for detailed representation of the biological environment of a molecule by, for example, the explicit treatment of waters and ions have been developed for these approaches. Given the wide use of MD methods for conformational sampling as well as for studies of the dynamics of molecules ranging from small ligands to large macromolecular complexes containing 1 million or more atoms, the remainder of the discussion of conformational sampling will focus on MD based approaches.125–129
MD simulations126,130,131 are based on Newton's equations of motions. The second law F = ma states that from position ri(t), velocity vi(t), and mass mi for an atom i at time t, force Fi(t) can be calculated. In MD simulations, forces are usually obtained from analytical derivatives of the potential energy function and integration methods are used to obtain new positions ri(t + δt) and vi(t +δt) from the previous states, ri(t), vi(t), and ai(t). For example, the original Verlet integrator132 uses a Taylor series expansion of position. Summing eqn (1a) and 1b yields 1c which determines the new position. This integration minimizes memory requirements as it is not necessary to store velocities, although they can readily be calculated using eqn (1d) if required.
(Equation 1a) |
(Equation 1b) |
r(t + δt) = 2r(t) − r(t − δt) + a(t)δt2 | (Equation 1c) |
(Equation 1d) |
Leapfrog Verlet integration133,134 uses an expansion of positions and velocities to the second order and an interval 1/2δt instead of δt. Subtraction of eqn (2b) from 2a yields 2c. New positions may then be obtained by substituting v(t) from rearrangement of 2a into v(t) of 1a and truncating the expansion at the velocity term yielding eqn (2d).
(Equation 2a) |
(Equation 2b) |
(Equation 2c) |
(Equation 2d) |
Extended integration methods have been developed to enhance accuracy and provide special features for the simulation system of interest.135 For example, for simulations of aqueous solutions it may be desirable to reproduce constant pressure, temperature, or volume in accord with the specific ensemble being targeted. For example, simulations in the constant pressure, temperature and number of particles ensemble (NPT) may be used to calculate Gibbs free energies while a constant volume, temperature and number of particles ensemble (NVT) yields Helmholtz free energies. However, such simulations require the appropriate boundary conditions, such as periodic boundaries.135 While these are necessary for sampling the conformations of ligands in the presence of explicit solvent, for LBDD MD simulations are typically performed in the absence of explicit solvent. Such simulations may be performed in the “gas phase” or using implicit solvent models to treat the solvent environment, as detailed below.
High temperature MD has long been used to facilitate the crossing of high energy barriers to assure a broad sampling of conformational space. In high temperature MD, the probability of particles having the necessary velocity (or kinetic energy) to cross energy barriers is increased over room temperature simulations. While sampling in MD is driven by information on the molecular forces thereby guiding conformational sampling to physically meaningful regions, unwanted sampling may occur in high temperature MD. This is due to the high temperature leading to sampling of conformations that are inaccessible at room temperature thereby causing inefficient use of computational effort. However, care to avoid excessively high temperatures can minimize this problem and a number of protocols, referred to as simulated annealing,136 perform high temperature MD followed by room temperature MD to assure that conformations relevant to the latter are being sampled.
A simple way to improve sampling via MD simulations is to perform multiple simulations of the system starting with different initial random number seeds to assign the velocities to the particles in the system. Typically, a Gaussian distribution of velocities are randomly generated and assigned to each particle with those initial velocities satisfying a Maxwell-Boltzmann distribution defining a selected temperature. While the overall velocity distribution is approximately reproduced with the different random number seeds yielding the same macroscopic temperature, the individual atoms have different velocities, thereby directing the molecule to sample different conformations. However, this approach does not always avoid kinetic trapping for larger molecules due to large barriers often associated with large conformational changes.
Methods that go beyond the use of high temperature and multiple MD runs are referred to as generalized ensemble (GE) algorithms.137–139 These include replica exchange MD (REXMD),140meta-dynamics,141,142 accelerated MD (AMD)143 and λ-dynamics144 among others. In GE algorithms energy barriers are overcome by adding an external biasing potential(s) to the system. This may be performed by accessing additional conformations from additional simulations, as in REXMD, or by approaches that directly modify the free energy landscape of the system. Many GE MD simulation approaches sample the free energy landscape efficiently and may be used to calculate accurate free energy differences. The free energies are often calculated by thermodynamic integration (TI)145 or the weighted histogram analysis method (WHAM).146
Standard REXMD involves parallel independent simulations (replicas) at a range of temperatures and exchanges conformations between replicas according to an exchange probability (eqn (3)).
(Equation 3a) |
(Equation 3b) |
(Equation 3c) |
In eqn (3)Ti indicates the temperature of replica i, qi is the configuration of replica i at the point of exchange, and Ei represents Hamiltonian energy of replica i. The main idea behind REXMD is that one MD trajectory in a local minima can take conformational information from another replica which may be found in another region of conformational space (e.g. across an energy barrier) but have similar energies. The probability of exchange between replicas is such that it enforces sampling of a Boltzmann distribution of conformations, thereby satisfying a proper thermodynamic ensemble as defined by the simulation conditions. Implementation of REXMD is not straightforward, with issues including how to set up the proper temperature spacing, the number of replicas, and the exchange frequency. For large systems with explicit solvent, REXMD requires a large number of replicas and small temperature difference between adjacent replicas to achieve an acceptable exchange ratio. To overcome this implicit solvent models, as discussed below, may be used thereby allowing for a significant increase in the difference in temperature between adjacent replicas. For example, in the presence of explicit solvent replicas may have temperature differences of 10 K while when implicit solvent is used 30 K may be used. In addition, hybrid methods have been developed.147 Concerning, exchange frequency, higher exchange frequencies typically lead to enhanced sampling.148 However, care must be taken as although high temperature enhances barrier crossing, it may shift the equilibrium between two states and make high temperature states more favorable throughout all replicas.
Hamiltonian replica exchange molecular dynamics (HREXMD)149–151 overcomes drawbacks of REXMD by scaling the potential energy function (i.e. Hamiltonian) rather than the temperature (eqn (3c)). Perturbation of the Hamiltonian can involve almost any term in a force field such as the peptide backbone conformational energies, dielectric constant or ligand-solvent interactions. HREXMD needs a lower number of replicas than REXMD since the perturbation is applied locally on selected components of the system. Generally the perturbation is expressed as a function of an order parameter λ in
H(λi) = λiH1 + (1 − λi)H0 | (Equation 4) |
There are single MD approaches using dynamic variations of λ. Lambda-dynamics144,152 implements λ as an artificial particle that is propagated during the MD simulations thereby sampling various λ values as dictated by the free energies of the system without the need of using pre-defined λ values. The Hamiltonian is expressed by adding two more terms for dynamic λ variables to H(λi), as shown below, and {λi} is used since multiple λs can be used to perturb different components of the Hamiltonian such as electrostatic or dihedral energies. In the approach the extended Hamiltonian, comprised of the standard, ground state Hamiltonian and the perturbations associated with the λ terms is defined as
(Equation 5) |
Another single dynamics GE method, Metadynamics141,142,153 uses a history-dependent biasing potential to force selected degrees of freedom (e.g. collective variables, CV) of the system being sampled away from conformations visited frequently. This is performed by “lifting” low energy regions with a biasing potential as those regions are being sampled, thereby facilitating conformational changes away from the low energy regions. The biasing potential is the sum of Gaussian functions, VG, that are used to fill valleys of the free energy surface as defined as follows
(Equation 6) |
Taking a similar strategy, the orthogonal space random walk (OSRW) algorithm154,155 is another efficient way of conformational sampling. This strategy simultaneously perturbs the order parameter space (general term for λ or CV above) and generalized free energy space to overcome not only local minima trapping but also lagging of changes in the environment surrounding the CV or λ required for conformational changes to occur. OSRW uses 2D Gaussian-shaped repulsive potentials to flatten the free energy surface and avoid often-visited states. After searching the whole conformational space it is possible to select accessible conformations by order parameters associated with the conformational change.
Another approach is accelerated MD (AMD).143AMD is a simple but efficient sampling method that has shown good performance for biomolecules. Compared to metadynamics and OSRW, it uses a simpler form of the biasing potential (eqn (7)). When applying the method, the boost energy E and α, which is a tuning factor for the biasing potential's well-depth, need to be pre-defined and the biasing potential is applied when the potential energy is less than E.
(Equation 7) |
REXMD and HREXMD have been successfully used for conformational sampling of flexible ligands,113,156–158 while the other GE algorithms have been used mainly in biomolecules to date. However, considering their success in conformational sampling, problems involved in flexible protein loops and ligand passage in receptors,159,160 it is anticipated that they will be of utility to conformational sampling in LBDD.
In addition to the sampling algorithms, an important consideration is the role of solvent in conformational sampling. Conformational changes and sampling are dependent on the surrounding environment of all molecules such that energetically favorable conformations in gas phase may not be favorable in solution (or a receptor binding site) and vice versa. This occurs due to COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compoundwater competing for favorable intramolecular interactions, for example, by disrupting intramolecular hydrogen bonds or reduced intramolecular dipole–dipole interactions. Alternatively, COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compoundwater can impact the orientation of hydrophobic groups, which may remain “accessible” in the gas phase, but cluster together in the presence of solvent. Therefore sampling conformations in the presence of explicit COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compoundwater molecules is ideal when studying biological systems but such calculations are computationally expensive due to the presence of additional particles in the system. In addition, the viscosity of the water surrounding a molecule can slow conformational sampling during MD simulations, thereby further increasing the computational costs. An effective alternative to explicit solvent are continuum solvent models that allow for distributions of conformations to be obtained that approximate an explicit solvent environment while allowing for efficient sampling by avoiding the increased number of particles and the viscosity issues with COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compoundwater. A large number of implicit solvent models are available and recent reviews on the topic have been presented.161–163 Simple implicit solvent models use constant or distance-dependent dielectric constants. Another approach is using solvation parameters for each atom based on their solvent accessibility so that particles in the system have varying responses to environments. More accurate solutions to solvation effects are obtained by the Poisson–Boltzmann (PB)164,165 or Generalized Born (GB)166 models. In MD, the most widely used methods are GB models due to the high efficiency of their analytical solution and comparable accuracy with respect to PB models that typically require numerical solutions. GB models calculate the electrostatic component of the solvation free energy as shown in eqn (8), with that term added to the total energy.
(Equation 8) |
In eqn (8) ε is the dielectric constant of solvent, qi is the partial atomic charge of atom i, rij is the interatomic distance between atom i and j, and Ri, which is the most important parameter in GB models, is the effective Born radii of atom i. The effective Born radii can be understood as an atom's degree of burial within the solute or radius exposed to the solvent environment. A number of GB models have been developed such as GBMV167 and GBSW168 in CHARMM169,170 and GBOBC162 in AMBER171,172 which differ primarily in the way that the effective Born radii are calculated.
The final portion of this section presents a simple example of sampling the conformational space of a peptide using three MD-based sampling approaches based on the CHARMM22/CMAP protein force field.173Fig. 1 shows the extent of sampling attained by MD in both vacuum and explicit solvent (Fig. 1a and 1b) and by two sampling methods (single MD vs. HREXMD using explicit solvent, Fig. 1b and 1c) for the flexible opioid pentapeptide, Leu-enkephalin. For the explicit solvent HREXMD, λ = 0, 0.14, 0.19, 0.27, 0.37, 0.52, 0.72, and 1.0 where 1.0 represents a CHARMM phi, psi energy surface that is flat for non-glycine residues, as implemented using the CMAP tool.173 Simulations were performed using the REPDSTR module in CHARMM for 10 ns, attempting exchanges every 0.5 ps for the HREXMD. The range of conformational sampling was measured by the 2D probability distribution of the distance between two aromatic rings (A and B) and the angle between two aromatic rings and N-terminal nitrogen (N). Additional details of the simulation methodology are included in the supporting information. Comparison of Fig. 1a and 1b show that the sampling of conformational space by MD differs in gas phase vs. explicit solvent. In explicit solvent, structures with longer AB distances and larger ANB angles are being sampled. These represent more extended structures due to the presence of solvent; in the gas phase the peptide folds back on itself leading to more compact structures as no solvent is available to compete for intramolecular interactions. As expected, HREXMD (1c) samples a similar range of conformations as the standard explict solvent MD (1b), but the extent of sampling is more complete using the same amount of simulation time. The more complete sampling by HREXMD is due to the simulation overcoming an energy barrier in the vicinity of 10–12 Å for the AB distance and 60–100° for the ANB angle. Although this example is not a rigorous test from which better performance can be proved based on more efficient sampling, it points out the importance of the simulation method and solvent environment when performing conformational sampling.
Fig. 1 2D-probability distribution of AB distance-ANB angle pair of Leu-Enkephalin. Pharmacophoric point A represents the centroid of the aromatic ring of COMPOUND LINKS Read more about this on ChemSpider Download mol file of compoundtyrosine, B is the centroid of the aromatic ring of COMPOUND LINKS Read more about this on ChemSpider Download mol file of compoundphenylalanine, and N is the basic nitrogen. a) through c) compares different sampling of conformational space by a) gas phase MD, b) explicit solvent MD and c) explicit solvent HREMD. Simulation details are in the supporting information. |
An important consideration when performing conformational sampling is the extent of convergence; have all the accessible conformations of the molecule been sampled? When conformational sampling is done by MD, convergence of sampling may be checked by continuing the simulation time and testing if additional conformations are being sampled. If additional conformations are not being sampled the sampling may be considered converged. For conventional MD, root mean square deviations of overall structure, distance between atoms or functional groups (as in Fig. 1), or torsions may be used for simple evaluation of convergence. Alternatively, differences in probability distributions between two intervals of a trajectory (e.g. the first and second half) can also indicate if the simulation has reached convergence. For GE methods, convergence of the calculated free energy surface indicates adequate sampling. However, it should be emphasized that the appearance of convergence does not necessarily mean that true convergence has been attained. There is always the possibility that a molecule, especially more complex molecules such as polypeptides, may have access to significantly different conformations than those accessed in the performed simulations.
A force field consists of a potential energy function and the associated parameters that allow the energy and forces to be calculated as a function of the molecular structure and conformation. Potential energy functions used in molecular mechanics typically include terms for bond stretching, angle bending, rotation around bonds (dihedral or torsion angles), out of plane motions (improper angles), and non-bonded interactions (electrostatic and van der Waal energies). Such force fields are referred to as Class I models. In Class II force fields crossterms to treat correlation between bonds and angles, angles and torsions, and so on are included and different variations of the nonbond terms may be used. Eqn (9) and 10 show examples of potential energy functions in the two classes, which are classified by their simplicity and potential transferability. Detailed descriptions of the example functional forms are found in Brooks et al.169 for the class I force field in CHARMM and in Plimpton174,175 for a class II FF. Class I FFs include those specialized for biomolecules (see below). The simple functional form shown in eqn (9) is computationally efficient allowing them to handle macromolecules in aqueous or other condensed phase environments. Eqn (10) is a typical energy function used in a Class II FF and employs more complex form which facilities (but doesn't necessarily dictate) transferability across a wider range of molecules.
(Equation 9) |
In eqn (9) b, θ, S, n, φ, δ, and ω represent the bond length, angle, distance between 1–3 atoms, multiplicity, torsion, phase, and improper angle, respectively, and the subscript zero indicates the equilibrium geometry parameter. Kx are the associated force constants. The Lennard-Jones 6–12 equation is commonly used to model the van der Waals energy between atom i and j, where εij, rij, and Rmin,ij are the well-depth, interatomic distance, and interaction distance at minimum the energy between atoms i and j, respectively. In the electrostatic or Coulombic term, qi and qj are partial atomic charges, and ε is the dielectric constant.
For eqn (10) the same symbols are used as in eqn (9) and additionally b′ and θ′ used in the crossterms represent the second bond or angle associated with the cross interaction.
(Equation 10) |
To move from a potential energy function to a FF requires determination of the values of the parameters, a process referred to a parameter optimization or parametrization. Parameters to be optimized in eqn (9) and (10) include force constants, equilibrium geometries, partial atomic charges, well depth and interaction distance at minimum energy (Rmin,ij) and so on. The goal of parameter optimization is reproducing a collection of quantum mechanical and/or experimental observables for the ligands of interest.
Parameters for macromolecules such as proteins, nucleic acids, lipids and carbohydrates have been paid special attention and optimized extensively in Class I FFs such as AMBER(Assisted Model Building with Energy Refinement),171,172 CHARMM(Chemistry at Harvard Molecular Mechanics),169,170 GROMACS176 and OPLS(Optimized Potential for Liquid Simulations).177,178 Subsequently, parameters for drug-like molecules were added to be compatible with the individual biomolecular FFs while maintaining the accuracy of the “parent” biomolecular FF. To achieve this, the parent and related small molecule parameters use the same form of the potential energy function and strategy used for optimization of the FF. This is necessary to provide consistent and balanced energy and force evaluations during simulations of small molecule-biomolecular complexes. For example, the CHARMM General FF (CGenFF)179 follows the standard optimization procedure of the CHARMM additive biomolecular FF.180
As an example of parameter optimization the approach used in the CHARMM additive force field for small molecules, with which we are intimately familiar, will be used. Typically, optimization of CGenFF parameters is performed as follows. Force constants, equilibrium bond lengths and valence angles are parametrized to reproduce experimental or QM vibrational frequencies and geometries. Dihedral angle force constants, phases, and multiplicities are optimized targeting QM potential energy scans or spectroscopic data such as NMR J coupling constants. Charges are optimized by evaluating optimal distances and interaction energies of water interaction with the drug-like molecule based on QM calculations as well as dipole moments and optimization of LJ parameters is guided by the reproduction of pure solvent or crystal experimental data. To date, CGenFF includes approximately 150 atom types, 400 bond, 1200 angle, 3000 torsion parameters explicitly optimized based on 500 model compounds. When extending the force field to new chemical entities, parameters for the new molecules not already available in the CGenFF may be assigned by analogy for the bond/angle/dihedral and LJ terms, while determination of the partial atomic charges is based on a bond-charge increment algorithm extended to included angle- and dihedrals increments that has been trained to reproduce CGenFF charges for over 500 model compounds (K. Vennomeslaegh and A.D. MacKerell, Jr., work in progress). A web-based utility in the context of the ParamChem project is available to perform these functions. An important feature of CGenFF when automatically assigning parameters is information about the quality of the assigned parameter based on a penalty score. This is important as the ability of parameters to be transferred between molecules in the context of empirical force fields is limited, as shown below, and it allows users to know which parameters require validation and further optimization to obtain the required level of accuracy. However, as with conformational sampling, even in cases where the parameters are directly transferred to a new molecule, the possibility that those parameters may not perform with adequate accuracy exists, such that the aware user is advised to perform validation tests of the transferred parameters, as previously described.179
Beyond CGenFF there are a number of other small molecule FFs designed to be compatible with biomolecular FFs. GAFF (General AMBER FF)181 was developed for the simulation of pharmaceutical compounds with the AMBER biomolecular FF. It is based on QM optimization of about 3000 model compounds and geometric information from the Cambridge structure database (CSD).182 It has 57 atom types, 700 bond length parameters, 3000 angle parameters, and 500 dihedral angle parameters. Beyond these available parameters, the Antechamber toolkit183 is used to assign parameters for novel molecules. SwissParam is a web-based utility used to generate CHARMM consistent parameters for ligands. It takes internal energy parameters and charges from MMFF (Merck Molecular FF)184,185 while cubic and quadratic terms for bond, angle, and improper energies that are present in the Class II force field are truncated as required for use with the Class I CHARMM additive FF. In addition, van der Waals energy parameters are from the CHARMM additive FF based on atom type similarity. However, it should be noted that the nonbond parameters being derived in a different manner than that of the parent biomolecular FF make them formally incompatible with the CHARMM biomolecular FF. OPLS-AA (All Atom) emphasizes parameters to reproduce the conformational energetic and condensed phase properties of small molecules for use in biological environments. Initial parameters were adopted from the OPLS-UA (united atom), AMBER, and CHARMM FFs and 50 model compounds were optimized focusing on torsion and non-bonded parameters.177 OPLS-AA uses experimental liquid properties as target data during parameter optimization. Thus, Class I biomolecular FFs have been extended to include parameters for a range of small molecules though the extent of chemical space covered and the quality of the parameters for those molecules vary significantly.
Class II FFs were initially designed to treat a wide range of small molecules. Examples include CFF/CVFF (consistent valence FF),186MM2 (molecular mechanics),187MM3,188MM4,189MMFF94,184 and Tripos 5.2 FF.190 These FFs are typically not optimized with respect to interactions with the environment, with the exception of MMFF, limiting their applicability. In general, Class II force fields were optimized to reproduce geometries, vibrational spectra and conformational energies in the gas phase, with the various cross and higher order terms in the energy functions (eqn (10)) included to allow for both better reproduction of those properties as well as facilitate transferability of the parameters to a wider range of compounds. MMFF94 is currently one of the most widely used FFs for small molecules and is available in numerous LBDD software packages for small molecule simulations. Its goal is broad applicability and QM data for over 3000 molecules and condensed phase data for 2800 CSD compounds were used to optimize and validate the parameters. Allinger and coworkers have developed the MM1-4 FF series achieving high accuracy for organic molecules with respect to geometries, conformational properties and heats of formation. Upon going from MM1 to MM2, the MMx series shifted to a simpler form of Class II FF and MM3 lead to further improvements by including more model compounds, additional experimental data, and higher energy conformations during parameter optimization. MM4 represents a further extension of a Class II FF due to the inclusion of four-fold torsional energy terms, torsion-improper-torsion cross terms, bond-torsion-bond cross terms, two torsion-bond cross terms for central and terminal bonds each and so on. MM4 was optimized targeting thermodynamic quantities ΔH, ΔS, ΔG, and geometries from QM calculations or experimental spectroscopic data. All of these class II FFs are primarily utilized for organic compounds in the gas phase, though MMFF94 has seen limited use in macromolecular condensed phase simulations.
Use of a FF for energy evaluation, energy minimization, MD simulation or other sampling approach represents a significant, important step forward in most modern LBDD studies and the quality of the FF plays an important role in the outcomes of such studies. As emphasized in the preceding paragraph, the various FFs were optimized targeting a training set of molecules. Accordingly, each FF may be anticipated to reproduce the energies and forces of the molecules in the respective training sets with reasonable accuracy. However, the question of transferability remains such that how accurate is the treatment of a molecule not in the training set originally used to optimize the FF. While a full investigation to address this issue represents a significant challenge, two examples of the transferability of selected FFs will be given targeting QM dihedral potential energy scans of COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compounddimethyltryptamine and dimethylamino[1,4]diazepine which are analogues of COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compoundserotonin and COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compoundclozapine.
Fig. 2 shows dihedral potential energy surface generated by MMFF, SwissParam, CGenFF, and, in 2b, CGenFF after additional parameter optimization along with QM data obtained at the MP2/6-31G* level using the Gaussian03 program.191 Dihedral angles, shown as curved arrows in the figure, were rotated in 15° increments and the geometries were optimized at each step. MMFF, SwissParam and CGenFF parameters were input into CHARMM and geometries were minimized to an RMS force of <10−6 kcal mol−1/Å. With COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compounddimethyltryptamine, MMFF and SwissParam underestimated the height of energy barriers and CGenFF had a different peak shape at 45° and 330°. Since SwissParam adopted parameters from MMFF, its energy surfaces were similar to that of MMFF, though not identical. This is due to the different representations of the nonbond terms, which contribute to the energy surfaces and emphasize the problem with mixing parameters from different force fields. Overall, the shape of the surfaces for COMPOUND LINKS
Read more about this on ChemSpider
Download mol file of compounddimethyltryptamine are acceptable for all three FFs. However, results for dimethylamino[1,4]diazepine emphasize that caution needs to be taken as when transferring parameters to new compounds (Fig. 2b and Table 2). The MMFF energy surface has local minima around 150° and 270° which will lead to errors in conformational sampling; similar problems are present with the parameters generated by SwissParam. When energies as a function of conformation are incorrectly represented the conformations selected from the sampling approach will typically be incorrect or the probabilities of those conformations improperly represented. Initial parameters from CGenFF showed poor agreement with the QM PES, but the FF is in significantly better agreement following optimization of selected dihedral angle parameters. The results with all the tested FFs indicate the limited ability to transfer parameters to new molecules. An advantage of CGenFF is that penalty values are provided for each parameter assignment. This alerts the user to possible limitations in the FF, such as occur in dimethylamino[1,4]diazepine's conformational energy. In such cases validation of the parameters and additional optimization should be performed as required. Efforts to extend the ParamChem web server to include an automated interface for parameter validation and optimization are ongoing (K. Vennomeslaghe, S. Pamidighantam, M. Sheetz and A.D. MacKerell, Jr. Work in progress).
Fig. 2 Comparison of conformational energy surface. A) is potential energy surface (PES) of COMPOUND LINKS Read more about this on ChemSpider Download mol file of compounddimethyltryptamine and B) shows that of COMPOUND LINKS Read more about this on ChemSpider Download mol file of compounddimethylamino-dibenzo[1,4]diazepine. |
FF | RMSD | |
---|---|---|
COMPOUND LINKS Read more about this on ChemSpider Download mol file of compounddimethyltryptamine |
COMPOUND LINKS Read more about this on ChemSpider Download mol file of compoundclozapine |
|
MMFF | 0.87 | 5.04 |
SwissParam | 1.42 | 5.42 |
CGenFF | 0.62 | 12.84 |
CGenFF* | n.d. | 3.59 |
For proper CSP modelling, accurate FF and efficient conformation searching are needed. To date, MMFF and CGenFF have been used successfully. Conformational sampling has used extended MD simulations alone at both room and high temperatures112–114 and temperature REXMD simulations in implicit solvent.113,158,192Fig. 3 shows the general procedure used in CSP modelling. Once conformations are pre-enumerated for the training set of compounds, which may be performed using any of the above sampling methods, pharmacophore development is performed in an automated, computationally feasible fashion. As for identification of pharmacophore features, aromatic ring, ionizable groups, or hydrogen bond donors and acceptors can be identified, the associated probability distributions between the features calculated and, based on the extent of overlap of those distributions, the various combination of overlaps iteratively regressed against biological data, with those features yielding the best correlation with experimental data used for further model development. Notably, the CSP method can be readily combined with physicochemical descriptors to further facilitate model development.158,192
Fig. 3 Schematic diagram of CSP procedures. |
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c1md00044f |
This journal is © The Royal Society of Chemistry 2011 |