A Machine Learning Based Intramolecular Potential for a Flexible Organic Molecule

A Machine Learning Based Intramolecular Potential for a Flexible Molecule. One limitation of the accuracy of computational predictions of protein–ligand binding free energies is the fixed functional form of the intramolecular component of the molecular mechanics force fields. Here, we employ the kernel regression machine learning technique to construct an analytical potential, using the Gaussian Approximation Potential software and framework, that reproduces the quantum mechanical potential energy surface of a small, flexible, drug-like molecule, 3-(benzyloxy)pyridin-2-amine. Challenges linked to the high dimensionality of the configurational space of the molecule are overcome by developing an iterative training protocol and employing a representation that separates short and long range interactions. The analytical model is connected to the MCPRO simulation software, which allows us to perform Monte Carlo simulations of the small molecule bound to two proteins, p38 MAP kinase and leukotriene A4 hydrolase, as well as in water. We demonstrate that the accuracy of our machine learning based intramolecular model is retained in the condensed phase, and that corrections to absolute protein–ligand binding free energies of up to 2 kcal/mol are obtained. Abstract One limitation of the accuracy of computational predictions of protein–ligand binding free energies is the ﬁxed functional form of the intramolecular component of the molecular mechanics force ﬁelds. Here, we employ the kernel regression machine learning technique to construct an analytical potential, using the Gaussian Approximation Potential software and framework, that reproduces the quantum mechanical potential energy surface of a small, ﬂexible, drug-like molecule, 3-(benzyloxy)pyridin-2-amine. Challenges linked to the high dimensionality of the conﬁgurational space of the molecule are overcome by developing an iterative training protocol and employing a representation that separates short and long range interactions. The analytical model is connected to the MCPRO simulation software, which allows us to perform Monte Carlo simulations of the small molecule bound to two proteins, p38 MAP kinase and leukotriene A4 hydrolase, as well as in water. We demonstrate that the accuracy of our machine learning based intramolecular model is retained in the condensed phase, and that corrections to absolute protein–ligand binding free energies of up to 2 kcal mol − 1 are obtained.

The interplay of the intramolecular, or internal, energy of a molecule and the non-bonded energetics that determine its interactions with its environment play a crucial role in simulations of protein folding, 3 crystal structure prediction, 4 protein-ligand binding, 5 and many more. In particular, oral drugs are typically flexible, containing on average 5.4 rotatable bonds, 6 and are therefore capable of populating many free energy minima both in solution and also when bound to their target. Computational analysis has revealed that the majority of ligands bind in a conformation within 0.5 kcal mol −1 of a local energy minimum. 7 To be successful, docking or any other method used in computer-aided structure-based drug design must be able to accurately predict both the bioactive conformation of the molecule and the free energy change that accompanies its binding from solution.
The potential energy surfaces of organic molecules for practical applications are typically modelled using transferable molecular mechanics force fields such as AMBER, 8 CHARMM, 9 GROMOS 10 or OPLS. 11 When combined with molecular dynamics (MD) or Monte Carlo (MC) sampling, these force fields may be used to predict, for example, liquid properties of small molecules, 12,13 structural propensities of peptides, 14,15 and protein-ligand binding free energies, 16,17 all with reasonable accuracy. The intramolecular component of the force field is typically modelled by harmonic bond and angle potentials to represent two-and three-body terms, respectively, an anharmonic torsional term to model dihedral rotations, and Coulomb and Lennard-Jones terms to account for interactions between atoms separated by three or more bonds. 8,18,19 Details vary between these transferable force fields, but the fixed functional form is common to all. Thus, no matter how carefully the force field is parametrized, accuracy will ultimately be limited by the choice of this functional form.
For the description of intramolecular energetics, quantum mechanics (QM) is seemingly preferable and is frequently used in computational enzymology applications. 20 However, the computational cost associated with QM simulations is high, particularly for free energy predictions which require extensive (alchemical and/or conformational) sampling. 21 In order to make a calculation tractable the level of QM theory (basis set and exchange-correlation functional, for example) is often compromised, which again raises questions over the final accuracy. 22 Alternatively, one can construct direct fits to the high dimensional QM potential energy surface of the molecule. There is a wide range of methods available for fitting bond, angle and dihedral parameters of the MM force field to QM energies, gradients and Hessian matrices, [23][24][25][26][27][28] and these often include extended functional forms such as cross-terms to account for coupling between internal coordinates. 29 However, for larger, more flexible molecules, longer-range atomic interactions beyond the 1-4 interaction are also crucial in determining molecular conformation. For these molecules, a consistent, accurate approach to approximating the QM potential energy surface is key. Rather than relying on human intuition to decide on the most appropriate functional form this MM force field should take, it is preferable to harness the recent advances of machine learning inspired techniques. There are several neural network and kernel based techniques recently developed for material systems that can predict quantum energies and forces with remarkable accuracy. [30][31][32][33][34][35] Since these potentials need to be trained on only a few thousand (well dispersed) configurations, the underlying quantum mechanical data can be of high accuracy while maintaining affordable computational expense. These techniques have been successfully used to reproduce the atomisation energies [36][37][38][39] and QM potential energy surfaces 40-42 of a range of organic molecules.
With accurate energies and forces, the opportunity arises to begin to use machine learning based potentials in molecular dynamics simulations of organic molecules, 43,44 although so far studies have tended to focus on gas phase dynamics. Relatively little is known about the performance of these potentials in the condensed phase, where free energy basins that are unpopulated in the gas phase may emerge. 45 Such considerations are especially important for free energy simulations in computer-aided drug design involving molecules with multiple degrees of freedom where accurate sampling of conformational space is required.
In this work, we investigate the feasibility of using machine learning in developing accurate representations of the potential energy surface of a flexible, "drug-like" molecule for use in e.g. structure-based lead optimization efforts. We employ the Gaussian Approximation Potential (GAP) 46 framework that is based on a sparse Gaussian process regression technique, which is formally equivalent to kernel ridge regression. 47,48 GAP uses both QM energy and gradient information and although it was originally developed for material systems, it has been used successfully to describe molecular properties. 36,37 Here we use it to create a potential energy surface for 3-(benzyloxy)pyridin-2-amine (3BPA, Figure 1). Although still somewhat smaller than typical drug-like molecules (molecular weight of 200, three rotatable bonds, and three hydrogen bond donors/acceptors), it represents a challenging test case for machine learning due to its internal flexibility. The high effective dimensionality of the configurational manifold of the molecule requires a relatively large amount of training data extensively sampled from the potential energy surface. To address these challenges, we developed an iterative protocol to gradually improve the reconstructed potential, and applied sparsification techniques to cope with the relatively large amount of training data.
Despite its small size, 3PBA has been identified in two separate fragment screens as an efficient ligand of p38 MAP kinase 1 and leukotriene A4 hydrolase. 2 In the former study, although its binding affinity was found to be greater than 1 mM in an enzyme bioassay, 3BPA has a clearly defined binding mode to the hinge region of the ATP binding site of the kinase (Figure 1(b)). 1 While in the latter, the same compound binds near the bottom of the substrate binding cleft of leukotriene A4 hydrolase with sub-mM affinity (Figure 1(c)). 2 To investigate the binding of 3BPA in these two environments, we have interfaced GAP with the MCPRO software. 18 MCPRO is a tool for structure-based lead optimization through the use of free energy perturbation (FEP) theory combined with Monte Carlo sampling of protein-ligand binding modes. It has been used for the successful computationally guided design of inhibitors of targets including HIV-1 reverse transcriptase 49 and macrophage migration inhibitory factor, 50 and has recently been applied to the fragment-based design of inhibitors of the Aurora A kinase -TPX2 protein-protein interaction. 51 As we will show, our interface between GAP and MCPRO allows us to perform Monte Carlo simulations of 3BPA in a range of environments, and to compute corrections to the binding free energy that is computed using molecular mechanics force fields.

Creating a Gaussian Approximation Potential
We now briefly outline the Gaussian Approximation Potential (GAP) 46 framework, and how we apply it to create a potential energy surface for 3BPA that reproduces quantum mechanical energies to within 1.0 kcal mol −1 root mean square (RMS) error. GAP has been applied to many different materials and compounds, 52-62 and has been described in detail elsewhere, 63,64 and so we summarize here only the main features. Although the probabilistic and linear regression viewpoints are entirely equivalent, we follow the latter here because it is likely to be more familiar, and we will not be making use of the uncertainty estimates and parameter optimization techniques that follow naturally from the former.
The main idea of potential energy surface fits, and the way in which they go beyond conventional force fields, is that the potential energy is explicitly written as a generic function of all atomic degrees of freedom, without making assumptions about separability (e.g. into body ordered terms such as bond and angle potentials). Thus the fit to the potential energy is high dimensional. The basis functions for the fit have to be of a kind that allow systematic convergence to the a priori unknown target potential energy surface, and this has consequences for the attainable accuracy as a function of the amount of input data, for transferability to configurations far away from the distribution from which the training configurations were drawn, as well as for the overall cost of evaluation of the potential energy fit. Typically the high dimensional fits are significantly more expensive to evaluate than the short range terms of a conventional force field, though they are still of course much cheaper than a QM calculation, or the evaluation of the electrostatic potential of a large system that includes a protein and explicit solvent molecules.
Let us denote conformations of a molecule by letters A, B, etc. irrespective of how they are represented numerically. The target function, which in our case is the QM potential energy, is written as a linear combination of basis functions: where K is a positive definite similarity function between two conformations, often called a kernel function, which customarily takes the value 1 for identical conformations and smaller values for pairs of conformations that are less similar to one another, and x are the unknown coefficients. The sum ranges over a set M of representative conformations. For finite M , the basis is not complete, but by choosing the set appropriately (typically by drawing conformations from the same or a related distribution corresponding to where we expect to evaluate the function) the basis set is made relevant, and by enlarging the representative set, the approximation error can be decreased. This manner in which the basis set is adapted to the data is the principal way by which the problem of high dimensionality is circumvented. The success of this type of fitting then depends entirely on the regularity properties (colloquially, smoothness) of the target function.
The approximation can be significantly improved by choosing a numerical representation of conformations and a kernel function that respect basic physical symmetries of the potential energy of a molecule. These are translation, global rotation, and permutation symmetries.
The first two apply to any physical system, and we factor them out of the representation by transforming the set of Cartesian coordinates into the vector of all interatomic distances, Note how the dimensionality of this representation scales with the square of the number of atoms, n, but this is of little consequence, since all our samples will lie on the 3n dimensional manifold. Alternatively, one could work with the well-known internal coordinates of the z-matrix, and this choice would not increase the dimensionality.
However, the potential energy function is clearly a much less regular function of the internal coordinates, because changing some angles would correspond to much more drastic changes in Cartesian coordinates than others.
The complete permutation symmetry group of 3BPA has only eight elements, and so we simply sum the kernel function over the action of the group over one of its arguments, resulting in a permutationally invariant kernel, where G is the permutation group of the molecule and π is one of its elements. This technique is applicable to any representation of the molecular conformation and any base kernel K, and results in a permutationally invariant potential energy. In the present work, we use a Gaussian base kernel (often called a "squared exponential" kernel to distinguish this choice from Gaussian probability distributions) which, in terms of the interatomic distance representations, is given by: where R A i is the ith element of the vector of interatomic distances of conformation A, R B i is the corresponding element for conformation B, D is the number of elements in the representation vector, δ is an energy scale parameter and the θ i are length scale parameters (one for each dimension of the representation vector).
The coefficients in the ansatz (1) are determined by regularized least squares regression using energies and forces computed using quantum chemistry techniques on a diverse set of conformations (see below for further details). Given N conformations with n atoms in each, we have N (3n + 1) pieces of data, leading to the same number of linear equations when (1) is substituted either directly (for the energy) or by taking its derivative with respect to atomic positions (to obtain the forces). Let us collect the M unknown coefficients in (1) into a vector x, concatenate all the available data (energies e and forces f ) into the vector y = [e f ], and let L be the linear operator connecting this data vector with the energies of the input configurations, so that y = Le. Note that L consists of two blocks, the upper block is just the identity, and the lower block is the negative differential operator. With this, the regularized least squares problem is linear and can be written as: where K N M is the N × M kernel (or design) matrix, with elements given by the kernel function between the M representative configurations and all the N training configurations, and Λ is a diagonal matrix, whose elements are a set of parameters that control the relative weight of energy and force data and also the trade-off between accuracy and regularity of the fit. The solution to this linear problem is given by: where K M M refers to the M × M square matrix given by the kernel values between the representative configurations.
We note that the method of Chmiela et al. for generating potential energy surfaces of small organic molecules 65 uses the same kernel ridge regression technique with the following differences: (i) they include only gradient observables (i.e. forces) while GAP reconstructs the surface using both potential energies and forces, (ii) they use the same number of basis functions as there are data, which corresponds to M = 3N n above, (iii) their basis functions for the potential energy are derivatives of a base kernel (such as a Gaussian) with respect to atomic positions, rather than the base kernel itself, (iv) they use the inverse of interatomic distances as the arguments of the kernel. We have found no significant advantage to any of these variations, and note that (ii) would result in a larger linear problem, thus significantly increasing the computational cost. We typically find that M 3N n is sufficient.
Beyond the basic framework outlined above, we used one additional twist, inspired by the form of empirical organic force fields. There, the energy is typically separated into a larger bonded terms (i.e. terms involving up to 1-4 bonded interactions) and smaller non-

Generating Training Data
The goal of the GAP training procedure was to recreate the QM potential energy surface of 3BPA at the MP2/6-311G(2d,p) level of theory. This choice represents a compromise between accuracy and computational expense; one energy and force evaluation requires approximately 1 cpuhr, which makes it feasible to generate thousands of data points. However, accurate characterization of the multidimensional potential energy surface requires extensive sampling, which is not practical with an expensive QM method, so we used the following protocol.
First, we performed several independent molecular dynamics (MD) simulations in the gas phase using MP2 but with a smaller, cheaper basis set ( We then ran simulations with the fitted potential both in water solution and bound to leukotriene A4 hydrolase (see below). These latter simulations revealed a number of samples with very high energy when evaluated with the QM method. Therefore, 300 such configurations were added to the training set, and the potential refitted. Such iterative fitting has been used before, 52 In what follows, we refer to the two versions of our machine learned intramolecular potentials as GAP-v1 and GAP-v2.
Interfacing GAP and MCPRO where E L represents the intramolecular energy of the ligand, E R the potential energy of the receptor, including water molecules, and E RL the interaction energy between ligand and receptor. Similar to a hybrid QM/MM simulation set-up, we treat the various energetic components using different levels of theory. The protein environment is described using the OPLS-AA/M force field, 14 and water molecules using the TIP4P model. Receptorligand interactions are described using standard OPLS/CM1A Coulombic and Lennard-Jones interactions. 18,68 The intramolecular potental energy of the ligand is written in the general form: which allows us to perform standard MM simulations using the OPLS/CM1A force field (λ = 1), GAP simulations in which the ligand energetics are determined as described above (λ = 0), or any intermediate state determined by the coupling parameter λ. The latter feature allows us to employ free energy perturbation (FEP) theory to smoothly alter the ligand intramolecular energy between the GAP and OPLS/CM1A force fields, and thus to compute the free energy difference between the two states. Figure 2 shows the proposed free energy cycle used to correct the MM binding free energy. Conventional FEP studies compute the (absolute or relative) free energy required to extract the ligand from solution into the protein binding site (∆G M M ). The corrected binding free energy is given by: where ∆G A and ∆G B are the free energy differences between the GAP (λ = 0) and MM (λ = 1) models computed in water and protein environments respectively. The implementation of GAP is fully compatible with the replica exchange with solute tempering method, 17,69,70 which allows us to perform enhanced sampling of the ligand degrees of freedom, and with the JAWS algorithm, which aids hydration of the binding site in the bound simulations. We begin by examining in more detail the training data used in the construction of the GAP. As shown in Figure 1, 3BPA has three flexible dihedral angles connecting the two saturated six-membered rings. As such, the relatively large accessible conformational space poses a challenge for machine learning techniques. Figure 3 The compound bound to p38 MAP kinase adopts a conformation that is well sampled by our training data (φ 1 = 334 • , φ 2 = 204 • ). Interestingly, on the other hand, the pose in the leukotriene A4 hydrolase crystal structure is in a seemingly disallowed region of conformational space (φ 1 = 168 • , φ 2 = 116 • ). Closer inspection reveals that, in this conformation, the -NH 2 group on the aminopyridine is in unphysical close contact with the -CH 2 -linker (Figure 1(c) and Figure S2). This is, therefore, likely an artefact of the crystal structure refinement. By visual inspection, we were able to orient 3BPA within the leukotriene A4 hydrolase binding site with a conformation that is more consistent with the QM dynamics (φ 1 ∼ 270 • , φ 2 ∼ 270 • ). We therefore used this bound conformation as the starting point for our MC simulations. Next, we ran MC simulations of 3BPA in three different environments, using GAP-v1 to describe its intramolecular energetics, and the OPLS/CM1A force field to describe its interactions with the proteins and water. As discussed in the Introduction, 3BPA is capable of occupying a range of potentially environment-dependent conformations, and so it is important not only to validate the gas phase potential energy surface, but also those conformations adopted in the condensed phase. Hence, 300 configurations of 3BPA were saved from each trajectory, and its energetics recomputed at the MP2/6-311G(2d,p) level of theory in vacuum. Table 1 shows the RMS error in the GAP for 3BPA in water, and bound to the two proteins. The errors are less than 1 kcal mol −1 in water and bound to p38 MAP kinase, which is consistent with the reported accuracy of the GAP on the test set described in the Methods section. However, despite the reorientation of 3BPA in the binding pocket of leukotriene A4 hydrolase, the RMS errors in the GAP are extremely high (19 kcal mol −1 ).
This result is consistent with a lack of training data in the region of conformational space close to φ 1 = 270 • , φ 2 = 270 • (Figure 3(a)). Therefore, 300 configurations were extracted from MC simulations of 3BPA bound to leukotriene A4 hydrolase and were added to the original training set to produce the dihedral angle distribution shown in Figure 3(b).  Table 1. Now, the errors are close to 1 kcal mol −1 in all three environments. Figure 4 further reveals a very good correlation between GAP and QM intramolecular energies, although there is one significant outlier whose phenyl and pyridine rings approach too close (Figure 4, inset). Removal of this configuration from the ensemble of 3BPA in water reduces the error in the GAP still further from 1.42 kcal mol −1 to 0.93 kcal mol −1 . Further iterative training of the GAP would prevent sampling of this configuration during the MC simulations. Figure 3(c) (and Figure S3(c)) shows the distribution of dihedral angles sampled during these three MC simulations, and reveals that all areas of conformational space are now well-represented by the training data. Figure 3(d) (and Figure   S3(d)) shows the equivalent distributions using the MM force field, which appears to have a stronger preference for a single energy basin (close to φ 1 = 0 • , φ 2 = 180 • ), which contrasts with the GAP dynamics and original QM training data. Figure S4 confirms that duplicate runs with different starting conditions sample similar dihedral distributions, which indicates that conformational differences are associated with differences in the underlying potential energy surfaces rather than sampling limitations.
The structures of the protein-ligand complexes sampled during MC simulations are in good agreement with the crystal structures of 3BPA bound to p38 kinase and leukotriene A4 hydrolase. Figure 5 shows representative structures from the MC simulations overlaid on the original crystal structures (Supporting Information). Both GAP and MM retain the binding mode indicated by the crystal structure of p38 MAP kinase. As discussed, we have identified a binding mode of 3BPA to leukotriene A4 hydrolase that appears to be consistent with both QM dynamics and the observed electron density map. 2 We emphasize that the crystallographically-assigned structure of 3BPA is not physically reasonable due to severe steric clashes ( Figure S2), although alternative (and multiple) binding modes are possible.
The alternative binding mode proposed here is stable throughout the GAP simulation, which is a good indication that GAP is able to capture a range of conformations of this flexible molecule. The alternative binding mode is not stable in the MM simulation, and there is a rotation of the pyridine ring of 3BPA, which breaks the hydrogen bond between the amine group and the backbone of residue Pro374. However, in the duplicate MM run ( Figure S4), the bound conformation is stable for longer before the hydrogen bond is broken, and so longer simulations would be required to establish the equilibrium populations of these binding modes.
It should be emphasized that reproduction of the total QM energy for a flexible molecule of this size (15 heavy atoms) to an accuracy of 1 kcal mol −1 is a significant task. For comparison, we have computed the MM energies of the configurations of 3BPA extracted from the GAP-v2 MC simulations in the three different environments. Table 1 and Figure 4 summarize the accuracy of OPLS/CM1A, which is expected to be typical of standard small molecule force fields, in comparison with QM data. As expected, the MM force field is significantly less accurate than the machine learning potential. These improvements in intramolecular energetics are expected to carry over into improved thermodynamic quantities, such as binding free energies. As discussed in the Methods section, our implementation of GAP in the MCPRO software allows us to estimate corrections to protein-ligand binding free energies using free energy perturbation theory. In particular, the intramolecular energetics of the ligand were smoothly altered from GAP to OPLS/CM1A and the free energy cycle shown in Figure 2 was employed to compute the correction to the binding free energy, ∆G A − ∆G B (eq 8). Note that we have not computed the absolute binding free energies here. Focussing first on the binding of 3BPA to p38 MAP kinase, both GAP-v1 and GAP-v2 give a correction to the MM binding free energy of close to 1 kcal mol −1 . That is, we expect the standard MM force field to over-estimate binding, in this case, due to inaccuracies in the treatment of intramolecular energetics of the ligand in water and in the protein binding site. It is reassuring that the two versions of GAP agree on the magnitude of the correction; one would not expect the addition of extra training configurations to substantially affect the energetics of the molecule in either water or the p38 kinase binding site. The correction to the binding of 3BPA to leukotriene A4 hydrolase is larger, which is consistent with the inability of the MM force field to produce a binding mode that is consistent with the experimental electron density map ( Figure 5).

Discussion and Conclusions
In this paper, we have reported the first construction, training and application of the Gaussian approximation potential to an organic molecule with significant conformational flexibility. The potential is a full dimensional fit of the molecular potential energy surface, with squared-exponential basis functions corresponding to conformations from a MD run.
The potential can be systematically improved by adding more training data (energies and gradients) and more basis functions. Iterative training was used, whereby further sampled configurations are collected from a run with a previous version of the potential.
The GAP was trained using just 3300 (MP2/6-311G(2d,p)) QM calculations in total. In comparison, computation of the thermodynamic quantities reported in this paper required around 10 8 evaluations of the ligand intramolecular energetics, which would have been infeasible using even a very inaccurate QM/MM approach. Interestingly, using the FEP set-up described here (Supporting Information), GAP is only a factor of three times slower than the OPLS force field. There are two reasons for this. First, the speed of machine learning potentials are usually compared with evaluation of force field terms for small molecules, whereas in the condensed phase, long-ranged electrostatic interactions become a significant computational overhead. Second, the ligand is not moved at every Monte Carlo step in the condensed phase, so evaluation of the intramolecular energetics can be skipped when not required. The second version of the GAP is able to reproduce QM energies to a high accuracy of close to 1 kcal mol −1 following training on a gas phase QM MD data set, supplemented by configurations of the ligand taken from the binding site of leukotriene A4 hydrolase. We envisage iterative fitting approaches such as this being a key feature of future machine learning potentials to fill any gaps in the training data, especially if corrections can be automated and made on-the-fly. It is encouraging that substantial improvements could be made to version 2 of the GAP with only 300 extra training configurations and minimal changes to its behavior in the water and p38 kinase environments.
We have chosen to demonstrate the application of the GAP to the computation of corrections to the MM binding free energy of 3BPA to two proteins. The GAP is used to describe the intramolecular energetics of the ligand only. It should be emphasized that there are still inaccuracies in protein dynamics and protein-ligand interactions due to the use of standard MM force fields for these components of the total energy. However, a wide range of parallel work is being devoted to deriving these components from QM data, either within the confines of the MM functional form 72,73 or using expanded machine learning data sets. 42,44 By making use of free energy perturbation theory, we estimate the corrections to MM binding free energies to be close to 1 and 2 kcal mol −1 for p38 kinase and leukotriene A4 hydrolase, respectively. For comparison, a recent study of 138 experimentally-verified FEP predictions of relative free energies of binding found that the accuracy of the computed results is close to 1 kcal mol −1 . The computation of absolute binding free energies is expected to be less accurate than relative free energies, nevertheless it appears that substantial accuracy gains are achievable by improving the description of intramolecular energetics. Of course, in computer-aided drug design one is typically interested in the relative binding free energies of a congeneric series of ligands, and similar free energy cycles could be employed also in these applications. We note in this regard that a full evaluation of the accuracy of GAP for correcting protein-ligand binding free energies will require evaluation of a significantly larger validation set for which accurate experimental data are available (only approximate binding affinities of the 3BPA fragment are available 1,2 ). Our goal is first to optimize the balance between accuracy and the size of the training dataset (which affects the computational time required to evaluate the potential). This is expected to be even more crucial as we move to drug-like compounds that are typically even larger and more flexible than 3BPA. A sig-

S1 Computational Methods
The initial protein-ligand coordinates were constructed from the 1W7H 1 and 3FTY 2 PDB files. For p38 kinase, the 213 residues closest to the ligand were retained. For leukotriene A4 hydrolase, 309 protein residues were retained, and the configuration of the ligand was adjusted by inspection with reference to the QM data as explained in the main text. Monte Carlo simulations were performed using the MCPRO software package. 3 In both cases, all protein residues within 12.5Å of the ligand were free to move, and backbone motions were controlled using the concerted rotation algorithm. 4 The protein-ligand complexes were solvated in 25Å water caps and the JAWS algorithm 5 was run to determine initial water molecule distributions around the ligand. For the simulations of 3BPA in water, the molecule was solvated in a 25Å water cap.
Free energy perturbation (FEP) calculations employed 11 λ windows of simple overlap sampling 6 to perturb from the GAP to the OPLS/CM1A potentials as described in the main text. At each λ window, four replicas of the system were run in parallel with replica exchange with solute tempering (REST) enhanced sampling applied to the ligand.  Figure S1 shows the convergence of the relative binding free energy correction with respect to the number of MC steps. In the REST approach, high temperature replicas of the system facilitate crossing of any high energy barriers to sampling, and replica exchange ensures correct Boltzmann sampling in the room temperature replica. REST temperature scaling factors were chosen to be exponentially distributed (25,86,160, 250 o C), and exchange attempts were made every 10 000 MC steps.
Each REST replica runs on a separate cpu, and each λ window requires around 60 hrs using GAP-v2, compared to around 20 hrs for OPLS. Note that ligand Monte Carlo moves are not attempted at every step, which is why the GAP simulation is only a factor of three slower using the current set up.
Representative snapshots from the Monte Carlo simulations were generated using the    download file view on ChemRxiv manuscript-cole-SI.pdf (441. 16 KiB)