R. E.
Skyner†
a,
J. L.
McDonagh†
a,
C. R.
Groom
b,
T.
van Mourik
a and
J. B. O.
Mitchell
*a
aSchool of Chemistry, University of St Andrews, Purdie Building, North Haugh, St Andrews, Fife KY16 9ST, UK. E-mail: jbom@st-andrews.ac.uk
bCambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK
First published on 23rd January 2015
Over the past decade, pharmaceutical companies have seen a decline in the number of drug candidates successfully passing through clinical trials, though billions are still spent on drug development. Poor aqueous solubility leads to low bio-availability, reducing pharmaceutical effectiveness. The human cost of inefficient drug candidate testing is of great medical concern, with fewer drugs making it to the production line, slowing the development of new treatments. In biochemistry and biophysics, water mediated reactions and interactions within active sites and protein pockets are an active area of research, in which methods for modelling solvated systems are continually pushed to their limits. Here, we discuss a multitude of methods aimed towards solvent modelling and solubility prediction, aiming to inform the reader of the options available, and outlining the various advantages and disadvantages of each approach.
Accurate and timely prediction of solubility could save time and money in drug development, agrochemical development and environmental monitoring. An early-stage analysis of drug and agrochemical candidates allows organisations to focus on those molecules most likely to meet their required solubility criteria. Many models exist in this area, with differing levels of accuracy, physical interpretability, and calculation time.
Quantitative Structure Activity Relationships (QSAR) and Quantitative Structure Property Relationships (QSPR) are very successful in this field, providing good predictive results at a reasonably low computational cost. These models, however, tend to be limited to molecules similar to those used in their training set. Moreover, these models lack a full physical interpretation, although some do allow assessments of descriptor importance that can perhaps to some extent be physically interpreted.
Several fitted or derived general equations, which take only a few pieces of empirical data as arguments, have also been produced. One of the most successful is the General Solubility Equation (GSE),5 taking the melting point and the base ten logarithm of the partition coefficient (logP; partition coefficient for neutral molecules in octanol and water) as empirical input.
The field has also seen the revival of old ideas as new automated data driven design protocols, such as Matched Molecular Pair Analysis (MMPA).6 MMPA allows one to acquire previously ‘unknown’ data from existing data sets by exploring how a single molecular change can impact a particular property or activity of interest. We now see large scale data mining following these kinds of protocols, consortia such as SALT MINER, and programs developed by individual companies such as GSK's BioDig.7,8
In addition to these approaches, we see physics based models ranging from classical simulations to quantum chemical calculations being applied to solubility prediction. These methods vary greatly in complexity. Classical simulations can encompass simple Molecular Dynamics (MD), studying the interactions between solute and solvent, to more complex perturbations of solutes from the solution phase to the gas phase. Recent advances have seen a new generation of polarisable force fields emerging with a greater capacity to account for changes in the electronic charge distribution. Many of these force fields utilise multipole moments, as opposed to point charges, to capture the anisotropy of the charge distribution. Force fields such as Atomic Multipole Optimised Energetics for Biomolecular Applications (AMOEBA) have been used to study the solvation dynamics of ions.9 Newer, polarisable force fields, such as the quantum chemical topology force field (QCTFF), use multipolar electrostatics calculated based on quantum chemical topology, supplemented with machine learning (Kriging) to model the system. This force field has been used to model amino acids with small water clusters.10 Some force fields can be mixed with a quantum chemical core region in mixed Quantum Mechanics–Molecular Mechanics (QM/MM) approaches.
Other common models include those representing the solvent as a continuous field with no explicit solvent coordinates. In most cases, these models come at much higher computational cost than their informatics counterparts, and often at lower accuracy. However, if such a method were feasible and accurate enough to predict solubility, it would not have a domain of applicability restricted by the molecules within a training set and would also be physically interpretable. Thus, there is a continuing search for such physical methods. These methods have proven useful for modelling or approximating the solution phase, hence their applications are diverse and widespread outside of solubility prediction.
![]() | (1) |
As solubility is a thermodynamic term, it is inherently affected by factors such as temperature and pressure, as well as ionisation, solid state effects, and gaseous partial pressure for solvated gases.
pH is considered to have a significant effect on solubility, as many organic molecules can behave as weak acids or weak bases, due to ionisable basic or acidic functional groups, with polarisation of ionisable groups in solution increasing or decreasing the overall solubility. The pH of the aqueous solution in which such molecules are dissolved determines whether the molecule exists in its neutral or ionised form. The charged form of a molecule is more soluble, and thus the aqueous solubility of a substance is pH-dependent.12 This dependence is described by the Henderson–Hasselbalch (HH) equations as follows;
![]() | (2) |
Intermolecular interaction strengths play an important role in the solvation of substances from the solid state. Solutes which exhibit weak intermolecular forces (i.e. are weakly bound) tend to have a higher solubility, as the energy cost of breaking up the lattice is lower. Polymorphic effects can also lead to complications in solubility prediction. A classically cited example of this is the case of the anti-HIV drug Ritonavir,16,17 in which a polymorphic shift led to a significant change in solubility, leaving the drug with a greatly reduced bio-availability. This exemplifies the consideration of solubility as a property which is dependent upon solid, solute, solvent, and solution state properties and interactions.
Two common approaches to the calculation of the Gibbs free energy of solution utilise a thermodynamic cycle approach. A first approach calculates the free energy of solution by addition of the free energy of sublimation (taking the molecule in the crystalline phase and subliming it into the gaseous phase) and free energy of solvation (taking the molecule in its gaseous phase and solvating it into aqueous solution). An example of this approach is shown in Section 5 of this review, and other examples are also cited within the literature.14,18,19 A second approach involves calculation of the free energy of solution by addition of the free energy of fusion (taking a molecule from the crystalline state to a hypothetical supercooled liquid) and the free energy of transfer (transfer from a supercooled liquid into aqueous solution). This method is widely cited within the literature, and common GSE methods are also derived from this approach.5 Both thermodynamic cycle approaches are depicted in Fig. 1.
The solid state is an important consideration for the initial crystalline phase calculated within thermodynamic cycle approaches. Lattice minimisation calculations and periodic DFT provide excellent tools for modelling these systems. Recent advances in these methods show promise for improving predictions, these include updated codes and improved dispersion corrections in periodic DFT.20,21
Complete polymorphic screening and prediction still eludes our capabilities and hence hampers our ability to predict solubility from purely first principles.
A further consideration is that of the standard states used in the different physical states. Typically sublimation data is reported in a 1 atmosphere standard state. Solvation is typically quoted in the Ben-Naim standard state of 1 mol L−1 with a fixed centre of mass. The difference between the two standard states is a constant 1.89 kcal mol−1 (7.91 kJ mol−1), calculated as ΔGatm → mol L−1 = RTln(24.46), where 24.46 is the molar volume at ambient conditions.
The free energy of solution can be calculated directly by the following formula:
![]() | (3) |
A convenient formula19 allows the solution free energy to be calculated using the native standard states, and removes the dependence on the crystalline molar volume:
![]() | (4) |
![]() | (5) |
SVM training algorithms are built up of binary categorised data, whereby a particular data point belongs to one of two categories. Thus, the test set data is also categorised, producing a clear separation, which should be as wide as possible, in the feature space. Alternatively, in the case of regression, the surface behaves analogously to a regression line, providing a maximal explanation of the data within the bounds of an acceptable error margin whilst attempting to remain relatively flat to avoid overfitting.22,23
log![]() ![]() | (6) |
log![]() ![]() | (7) |
GSE is a simple QSPR model, with powerful predictive ability (coefficient of determination (r2) = 0.96 and root mean squared error (RMSE) = 0.53 logS units for a data set of 1026 organic molecules29), and the simplicity of the model means it has found wide application in the pharmaceutical industry. However, the reliance of the GSE on experimentally determined descriptors limits its applicability, and datasets sparsely populated at their limits can lead to overestimation of the model's predictive power.30
Ali et al.30 have revisited the GSE and have attempted to relieve the reliance of the GSE on the experimentally determined melting point by replacing it with a descriptor that describes the topological polar surface area (TPSA). They demonstrate the effects of inflated predictive power of the GSE by using a subset of an initial dataset, which reduced the overall predictive power of the GSE by approximately 6.4%. TPSA was included in a revised model to account for the fact that 88.5% of poorly performing compounds contained polarisable groups. The pure GSE model employed provided r2 = 0.818, and the TPSA replacement of melting point model provided r2 = 0.813, showing a comparable effectiveness. The number of compounds containing polarisable groups with logS predicted within ±1 log unit of experimentally determined values was also higher for the revised TPSA model (83.2% TPSA; 79.6% GSE). A final model combining melting point, log
P and TPSA was also tested, and was found to have a better predictive power than both of the previously employed models (r2 = 0.869) with 90.8% of compounds containing polarisable groups predicted within ±1 log unit of experimentally determined values.
The work of Ali et al.30 highlights the importance of reliable descriptors in improving the overall performance of QSPR models, particularly when polar or polarisable functionality is included in test sets, and when experimentally determined values are required. As such, experimentally determined values may be best suited only for comparative analysis of predictive models to experimental data as a measure of performance in many cases.
Another recent approach, by Lusci et al.,27 applies deep learning to the solubility prediction problem. The deep learning method is based on recursive neural networks adapted for undirected graph representations of molecules. The method produces good predictions of solubility on a number of standard datasets in the field.27
A further example of a cheminformatics approach is demonstrated by Shayanfar et al.31 who apply a simple QSPR model to the prediction of aqueous solubility of drugs, validated by cross-validation. A training set of 220 drug-like molecules was used to build a model with MLR. Six descriptors (solute, melting point, experimental logP, calculated Abraham solvation parameters, calculated C
log
P values and calculated melting points) were regressed against experimental aqueous solubility from the literature to develop a three-variable model, calculating aqueous solubility from Abraham solvation parameters, C
log
P and melting points. The three variable model was then tested with cross validation, and a final two-variable model was developed with the excess molar fraction of the compound E and C
log
P. The two variables used gave an R2 value of 0.934 and a standard error estimate s of 0.893. The proposed model was compared to a GSE model and a linear/solvation/energy relationship (LSER) model. Correlations between each model's computationally determined values of aqueous solubility with corresponding experimental values gave an R2 = 0.62 for GSE, R2 = 0.57 for LSER and R2 = 0.66 for the proposed MLR method.
Recent work has also suggested that, contrary to popular arguments, the quality of the experimental data available is not the limiting factor for the predictive accuracy of solubility predictions obtained from cheminformatics models.32 This work may suggest that inherent limitations within the models are responsible for the largest part predictive errors.
A simplification of continuum models can be thought of in terms of a Hamiltonian as;
Ĥtot(rM) = ĤM(rM) + ĤMS(rM) | (8) |
In a standard continuum model, generally represented by Polarisable Continuum Models (PCM), solute–solvent interaction energies can be represented by a number of Qx operators. The free energy of M is therefore described by an expression of five terms;
G(M) = Gcav + Gel + Gdis + Grep + Gtm | (9) |
In a solvent–solute system where atom Q (solute) has a positive charge, solvent water molecules will preferentially orientate their negative dipoles towards the solute's positive charge (Fig. 3, left). For a single water molecule, there is only a slight preference in orientation, which is smaller than that of its average thermal fluctuations. Therefore, this effect is averaged over the long range of electrostatic interactions of water in the bulk (Fig. 3, right). For an isotropic solvent with random thermal motion, the average electric field is zero at any given point. However, introduction of a solute gives a net change in orientation, introducing an overall change in electric field, known as the ‘reaction field’.
Accounting for the reaction field increases the solute's polarity proportionally to the solute polarisability, and the strength of the external electric field. This causes an increase in the dipole moment of Q, consequently polarising and increasing the change in orientation of the solvent to oppose the dipole moment of Q.3
There are energy costs associated with both the orientation and polarisation of the solvent, and the dipole moment of Q. As solvent molecules oppose the dipole moment of Q, they interact unfavourably with the reaction field. They also lose configurational freedom, with an associated free-energy cost. In a continuum model, the charge distribution of a solvent is represented as a continuous electric field, statistically averaged over all degrees of freedom at thermodynamic equilibrium. The electric field at any given point is the gradient of the electrostatic potential. The work required to create the charge distribution is determined from the interaction of solute charge density ρ with the electrostatic potential ϕ from;
![]() | (10) |
The polarisation component of G, which we call GP, is the difference between charging the system in gas and solution phases; thus only the electrostatic potentials in both gas and solution phases are needed to calculate GP.
PCM methods are generally applied through two models: the Poisson–Boltzmann (PB) model, and the Generalised Born (GB) model. Both models are advantageous for different systems, and the accuracy of either model is mostly dependent upon the suitability of the cavity type used to surround the solute molecule within an ideal solvent system.
![]() | (11) |
Continuum solvation models represent the charge distribution on the basis of two separate areas: inside (solute) and outside (solvent) of a cavity (Fig. 4). For this case, the Poisson equation states;
∇ε(r)·∇ϕ(r) = −4πρ(r) | (12) |
The Poisson equation as expressed above is valid only for systems under non-ionic conditions. In a real solution, dissolving a solute produces mobile electrolytes. This effect is accounted for by an expansion of the Poisson equation, known as the Poisson–Boltzmann (PB) equation;
![]() | (13) |
PB equations are best used to calculate the electrostatic potential of systems where the cavitation of solute is near-spherical or ellipsoidal (ideal cavitation), as the convergence of the predicted electrostatic component of the solvation free energy ΔGE is computationally expensive and often inaccurate. Thus, derivations applying approximations of the Poisson equation are often used in continuum models,33 the most common of which are Self-Consistent Reaction Field (SCRF) models, such as the Onsager model.34
A further limitation of PB based models is the definition of cavitation. A number of variational SCRF models have been proposed in order to optimise cavitation parameters, most commonly using tessellation (tiling) of the cavity surface to simplify and reduce iterations of the PB equation.33
A conducting sphere with charge q can be considered representative of a monatomic ion. If the surface of the sphere is assumed to be entirely smooth, the charge distribution around it will be uniform, and the charge density at any point is given by;
![]() | (14) |
![]() | (15) |
![]() | (16) |
![]() | (17) |
The experimental data from Pierotti's work has been complemented by simulation data,36 including free energy of formation data of molecular-sized cavities in 12 common solvents obtained from free energy perturbation simulations. Pierotti's formula has since been expanded for molecular cavities by Colominas et al.37
A further, specific contributing factor to solvation free energy is dispersion. A somewhat simplistic explanation of dispersion is as follows. The average electron cloud of an atom is spherically symmetrical, but at any instantaneous time point there may be a polarisation of charge causing an instantaneous dipole moment. This dipole moment interacts with neighbouring atoms, inducing a second instantaneous dipole, and so on, and an interaction occurs between these. The in-phase correlation of instantaneous and induced dipoles mean the overall interaction energy does not average to zero over time.3 The average interaction energy falls off (largely) proportionally to r−6 (where r is the distance between interacting particles). The multipole expansion of the dispersion interaction is written;
![]() | (18) |
A further notable feature of continuum models based on surface tension is the neglect of any other contribution; that is, the development of these models assumes surface area as the sole determinant of solvation free energy, and that electrostatic components are implicit within the calculation parameters used.33
Integral Equation Formalism PCM (IEFPCM) is the current version of PCM applied in common quantum chemistry packages. IEFPCM is a reformulation of dielectric PCM (DPCM) in terms of the integral equation formalism. One of the biggest challenges to PCM methods is that they are all derived assuming the solute charge density is entirely encapsulated in the cavity. This is often not the case, as the electron distributions often extend beyond the cavity. IEFPCM has been shown to cope well with this effect when compared to other PCM based methods.33
A further variation of PCM is the conductor-like polarisable continuum model (CPCM), which is often considered one of the most successful solvation models.39 The conductor-like screening model/conductor-like screening model for real solvents (COSMO/COSMO-RS)40 is a variation on Poisson–Boltzmann PCM and CPCM. In COSMO the dielectric permittivity (ε) is set to infinity (ε = ∞). This defines the solvent as a conductor, which is suggested as a more realistic approximation for strong dielectric media such as water, with the first version of COSMO40 having values of the dielectric constant with a relative error of less than ½ε−1. COSMO has been shown to be a reliable and readily available method for calculations on the liquid and solution phases. The use of a boundary condition for the calculation of total potential in place of a traditional dielectric boundary condition for the electric field found values within 10% of the exact results obtained from dielectric boundary condition methods.41 COSMO-RS extends the COSMO code to also define the ability of the solvent to screen the surface charge on the cavity of the solute. Parametrisation of COSMO and COSMO-RS performed by the software developers tested 217 small to medium neutral molecules, spanning a vast functionality of H, C, N, O and Cl. An overall accuracy of 0.4 (rms) kcal mol−1 for chemical potential differences was achieved.41
A recent addition is the solvation model based on density (SMD). This model applies the IEFPCM protocol, solving the non-homogeneous Poisson equation using a set of optimised atomic Coulomb radii. The non-electrostatic contributions are calculated on the basis of a parameterised function which includes terms for atomic and molecular surface tensions as well as the solvent accessible surface area.42
A recent investigation of gas to solution phase standard state Gibbs free energies of solution compares energies obtained for six combustion gas flue compounds at the G4 level of theory using IEFPCM, CPCM and SMD implicit solvent models for 178 organic solvents. It is found that IEFPCM and CPCM produce similar ΔGS values for all six flue compounds, with maximum absolute intra-solvent deviations of <1.6 kJ mol−1. Intra-solvent deviations between the IEFPCM and SMD models up to 45.5 kJ mol−1 were observed. IEFPCM and CPCM also showed strong correlation between calculated solvent ε and ΔGS for all solvents, whereas SMD showed a much more varied relationship.43
Statistical thermodynamics relates all observable thermodynamic properties to the partition function, Q. The partition function is summarised as;
![]() | (19) |
Explicit models consider solvation in terms of free energy calculations, with different models for water available, as discussed below.
![]() | (20) |
More recent derivations of Zwanzig's model allow the division of perturbations into smaller calculations, allowing parallelisation. These models involve breaking the reaction pathway down into a series of intermediate transition state steps, allowing better convergence between the initial and final structures investigated.47 However, FEP calculations remain one of the most computationally expensive methods for calculating free energy differences.
An example of this is shown by Lüder et al.48 who have investigated the effectiveness of FEP methods for the calculation of free energy of solvation in pure melts for 46 drug molecules. Simulations were performed in two stages, scaling down the Coulomb and Lennard-Jones (LJ) interactions independently. Results were interpreted under the assumption that the free energy of the vapour to liquid process ΔGvl can be calculated from the sum of the free energy term for cavitation ΔGcav and the energy associated with LJ interactions and half of the Coulomb interaction term. ΔGcav is obtained from hard-body theories. Interaction energies and molar volumes for each of the 64 drug molecules were compared for systems comprising 260 molecules. Deviations between systems were found to be an average of 2.9% for intermolecular interaction energy, and 1.4% for molar volume, suggesting the dataset selected would provide reliable results. Predicted and simulated ΔGcav values are found to be systematically underestimated by approximately 15%. An overall average deviation of calculated ΔGvl values in comparison to experiment is −1.8 kJ mol−1, with reasonable errors expected in the range −1 to 1 kJ mol−1. This investigation suggests that overall, FEP methods require more work at the theory level, particularly due to systematic errors that occur in phase space relationships between reference and perturbed systems.
An alternative approach to calculating the free energy difference from one state to another is to treat the change from A to B as a transformation, rather than to calculate free energies of independent structures, and calculate an energetic difference, as in traditional FEP methods.3
A recent application of this method, derived from FEP, has been demonstrated by Liu et al.49 for the calculation of the solubility of gases in ionic liquids. The Bennett acceptance ratio (BAR) method utilises the method of transferring between states instead of treating each state as an individual structure. The Coulomb and LJ terms are calculated separately. It is found that simulated solubilities are found in good agreement with Henry's law constants. However, comparison to experimental data finds poorly soluble gases to have larger errors, with underestimated and overestimated gas solubilities found with similar calculation methods in complementary studies.
Wyczalkowski et al.50 recently proposed two new methods for the estimation of entropy and enthalpy decomposition of free energy calculations, evaluated for the solvation of N-methylacetamide (NMA). The methods investigated found thermodynamic contributions to be in disagreement with experimental data, highlighting the difficulty in obtaining decompositions comparable in quality to free energy estimates, with thermodynamic decomposition of computational Helmholtz free energies of solvation (ΔF at fixed volume) values yielding errors approximately two orders of magnitude larger than the initial ΔF values found. It is noted that ΔF values are statistically reliable and can be used for quantitative comparison to experimental data. The calculation of entropic and enthalpic contributions is also extremely computationally demanding, as every temperature point of a simulation requires recalculation of the overall free energy.3 The authors highlight that where calculation of free energies of solvation has advanced so that computational errors are on par with experimental ones, thermodynamic decomposition calculations suffer from statistical errors 10–100 times larger than free energy of solvation calculations.
A recent study by Ahmed and Sandler51 uses the decomposition of free energies of hydration and self-solvation of low polarity nitrotoluenes to consider an array of thermodynamic terms and physiochemical properties. These include: solid-phase vapour pressures, solubilities, Henry's law constants, hydration and self-solvation entropies, enthalpies, heat capacities and enthalpies of vaporisation or sublimation. Their study focuses on the temperature-dependence of various terms. Decomposition of hydration free energies into enthalpic and entropic contributions is performed by a method utilising polynomial fitting of temperature-dependent self-solvation free energies (with respect to temperature). The use of fitting increases the sensitivity of derived values of hydration free energies. Self-solvation enthalpy (ΔHself) values and entropy (TΔSself) values are calculated within approximately 2 kcal mol−1 of experimentally determined values.
The foundational concepts involve the partitioning of a desired system into two subsystems: the QM subsystem, containing a small number of atoms and described by QM, with the remainder of the system described by a suitable MM force field. The Hamiltonian of the whole system is simply written;
H = HQM + HMM + HQM/MM | (21) |
Theoretically, any desired level of accuracy can be used within the QM region of the simulated system, within the scope of available methods. However, more accurate methods are susceptible to high computational cost. Thus, careful consideration is required by the user as to what level of accuracy is required, and at what cost. A succinct overview of different available QM methods is provided by Friesner and Guallar52 for QM/MM methods applied to enzymatic catalysis, with descriptions, advantages and disadvantages of respective QM methods available in textbooks such as the one by Cramer.3
A primary consideration when selecting a QM/MM method is the interactions at the QM/MM interface. Two aspects must be considered; (i) the presence of covalent bonds across the interface – a particular concern for large (e.g., biomolecular) molecules, (ii) the influence of the MM solvent region on the QM region – electrostatic and van der Waals interaction terms must be included.
In order to treat covalent bonds at the interface, it is possible to introduce “link atoms”. Link atoms are QM hydrogen atoms that fill free valencies of QM atoms connected to MM atoms. A disadvantage of this method is the debate about inclusion of Coulombic interaction terms for the link atoms. Other methods developed in order to avoid the use of link atoms include the Local Self-Consistent Field (LSCF) method, which applies a mixture of hybrid and atomic orbitals to represent the QM system, and the “connection atom” method, where MM and QM interface atoms are described as QM methyl groups with a free sp3 valence.
A recent three-layered approach aiming to tackle the issues associated with the QM/MM interface and the interaction terms for MM solvent effects has been proposed by Steindal et al.53 This approach is described as the fully polarisable QM/MM/PCM method (see Section 3 for a description of PCM), and is designed for the effective inclusion of a medium in a QM calculation. Short range solvent electrostatic potentials are described by an atomistic model (QM/MM) whilst the long range potentials are described by a continuum. The method is implemented in combination with linear response techniques with a non-equilibrium formulation of environmental response. The authors find a faster convergence with respect to system size for QM/MM/PCM than for QM/MM methods. This approach allows for reduction of the MM part of the calculation with PCM, allowing less demanding calculations, and reduced sampling. However, three-layered approaches such as this often require much more user input and method manipulation, for example, considerations for MM/PCM interactions have to be considered in addition to QM/MM interactions, and so such methods are suited only to advanced users.
The treatment of water can be rigid or flexible. Rigid models often include a fictitious H–H bond to constrain bond angles in the water monomer.3 Three of the most common rigid models for water are the TIP3P (transferable intermolecular potential 3P), SPC (simple point charge) and SPC/E (simple point charge extended) models, and their modified counterparts. These three models are effectively rigid pair potentials comprising LJ and Coulombic terms. However, the terms used differ in each model, and give rise to different calculated bulk properties for water.54 Values for various properties of water obtained with different rigid models of water are shown below, in Table 1.
MD calculations require the integration of Newton's equations of motion for all atoms, which is achieved through the evaluation of all atomic forces at each time step. Non-bonded interactions, especially long-range electrostatic interactions, dominate computationally, requiring extensive CPU time. In order to minimise this to an acceptable level, approximations are necessary. Boundaries are introduced into water models to restrain the system to a finite size, which almost always leads to artefacts in the obtainable data.54 The most commonly utilised method for cost-effective solute computations is the application of a spherical cut-off, limiting the number of pairwise interactions to those within a specified radius.54 The use of cut-offs for non-bonded interactions can have undesirable effects. LJ interactions are susceptible to small energetic effects, and large pressure effects induced by cut-offs. Pressure scaling can be used to correct for pressure related cut-off effects, usually to the order of several hundred bar. Cut-off effects for systems with dipolar electrostatic interactions are more prominent, with cut-offs selected within the parameters of experimental radial distribution functions up to ∼1.0 nm. However, computer simulations have shown ordering within water up to ∼1.4 nm, so the full structure of water is not typically accounted for, resulting in a poor description of dielectric properties. A further, and the most prominent, effect of cut-offs occurs in systems with full charges, where accumulation of the charge occurs at the cut-off boundary.59
Spoel et al.59 (1998) investigated the effectiveness of TIP3P, TIP4P, SPC, and SPC/E models in describing the density and energy, dynamic, dielectric and structural properties of water. All simulations and analyses were identical for each model investigated, allowing the evaluation of simulation methodology independent of the model. It was found that system size, cut-off length and reaction fields had comparable effects on the overall calculated structural properties of water.
System size effects are considered through the comparison of systems comprising a small (216) and a large (820) number of molecules. The average thermodynamic properties (ρ, Epot, T, P) are the same regardless of system size. Fluctuations in thermodynamic properties are known to be proportional to the square root of the system size, which is confirmed within the study. However, differences between large and small systems are observed, particularly for the dielectric constant, which is higher for all systems with a large number of molecules. The diffusion constant for large systems is also higher, attributed to periodic boundary conditions (PBC).
Cut-off effects are considered by the use of two different cut-off lengths (0.9 nm and 1.2 nm) for the large systems. It is found that density increases with an increased cut-off length, and energy decreases. There is no effect on dielectric behaviour.
In all simulations density decreased by approximately 1 kJ mol−1 on application of a reaction field. The self-diffusion constant D, and rotational correlation times were found to increase, indicating that the reaction field affects both the translational and rotational mobility of molecules.
Quantum chemical MD simulations of water are often developed with Density Functional Theory (DFT) methods, using either plane wave or atom-centred basis sets, to determine the electronic structure and forces. These methods offer reasonable estimates of the structural and dynamic properties of water when compared to experimental measurements. However, problems exist in the description of electronic gradient corrections, and equilibrium pressure. The interatomic forces of early quantum simulations, including DFT based methods, were originally parameterised with classical mechanics, leading to an unsatisfactory agreement between quantum and experimental results. DFT models also tend to calculate liquid structure with too much order, and underestimate equilibrium density. This is often attributed to the inability of local functionals to describe dispersion effects.
A recent approach to water simulation has claimed to provide a model, called the electronically coarse-grained model, capable of accounting for the shortcomings of both existing classical and quantum models.60 Jones et al.60 (2013) base their method on the replacement of valence electrons of an atom with an embedded Quantum Drude oscillator (QDO). QDO treatment of water is based upon the TIP4P classical rigid model of water, with the three water atoms supplemented by a dummy atom with a negative charge, added along the ∠HOH bisector to create an additional interaction point. The QDO parameters aim to reproduce the isotropic parts of the dipole, polarisability, and the dispersion coefficient. The dispersion interaction is then adjusted by scaling, whilst preserving polarisability. The baseline unadjusted model produces a realistic, but over-structured liquid with a density that is too low by up to 20%, attributed to its underestimation of dispersion. Note also that the value of the enthalpy of vaporisation (at ambient pressure) Δhvap was found at 40 ± 2 kJ mol−1, close to the experimental value of 43.91 kJ mol−1. Scaling the dispersion term results in an increased equilibrium density for increased dispersion. This induces a weakening effect on the H-bonding network of water, bringing the overall structure closer to agreement with benchmark data. However, the calculated Δhvap increases to 46 ± 2 kJ mol−1, which is 4% higher than the experimental value. It is also found that the H-bond network is sensitive to changing polarisation at fixed dispersion, affirming the independent importance of both polarisation and dispersion effects on an overall explicit model.
PCF can be interpreted as showing the probability against distance of there being an atom of interest at that distance from the atom under study. For example the first large blue peak in Fig. 6 would correspond to either a water H at a distance from an O atom under study or vice versa. These functions are experimentally determinable from scattering experiments. We would expect that the PCF/RDF would go to a constant value of 1 at large values of r (i.e. it would become isotropic, like a continuum model, as there are no solute interactions to perturb the system). However, at small values of r we would not expect this. At very small values (less than the van der Waals radii of the solute atoms) we expect zero as only one particle can occupy the space at a time. Just outside this distance we see sharp non-uniform behaviour as solvent in the space interacts favourably with the solute holding a more rigid form. This leads to troughs in the PCF/RDF just behind the peaks, thus deviating from the value of 1 for a uniform solvent (Fig. 6).
![]() | ||
Fig. 6 A schematic representation of PCF for liquid water; water oxygen – water hydrogen (blue) and water oxygen – water oxygen (red). |
hij(r1 − r2, Θ1 − Θ2) = gij(r1 − r2, Θ1 − Θ2) − 1 | (22) |
![]() | (23) |
![]() | ||
Fig. 7 Illustration of the contributions, both direct and indirect, to the total correlation function. |
To solve this equation, h(r) and c(r) need to be found. As we have only a single equation and two unknown functions, h(r) and c(r), another equation is required; a closure relation must be introduced. There are several such equations available from statistical mechanics. The exact closure relation is as follows:
g(r) = e−βU(r)+h(r)−c(r)+B(r) ⇒ e−βU(r)+T(r)+B(r) | (24) |
![]() | (25) |
h(r) = e(−βU(r)+T(r)) − 1 | (26) |
![]() | (27) |
Due to the spherical symmetry approximation, the MOZ can only be applied to simple solutions. Additionally, due to the high dimensionality of the full equation, before the spherical symmetry approximation was invoked, it was practically incomputable. For this reason a number of approximations have been developed which are collectively referred to as Reference Interaction Site Models (RISM).
The 3D derivation of RISM (3D-RISM)64,65 is a 3D molecular theory of solvation, applied through solvent distributions, rather than explicit solvent molecules, and conceives solvation structure and dynamics from the first principles of statistical mechanics.
3D-RISM is derived from a partial integration over the orientational degrees of freedom; this leaves a set of 3D integral equations (one equation per solvent site; Nsolvent). This method utilises solvent site – solute total correlation functions and direct correlation functions in the solution of the RISM equations. The 3D-RISM equations take the following form:62
![]() | (28) |
χξ,α(r) = ωsolventζγ(r) + ρα(hsolventζα(r)) | (29) |
![]() | (30) |
Implementations of both 1D- and 3D-RISM are available in well-known computational packages such as AMBER. There are also implementations in some quantum chemistry codes such as ADF.
Many studies have been conducted over the last two decades with a view to improving the accuracy of 3D-RISM for a variety of applications. Modifications to the original equations have included cavity corrections,67 parallelisation with fast Fourier transforms68 and MD modifications,63 amongst others.
The universal correction (UC)69 given in eqn (31) is a two parameter correction derived by regression. ΔGGFhydration refers to the Gaussian fluctuation hydration free energy (HFE) functional discussed below, a and b are regression coefficients (a = −3.2217 and b = 0.5783), and ρV is the dimensionless partial molar volume as calculated by 3D-RISM.
![]() | (31) |
Correction schemes for 1D-RISM also exist. These correction schemes must correct for additional approximations from the 1D RISM theory. A recent addition is the Structural Descriptor Correction (SDC). This applies QSPR methods and group contributions to correct 1D-RISM.66
A primary concern in the improvement of 3D-RISM remains its ability to describe the thermodynamic properties of solvation. One view adopted by Palmer et al.69 is that solubility calculations should be considered in terms of a simple thermodynamic cycle, calculating the solvation free energy from summation of the free energy of sublimation, and the free energy of hydration, as illustrated in Fig. 10.
A recent investigation by Palmer et al.62 implements the thermodynamic cycle approach to the calculation of solubility, with sublimation free energies calculated from crystal lattice minimisation and HFEs calculated with 3D-RISM. Crystal lattice calculations are performed on known crystal structures.
The authors highlight a plethora of existing approximate functionals which can provide HFE values from the solvent site-solute total correlation functions and direct correlations of 3D-RISM. However, the functionals investigated previously to Palmer et al.'s work often provide HFEs with RMSE errors higher than the standard deviation of experimental data, and worse than those reported in QSPR models.
The investigation62 implementing the thermodynamic cycle approach to the calculation of solubility applied the previous work of Palmer et al.61 and found that the thermodynamic cycle approach predicted HFEs in good agreement with experiment (R = 0.94, σ = 0.99 kcal mol−1). However, the predictions did not perform as well as purely empirical approaches, and this was mostly attributed to a lack of parameterisation against experimental data.
A = Esolute + Δμ | (32) |
The improvement associated with the insertion of explicit water molecules within a dielectric continuum has been demonstrated by Kelly, Cramer and Truhlar,74 who use the calculation of aqueous acid dissociation constants to demonstrate the effects of inserting a single explicit solvent molecule into a continuum solvent representation. Along with previous work,75 the authors show that in many cases an implicit solvation method is sufficient for the calculation of pKa values. However, when strong and specific solute–solvent hydrogen bonding interactions are expected to contribute significantly to the aqueous phase, a single explicit molecule inserted to the continuum significantly improves pKa calculation. Using their own implicit continuum model (SM6), it is found that addition of further explicit waters, up to three, significantly increases the accuracy of the calculation. However, the use of alternative continuum models, namely SM5.43R and PCM, finds a worsening of results when an increasing number of explicit atoms are added. This exemplifies the importance of choosing a suitable continuum representation in implicit–explicit hybrid models.
Zhu and Krilov76 discussed two flexible boundary hybrid solvation models for biomolecular systems, based upon the traditional hybrid model with both explicit and implicit solvent regions. The proposed models aim to account for short-range solvent effects via elimination of PBC by limiting the number of explicit solvent molecules to two or three solvation shells. The first model, the dynamic boundary model, imposes a confining potential on the solvent, which responds dynamically to fluctuations in solvent distribution and solute conformation. The second model, the exchange boundary solvation model, allows pairwise exchanges between the explicit and implicit regions of the system, maintaining a uniform hydration of the solute. Comparison of the two methods with traditional PBC methods shows good agreement between calculated energies, and the two models are found to improve computational efficiency by up to two orders of magnitude, attributed to the reduced number of explicit solvent molecules in comparison to other models.
Chaudhury et al.77 recently discussed the discrepancies between explicit and implicit methods for solvation models of biological systems such as proteins, and consequently investigate a Hybrid Replica Exchange Molecular Dynamics (REMD) method for protein solvation. Temperature-based REMD involves running multiple simultaneous simulations at a wide range of temperatures, while allowing temperature exchange between simulation steps. This relates the relative probability of finding each conformation at a given temperature to conformational energy. Traditional REMD successfully models small peptides and proteins, but becomes more cost-constrained for larger systems. In order to account for discrepancies between implicit and explicit methods, the authors propose a hybrid implicit–explicit method with each simulation step run exclusively in explicit solvent. During exchange between time steps, the entire solvent system is replaced with an implicit solvent model. Finally, the explicit solvent is re-inserted for the next simulation step. The use of an implicit solvent model during exchange significantly reduces computational cost. Where implicit and explicit models give different behaviours, the hybrid method gives mixed results in terms of thermodynamic and structural descriptions. However, the explicit model of solvent molecules describes solvent-specific features of energy landscapes well.
A further emerging method that similarly attempts to reduce the cost-constraints of explicit methods is Grid Cell Theory (GCT).78 GCT spatially resolves the enthalpic and entropic components of hydration on a 3D grid, covering a volume of space around a solute. The grid can be non-uniform and unevenly spaced. The solute is constrained to adopt a single conformation, speeding up convergence by only allowing rigid body translations and rotations of water molecules. A second benefit of GCT is that graphical analysis of a calculated grid is possible. A drawback of GCT method development emanates from the fact that there does not exist a unique method of partitioning a free energy into a sum of contributions, as contributions are susceptible to coupling. Gerogiokas et al.78 have recently proposed a GCT method, and evaluate the enthalpic and entropic contributions to hydration, making visualisation of hydration thermodynamics possible. GCT is a slower method than other thermodynamic integration methods, but such alternative methods are not as descriptive in terms of thermodynamic contributions.
We have placed particular emphasis on 3D-RISM and its derived counterparts, as we believe that RISM based methods are a strong contender in the challenge of finding a computationally viable solubility prediction method which is also descriptive enough for the theoretical study of a system's thermodynamics. However, it is also noted that such methods are a long way from perfection, and require further refinements of solute–solvent correlation functions.
With the increase of computing power, as described by Moore's law, it is hard to predict how much of an issue computational costs associated with solvation modelling will be over the coming years. However, increases in computing power will inevitably allow more accurate methods to be employed within a faster timeframe. We predict the emergence of hybrid models which describe the theoretical and physical components of solvation at an ever increasing rate, with the need to trade off accuracy over time becoming less as computing power increases.
Although future prospects for solvation modelling are bright, we are also aware that there is a very present need for good models. We would like to note that the best choice in model for solvation is entirely dependent on the requirements of the user. For high-throughput screening of molecules of similar structural features, we suggest QSPR/QSAR as a suitable and reliable approach for thermodynamic property calculation (e.g., solvation free energy). However, where specific physical and mechanistic meaning is desired, it is best to employ either explicit solvent representations, suitable for relatively small solute sizes, or where larger solutes are used, hybrid models. The choice of hybrid models for such investigations is not intuitively obvious, as highlighted within this review, as some systems are described sufficiently with addition of a single solute molecule, whereas for other systems it is necessary to add enough explicit solvent molecules to describe full solvation shells. Thus, it is often necessary to consider whether solvent behaviour is a significant contributor to the property of interest. If so, explicit/hybrid methods are advisable, dependent upon available computing resources. Otherwise, continuum models could offer sufficient physical description of the solvent environment. Of course, where sufficient and trustworthy experimental data are available, several models should be tested and evaluated for correlation with available experimental data.
Footnotes |
† Signifies that these authors contributed equally to this work. |
‡ A recent machine learning method and dataset proposed by some of the authors is available from the Mitchell group web server: http://chemistry.st-andrews.ac.uk/staff/jbom/group/Informatics_Solubility.html |
This journal is © the Owner Societies 2015 |