Simon
Boothroyd
a,
Andy
Kerridge
a,
Anders
Broo
b,
David
Buttar
c and
Jamshed
Anwar
*a
aChemical Theory and Computation, Department of Chemistry, Lancaster University, Lancaster LA1 4YB, UK. E-mail: j.anwar@lancaster.ac.uk
bPharmaceutical Science IMED Biotech unit, AstraZeneca, Pepparedsleden 1, Mölndal, 431 83, Sweden
cPharmaceutical Science IMED Biotech unit, AstraZeneca, Silk Road Business Park, Macclesfield, SK10 2NA, UK
First published on 19th July 2018
Solubility is a fundamental property of widespread significance. Despite its importance, its efficient and accurate prediction from first principles remains a major challenge. Here we propose a novel method to predict the solubility of molecules using a density of states (DOS) approach from classical molecular simulation. The method offers a potential route to solubility prediction for large (including drug-like) molecules over a range of temperatures and pressures, all from a modest number of simulations. The method was employed to predict the solubility of sodium chloride in water at ambient conditions, yielding a value of 3.77(5) mol kg−1. This is in close agreement with other approaches based on molecular simulation, the consensus literature value being 3.71(25) mol kg−1. The predicted solubility is about half of the experimental value, the disparity being attributed to the known limitation of the Joung–Cheatham force field model employed for NaCl. The proposed method also accurately predicted the NaCl model's solubility over the temperature range 298–373 K directly from the density of states data used to predict the ambient solubility.
There are three main approaches to solubility prediction: empirical, correlation-based methods,6 quantum mechanical (QM) continuum solvation models such as COSMO-RS,7 and molecular simulation.8 Correlation methods include quantitative structure property relationships (QSPR) based on molecular descriptors, with the parameters being optimised against a dataset of molecular structures with known solubilities. Such models are limited in their usage, breaking down when predicting solubility for molecules that are distinct from the training set. Furthermore, the solubility can only be predicted at the conditions (e.g. temperature and pressure) at which the training set data were collected. The continuum solvation approaches neglect sampling of the solvent degrees of freedom and involve parameterisation, in particular requiring a fitted value for the free energy of fusion for the prediction of solubility of solids.
Molecular simulation offers potentially the more powerful approach to solubility prediction, with the solubility being accessed via statistical mechanics. There are two distinct approaches: via calculation of the chemical potentials9 (summarised below), or direct (brute force) simulation of the dissolution of the solid in a solvent towards equilibrium.10 The latter requires large system sizes to minimise finite-size effects and very long simulations to attain the essential near equilibrium conditions.
At the solubility limit xs, the (undissolved) solid phase coexists with its solution. As the two are in equilibrium, the chemical potential of the solute in the solid phase (μsolid) and that in solution (μsoln) are identical at the given temperature T and pressure p, μsolid(T,p) = μsoln(T,p). Prediction of the solubility therefore requires in general the calculation of the chemical potential of the solute in solution for a series of concentrations, and then interpolation to find where it intersects the chemical potential of the solid (which is calculated separately). Both of these chemical potentials are accessible by molecular simulation. The chemical potential of the solid phase can be calculated via thermodynamic integration of an Einstein crystal11,12 or by quasi-harmonic lattice dynamics. Calculation of the chemical potential of the solute in solution is more demanding, though the methods are well established and include thermodynamic integration,13,14 the so-called perturbation approach,15–17 expanded ensembles,18,19 and variations on these.20 These methods involve ‘growing’ the solute molecule from its reference state reversibly in the solvent. While both thermodynamic integration (TI) and perturbation techniques are robust and effective (particularly when coupled with soft-core21 and dampening potentials22), large drug-like molecules are still challenging, and these methods are computationally very demanding. Each chemical potential determination requires at least a dozen or so separate simulations, which need to be repeated for any other temperature and pressure conditions of interest. To date there are only a few studies that have attempted to predict solubilities from molecular simulation via chemical potential calculations.19,23–28 Much of the focus of these studies has been on the alkali halides with NaCl becoming a model test case.
Here we present a novel method to calculate the solubility directly from the density of states of a system. Density of states (DOS) calculations are well established, being particularly effective and efficient for determining phase co-existence.29–31 The application of DOS methods however has been largely restricted to single, pure component systems. We utilise the DOS framework for multicomponent systems to access phase coexistence of a solid in equilibrium with its solution, and hence the solubility. The method in principle is able to predict solubility for a range of temperatures, pressures and solid forms using a single, density of states. It is more efficient than thermodynamic integration and the perturbation approach. We have successfully applied the methodology to predict the aqueous solubility of sodium chloride.
The isothermal–isobaric (NpT) partition function is given by
(1) |
(2) |
(3) |
If the density of states is known, the phase coexistence condition can be determined by exploring the probability distribution at a given pressure whilst scanning in temperature, or vice versa. The probability distribution of a single component at coexistence exhibits two peaks of equal area, indicating that both phases are equally likely under these conditions. A key feature of the DOS approach is that the density of states Ω(V,E) is independent of T and p. This means that, in principle, coexistence conditions can be determined for a range of temperatures and pressures all from a single density of states.29
We now consider a multicomponent system composed of a number of different molecular species i, j, k,…. Within this system, we allow the number of molecules of one component Ni to fluctuate, while the populations of the other components Nj,Nk,…, are kept fixed.
For such a system, the partition function is given by
(4) |
(5) |
As before, if Ω(Ni,V,E)Nj,Nk,… is known, exploration of the above probability distribution would enable coexistence conditions to be identified – including the sought-after coexistence point at which the solid phase of component i would be in equilibrium with its solution phase i.e. the solubility. Thus for a given temperature and pressure, tweaking the chemical potential for component i would yield a bimodal probability distribution as a function of number of particles Ni in the Nj,Nk,… mixture system at the solubility limit, from which the solubility concentration can be ascertained. The two coexistence states would be the 100% solute (solid) phase, and its saturated solution (Fig. 1).
We do not, however, need to determine the density of states for the whole spectrum of mole fraction values from xi = 0 (pure solvent) to xi = 1 (pure solute) as implied, though we could. Given that at the solubility limit, μsolid(T,p) = μsoln(T,p), one could substitute the chemical potential of the solid, if it were known, into the probability distribution (eqn (5)). This would guarantee that a peak is observed at xi = 1. A second peak would then be expected at some lower mole fraction, which would correspond to the solubility (Fig. 1). Thus, we can calculate the chemical potential of the solid phase separately, and therefore focus on a limited mole fraction range where the solute remains in solution; the solubility condition will reveal itself as a single peak in the probability distribution located at the corresponding concentration.
The primary challenge therefore is to access the density of states Ω(Ni,V,E)Nj,Nk,…, techniques for which are now well established.33 Here we employ a 3-dimensional variant of the efficient Monte Carlo scheme originally developed by Wang and Landau.32 Configurations are generated according to probability
(6) |
The density of states is evolved over a number of iterations, beginning with a (gross) value of lnf = 1. When the histrogram of visits h(Ni,V,E) is sufficiently flat (in our case, when the minimum value is greater than 80% of the average), the value of lnf is reduced to , and the histogram of visits is reset to zero for the next iteration.
To explore the (Ni,V,E) space associated with Ω(Ni,V,E)Nj,Nk,…, we employed Monte Carlo simulations involving particle translation, volume scaling, and solute insertion/deletion moves. The respective moves were accepted or rejected in accordance with the following criteria,27 which are valid provided that the volume is sampled logarithmically:
(7) |
As is well known, insertion/deletion moves present a particular challenge for dense systems and large solute molecules. Insertions of such molecules in dense systems are invariably rejected due to overlaps, while deletion of species with a high affinity for each other e.g. ion pairs, will often be unfavourable. Here we have devised a creative solution wherein we extend the sampled volume space for the liquid (solution) state to the gas phase for each of the Ni systems, and then proceed to carry out the particle insertion/deletion there (see Fig. 2).
The procedure to predict the solubility, therefore, comprises two distinct stages:
(i) Determination of the 2-d density of states Ω(Ni,V,E)Nj,Nk,… for each solution concentration (…, Ni − 1, Ni, Ni + 1,…), calculated (independently) in the NpT ensemble. The energy and volume ranges are chosen so that both the liquid and gas states are sampled at each particular Ni.
(ii) Determination of the density of states in the gas phase of the full assembly of multiple concentration systems (…, Ni − 1, Ni, Ni + 1,…) in an μVT ensemble (involving particle insertions and deletions) over the entire chosen concentration range, where the volume is chosen such that the number density of the system is sufficiently low that insertion/deletion moves become feasible.
As the density of states for each window is calculated to within a multiplicative constant, the individual density of states windows must be combined using a fitting procedure. This requires finding a set of offsets Cm using least squares, which minimises the error function
(8) |
This approach has significant advantages. Firstly, the insertion/deletion moves are favourable even for large solute molecules – the minimum, system number-density (maximum volume) sampled can be increased arbitrarily to accommodate this. Secondly, exploring the volume and concentration dimensions independently greatly reduces the space that must be explored. Instead of having to sample the entire, combined 3-dimensional energy, volume and concentration space (E–V–Ni), one essentially samples the 2-dimensional E–V and E–Ni spaces. Finally, to study broader temperature and pressure ranges, only the solution (liquid) portion of the windows need to be expanded (so as to cover the energies and volumes accessible to the system over the range of conditions to be studied), the rest remains constant. This significantly reduces the number of simulations that must be run when exploring temperature and pressure.
We explored two approaches for choosing the accessible volume and energy ranges for states between the liquid and gas regions, shown in Fig. 3. The first approach was to simply interpolate the accessible energies and volumes between the liquid and gas values. For the second approach, at low volumes (those accessible to the liquid) we allowed the system to explore energies ranging from the liquid values all the way to close to the gas values, essentially allowing the liquid to pass into a supercritical regime. At higher volumes, moving towards the gas volume, the system was restricted to exploring only the high energy states. This second pathway was found to give a much faster convergence of the density of states (possibly because the system navigates around the first-order gas to liquid transition), and hence was used in this study.
The energy range (for the whole system i.e. un-normalised by the number of molecules in the system) was discretised into bins of width 10000 kJ mol−1 while the logged volume range was discretised into bins of width 0.008. These values where chosen so that the curvature of the peaks in the probability distributions was sufficiently captured, which is also a good indicator that the curvature of the density of states has been sufficiently captured.
The initial value of the Wang–Landau convergence factor was set to 1.0 and was allowed to decrease to 2.384 × 10−7. By this point the relative change in the logged density of states between the current and previous iterations was sufficiently low, indicating that the density of states was converged. Further, both the chemical potentials and the probability distributions had also reached convergence by this point. It is crucial for this method that the density of states has indeed converged as small errors in the density of states can lead to large errors in the probability distribution.
The Monte Carlo code was parallelised using the scheme proposed by Vogel et al.35 to expedite convergence and precision. The (Ni,E,V) space was partitioned into small overlapping chunks with multiple walkers being assigned to each chunk. Three walkers were found to be optimal for each liquid–gas window chunk and four walkers for each gas window chunk.
Fig. 4 The probability distribution for the aqueous sodium chloride system at T = 298 K and p = 1 atm, averaged over five independent runs. |
The probability distribution reveals a dominant peak at about 13 NaCl pairs. Taking an ensemble average
(9) |
This value is actually roughly half of the experimental solubility of 6.14 mol kg−1. The disparity between the calculated and experimental solubility is due to the model itself (which is currently the best available).8 In relative terms the solubility prediction is decent given that aqueous solubilities predicted by continuum solvation methods are at best within 4-fold of experimental data and often worse. The handful of solubility predictions from molecular simulation that have been reported (including the current study) reveal the critical nature of the force field parameters. Coexistence points are known to challenge force fields but for the same reason serve as essential data points for developing and optimising force field parameter sets.
We then used the determined density of states to ascertain how the chemical potential of NaCl solutions varies as function of concentration, using two distinct approaches. Firstly, we calculated the chemical potential from the density of states for a series of NaCl concentrations by calculating the free energy as a function of concentration, to which a polynomial was fitted and then differentiated with respect to Ni. In the second approach we switched the independent–dependent variables and estimated the NaCl concentrations from probability distributions (as for NaCl solubility) corresponding to a series of chosen chemical potential values between −770.5 and −773.5 kJ mol−1. While both approaches were in good agreement, the latter approach yielded values closer to those calculated by others – the data for which are presented in Fig. 5 along with values presented in the literature for this model.8,26 As can be seen, the predicted values are in excellent agreement with the literature values, confirming that the presented DOS methodology not only offers a robust route to solubility prediction, but also enables the calculation of chemical potentials. In principle, these chemical potentials could be used to obtain the mean ionic activity coefficient of NaCl37 as a function of temperature, provided the chemical potential of NaCl at infinite dilution is also estimated.
Fig. 5 The chemical potential of the JC/SPC/E NaCl model as a function of concentration as calculated by this work (crosses), Vega et al.8 (triangles), Panagiotopoulos et al.26 (squares). |
As a further validation of the method, the solubility of the JC/SPC/E NaCl model was calculated for a range of temperatures between 298 K and 374 K, from the same density of states surface as used for the calculation at 298 K. The predicted solubility as a function of temperature is presented in Fig. 6. For each of these calculations, the chemical potential of the NaCl crystal is required at the respective temperature, which was calculated following the procedure outlined by Argones et al.38 and is presented in Table 1.
T/K | μ solid/kJ mol−1 |
---|---|
313.00 | −770.288(2) |
333.00 | −769.359(2) |
353.00 | −768.473(2) |
373.15 | −767.610(4) |
These chemical potential values of the NaCl solid, along with the density of states, were inserted into eqn (5) in order to generate probability distributions for each temperature, from which the NaCl solubility was determined as before. Counter-intuitively, the solubility of the NaCl model actually decreases as the temperature increases. This result has also been observed by Mester and Panagiotopouls,26 who calculated the solubility of the Joung–Cheatham model of NaCl at 373.15 K to be 3.01(5) mol kg−1 – in close agreement with our value of 2.99(4) mol kg−1. This unexpected behaviour is again attributed as a limitation of the model itself.26
A possible issue with the density of states approach for determining coexistence points is the potential for inadequate sampling of the coexistence states. The required nucleation step characterising first-order transitions (particularly the solid–liquid transition) is often suppressed as the creation of a surface involves an energy penalty. This is not an issue for the solubility prediction approach developed here. We are not sampling the dissolution of the solid nor its crystallisation but rather determining the density of states for the most part of the solution state, albeit around saturation.
There are three main sources of error within the methodology: errors associated with insufficient sampling, detailed balance not being satisfied, and the saturation of error caused by the modification factor reduction scheme. The errors due to saturation and detailed balance have been discussed at depth in the literature,29,39 and are expected to be small relative to the sampling error. Notably, the overall estimated errors in the solubility and chemical potential calculations as determined from five independent sets of simulation were relatively small.
The efficiency gain of the DOS approach with respect to total computing requirement is probably not that marked for a single point solubility calculation relative to existing methods, such as TI or perturbation. The key gain arises from the DOS approach's inherent ability to predict solubilities over a range of temperature and pressure conditions from a single density of states. The calculation of the solid chemical potential is the same regardless whether the DOS approach or an existing methodology is employed, and so no efficiency is gained here. For the solution phase, a chemical potential calculation for a single concentration at a single fixed temperature by TI would involve around 21 simulations, one per lambda state.8 Calculating the chemical potential of, say 10 concentrations, would then require 210 simulations. Should these then be repeated for the four additional temperatures considered by us, would result in a total of 1050 simulations. In contrast, for the DOS approach, we will probably need to investigate between 10–20 different concentrations to locate and fully capture the probability distribution that corresponds to the solution at saturation. For each of these concentrations we require three simulations – one to calculate the DOS of the solution phase, one to connect the solution and supercritical states, and one to connect the supercritical and gas states. For say 20 concentrations, this would amount to 60 simulations, plus one additional DOS calculation in the gas phase that connects the individual concentrations. Comparing the 61 DOS simulations with 1050 TI simulations yields a speed-up of about 17×.
While the DOS method has been applied only to a simple ionic system, we do not expect any significant challenges in extending the approach to larger solute molecules (including drug-like) in both aqueous and non-aqueous solvents. The switching from ions to molecules only requires a change in the density of the gas phase, avoiding the problematic creation and annihilation of particles in a condensed phase, which is a key benefit of the proposed method. Further, as the method samples only according to the density of states (i.e. entropy space), thermal barriers, such as those limiting dihedral rotations are expected to be less of an issue here than perhaps in other methods. For more challenging flexible molecules, the method could be coupled with established configurational-bias Monte Carlo moves to facilitate more efficient sampling of their molecular degrees of freedom.
In summary, we have developed and demonstrated a density of states approach to predicting solubility from molecular simulation. The method entails calculation of the density of states for a multicomponent solution, followed by exploration of the probability distribution as a function of number of solute particles in the system and the chemical potential of the solid, to identify coexistence conditions corresponding to the solubility. The density of states calculation is made possible by a unique pathway that avoids the problematic annihilation and/or creation of particles which is common to established methods. Consequently, the method is expected to perform well even for large, drug-like molecules. Further, it is able to yield, relatively efficiently, solubilities over a range of temperatures and pressures. The predicted solubility of the NaCl model at 298 K was found to be in close agreement with the literature.
This journal is © the Owner Societies 2018 |