Repulsion–dispersion parameters for the modelling of organic molecular crystals containing N, O, S and Cl

In lattice energy models that combine ab initio and empirical components, it is important to ensure consistency between these components so that meaningful quantitative results are obtained. A method for deriving parameters of atom–atom repulsion dispersion potentials for crystals, tailored to different ab initio models, is presented. It is based on minimization of the sum of squared deviations between experimental and calculated structures and energies. The solution algorithm is designed to avoid convergence to local minima in the parameter space by combining a deterministic low-discrepancy sequence for the generation of multiple initial parameter guesses with an efficient local minimization algorithm. The proposed approach is applied to derive transferable exp-6 potential parameters suitable for use in conjunction with a distributed multipole electrostatics model derived from isolated molecule charge densities calculated at the M06/6-31G(d,p) level of theory. Data for hydrocarbons, azahydrocarbons, oxohydrocarbons, organosulphur compounds and chlorohydrocarbons are used for the estimation. A good fit is achieved for the new set of parameters with a mean absolute error in sublimation enthalpies of 4.1 kJ mol 1 and an average rmsd15 of 0.31 Å. The parameters are found to perform well on a separate cross-validation set of 39 compounds.


Introduction
Crystal Structure Prediction (CSP) has become a valuable tool for determining and understanding the solid states of an increasingly wide range of compounds and mixtures. As highlighted in the most recent (2016) blind test, 1 CSP methods for organic molecules are now applicable to a wide range of real systems, including salts, hydrates and large exible molecules, with at least one successful prediction being achieved for each target system in the test.
tted to the HF/6-31G(d,p) charge density; intramolecular energy contributions were neglected. 10,18,[39][40][41] As a result, such parameters are not necessarily consistent with more sophisticated electrostatic models such as those based on multipoles. In addition, only parameters relating to the most common atoms (C, N, O, H, Cl, F) were derived from extensive datasets. For atoms such as sulphur or bromine, parameters from a variety of literature sources are currently being used, a practice that is likely to result in inaccuracies. Furthermore, even within the set of atoms considered in the FIT and W01 parameterizations, some cross interactions (e.g., O-N) were not tted to experimental data; instead, they were approximated using geometric combining rules that are known to be of limited validity and may lead to inaccurate predictions. 42 Overall, as has been stated elsewhere, 13 force elds should be carefully parameterized so that the different terms complement each other. In practical terms, repulsion-dispersion potentials need to be reparameterized for the specic electrostatics and intramolecular energy models with which they are intended to be used. There has been a recent effort 42 in this direction through the partial reparameterization of the FIT set using the B3LYP/6-31G** and B3LYP/6-311G** levels of theory and multipole expansions for the electrostatics. Focusing on O and N parameters, the work led to encouraging results, with experimental unit cells being reproduced to within 3% and hydrogen bond geometries and sublimation enthalpies between 7.4 and 9.0%.
In this paper, we present a systematic approach for the reparameterization of the repulsion-dispersion term using experimental structural and sublimation enthalpy data. The proposed algorithm can easily be used to derive empirical parameters consistent with different electrostatic models or levels of theory. Here, its applicability is demonstrated by deriving a set of exp-6 repulsion-dispersion potential parameters for C, H, N, O, S and Cl consistent with electrostatic interactions described by distributed multipoles derived from the M06/6-31G(d,p) level of theory. For the rst time, this leads to parameters for sulphur that are consistent with the remaining parameter set and this is shown to lead to a signicant improvement in performance.

The lattice energy model
To relate computational models of crystal structures to experimental data, it is generally assumed that the experimentally-observed crystals correspond to local minima of the Gibbs free energy. Moreover, given the relatively small contribution of entropic effects at low temperatures and the complexity associated with their calculation, the lattice energy is usually employed as an approximation of the free energy. Thus, the objective of CSP is the identication of all low-energy local minima in the lattice energy surface.

Dening the crystal structure
In frequently-used codes in CSP, such as the CrystalOptimizer 27,28 code on which we focus, the lattice energy of a crystal is a function of the following independent variables: The unit cell lengths and angles, collectively denoted by X.
The positions of the centres of mass and the orientation of chemical entities within the unit cell, collectively denoted by b.
The molecular conformational degrees of freedom (CDFs), i.e. bond lengths, angles and torsions, collectively denoted by q.
It is also useful to introduce two dependent quantities, i.e. variables that are functions of the independent variables: Q(q) denotes the set of parameters in the distributed multipoles 43 describing the molecule's electrostatic eld for a given molecular conformation q.
Y(q,b) denotes the Cartesian coordinates of all atoms in the unit cell, which depend on the molecular positions and orientations b and the molecular conformations q.

Lattice energy model for CSP
The lattice energy of a crystal U latt can be expressed as the sum of intramolecular and intermolecular energy contributions: where DU intra is the intramolecular energy contribution (i.e., the electronic energy aer subtraction of the gas-phase internal energies of the chemical species in the crystal weighted by their stoichiometric coefficients) and U e and U rd represent, respectively, the intermolecular electrostatic and repulsion-dispersion contributions. The intramolecular contribution DU intra (q) in the above expression is typically computed via a QM evaluation of the conformational energy at given q.
The intermolecular electrostatic model is based on a distributed multipole representation of electrostatics, placing an expansion up to the hexadecapole moment at each atomic position (as in DMACRYS 15 and hence CrystalOptimizer 27 ). The GDMA program 30 is used to derive the parameters Q(q) from the isolated molecule wavefunction; 29 the latter is also determined by the QM calculation used for the computation of DU intra (q) at given q. The intermolecular electrostatic energy, U e , is calculated as the sum of interactions between multipoles on atomic sites of different molecules. Such a representation of electrostatics has been shown to be successful in predicting highly-directional (anisotropic) lone-pair interactions, p-p stacking in aromatic rings and hydrogen bond geometries in organic crystals. [44][45][46] The repulsion-dispersion component of the intermolecular energy, U rd , is calculated as the sum of all repulsion-dispersion interactions within a predened cutoff. Each interaction, U rd ij , between two atoms of types i and j occurring in two different molecules and separated by a distance r is described through an isotropic atom-atom Buckingham potential: where the interatomic separation r between any two atoms can be derived from X and Y. In many current CSP methodologies, the values of the adjustable parameters A ij , B ij , C ij for atoms of the same type (i ¼ j) are usually taken from the FIT set; 18,31-36 for unlike atom types (i s j), the parameters are usually computed via combining rules: although more recently, 42 parameters A ij and C ij for some atom types have been regressed directly from data. The most common elements C, N, O, S, F, Cl are treated as transferable atom types in the FIT parameterization, but three distinct atom types are employed for hydrogen, namely hydrogen bonded to carbon, H C ; hydrogen bonded to nitrogen, H N ; and hydrogen bonded to oxygen, H O . Finally, in calculating U e and U rd , summations of intermolecular atom-atom interactions between the central unit cell and other cells in the crystal must be computed. This is done via a combination of direct and Ewald summations within a predened cutoff which is set to 15Å by default. 15

Lattice energy model for parameterization
While the intramolecular energy plays an important role in CSP for exible molecules, accounting for conformational changes leads to signicant increases in the computational cost of energy evaluations. As a result, in keeping with previous work on the parameterization of force elds for CSP, 31-36 we rely on a dataset that contains structures of relatively rigid molecules with no exible torsions. For such a dataset, intermolecular interactions within the solid will not signicantly distort the conformation of the molecule, and the experimental values of the molecular conformation variables, q exp , can be expected to be close to those for the global minimum in the QM energy function for the isolated molecule, q gas . However, due to experimental error and theoretical approximations, there are typically some small differences between q exp and q gas . Thus, for the purpose of the calculations, we x the CDFs to the values of q gas consistent with the chosen level of the theory. The intramolecular energy, DU intra , is then equal to zero and as a result the lattice energy model takes the simplied form: U latt (X,b;q gas ) ¼ U e (Y,X,Q(q gas )) + U rd (Y,X,p) where p denotes the set of model parameters A ij , B ij , C ij for all atom-atom interactions.
To assess the validity of the lattice energy model, and in particular of the parameter values p, we seek a local energy minimum close to the given experimental structure, (X exp ,b exp ). The lattice energy computed for a given p, denoted by U * latt ðpÞ, is given by: where the notation "X exp ,b exp " denotes that the values X exp ,b exp are used as the starting point for the local minimization of the lattice energy; the semicolon preceding p and q gas indicates that these quantities remain xed throughout the lattice energy minimization. The above minimization problem is solved subject to all appropriate symmetry considerations, and its solution determines the geometry of the predicted structure, denoted by (X*,b*). The quantities U * latt and (X*,b*) form the basis for comparison between predicted and experimental structures, as discussed in the next section.

Parameter estimation
An extensive set of experimental crystal property data is required to determine appropriate values for the repulsion-dispersion parameters. More specically, the properties that can be most directly related to the lattice energy model are the geometries of experimentally-observed crystal structures determined by diffraction methods, and sublimation enthalpies.

Quantifying deviations in crystal geometry
The structural features predicted by the model can be assessed by comparison to different types of experimental structure data as available in the Cambridge Structural Database (CSD). 47 As experimental X-ray diffraction (XRD) data are subject to relatively small uncertainties, they are valuable for parameter estimation. However, there are well-known difficulties in determining hydrogen positions 39 and in determining the bond lengths of polar or high-order bonds (such as C^N, C]C, C^C) due to the fact that interatomic nuclei distances in XRD are dened as distances between electron densities. 48,49 Thermal effects may be another source of error in structural data: typically, unit cell lengths in organic crystals may increase by around 3% over a temperature range of several hundred degrees. 50 In order to minimize the potential effects of experimental errors, our study applied the following criteria in selecting appropriate experimental crystallographic data from the CSD: 1. Relatively rigid molecules should be chosen. 2. Solvate and hydrate crystals are avoided.
3. Structures at high pressures (above ambient pressure) and above room temperature are rejected.
4. Single-crystal diffraction data are preferable over powder diffraction data. 5. Structures resolved via neutron scattering are preferred when available. 6. Structures resolved with XRD with an X-ray discrepancy factor (R) exceeding 5% are normally rejected, unless their sublimation enthalpies are also available or other criteria are met.
Several distance metrics are available for quantifying the deviation between two given crystal structures. 51 In our work, the deviation between experimental and predicted structures for a crystal i is described by a metric G i (p) that consists of two contributions. The rst contribution measures the relative deviations between the experimental and predicted unit cell parameter vectors, X exp i and X * i respectively: where N X i is the number of unit cell parameters not constrained by symmetry, and the jth such parameter in crystal i is indicated by subscript j. The second contribution measures the deviations between atomic positions within the asymmetric unit. Given the fractional coordinates Y 0 x;i;j , Y 0 y;i;j and Y 0 z;i;j ‡ of an atom j in the asymmetric unit of a crystal i, its relative position vector with respect to a reference atom, § denoted by the subscript "ref", is given by:  The contribution of atomic positions to the deviation measure G i (p) between experimental and predicted structures for crystal i is then given by: where N asym i is the number of atoms in the asymmetric unit cell for crystal i. The calculation of the relative position vectors is described in detail in the ESI. † In order to avoid introducing an inherent offset in the deviation metric G i (p), the fractional atomic coordinates in the experimental crystal are computed based on the predicted (rather than the experimental) molecular conformation. This is appropriate provided that the differences between q gas and q exp are small, a condition that needs to be veried when constructing the experimental dataset.
Overall, the function G i (p) used as a measure of similarity between experimental and predicted structures is dened as: It is worth noting that the above expression aims to balance the contributions of various terms via the application of appropriate scaling factors. Thus, relative (rather than absolute) deviations are used for the unit cell parameters. No such scaling is needed for the position vectors as they are already normalized. Moreover, since the two summations may comprise very different numbers of terms, the rst one is divided by the number of unit cell parameters not constrained by symmetry, N X i , and the second one by the number of fractional coordinates describing the asymmetric unit, 3(N asymi À1).
The quantity G i (p) is directly incorporated in the objective function of the parameter estimation problem. In analysing the quality of the results of the parameter estimation, the root mean squared deviation of the 15 molecule coordination sphere, rmsd 15 , 52 is used as an additional measure of similarity between predicted and experimental crystals.

Quantifying deviations in sublimation enthalpy
The accuracy of the computed lattice energies can be assessed from sublimation enthalpies reported in thermophysical properties databases such as NIST 53 and DETHERM. 54 Uncertainties in measured sublimation enthalpies can be very large due to several factors such as polymorphism, a lack of standards for compounds with low vapour pressures, chirality, systematic errors associated with the measurement techniques and further errors that can be introduced in measurements by adjustment to a reference temperature (which is necessary when comparing different measurements). 55 Based on 451 measurements of 80 compounds, Chickos et al. 55 estimated the average uncertainty of sublimation enthalpies to be 4.9 kJ mol À1 . Consequently, our study does not make use of any sublimation enthalpy measurement with a reported error exceeding 4.9 kJ mol À1 . A calculated enthalpy of sublimation at temperature T and pressure P, DH * sub ðT; PÞ, can be obtained from the predicted lattice energy U * latt by the following commonly-used approximation: [56][57][58] DH * sub ðT; PÞz À U * latt ðT ¼ 0 K; PÞÞ À 2RT; (10) in which the zero-point energy is neglected. A derivation of eqn (10) is given in the ESI. † In previous studies by Arnautova et al. 59 and Gavezzotti, 60 computed lattice energies were directly compared to experimental sublimation enthalpies without any correction, an approach justied by the authors on the basis of uncertainty in the sublimation enthalpy data. We prefer to apply the correction shown in eqn (10) in order to reduce any systematic error in the comparison between predicted and experimental values. Thus, we employ the following deviation function as a measure of the difference between the experimental and predicted sublimation enthalpies for a given crystal i:

Obtaining globally optimal parameter estimates
Parameter estimation aims to determine the values of parameters p for which the calculated quantities best match the experimentally-observed quantities. This can be expressed mathematically as the minimization of a combined deviation metric R(p): where I G and I E are the sets of crystal structures for which we have reliable experimental data on the geometry and sublimation enthalpy respectively. In general, I E is a subset of I G , reecting the fact that reliable sublimation enthalpy measurements may not be available for all crystal structures under consideration. The constants w G i and w E i are non-negative weights that can be adjusted to reect differences in the reliability between different experimental data and to yield a desirable trade-off between geometry and energy reproduction.
In general, we seek to establish the values of p that lie within given lower and upper bounds, p l and p u respectively, and lead to a globally minimal value of R(p). To compute R(p) at a given set of parameter values p, we need to perform the lattice energy minimization (5) for each crystal structure i˛I G starting from the corresponding experimental structure. Once the optimal solution is obtained, we can compute the geometry deviation G i (p) and, if i˛I E , the sublimation enthalpy deviation E i (p). The quantities G i (p) and E i (p) can then be used to compute R(p) via eqn (12). Overall, this is a relatively expensive computation as it involves a large number of crystal structure minimizations, but these can be carried out in parallel.
A more serious challenge for the reliable solution of problem (12) is that the objective function R(p) may have multiple local minima, in which case the solution obtained via the application of standard local minimization algorithms may depend on the starting points (i.e. the initial guesses) for the parameters p.
In order to increase the likelihood of identifying the globally optimal parameter vector, we adopt a step-wise approach involving the generation of many starting points in the space of parameters p, followed by a gradient-based local minimization from selected starting points: 1. Compute the reference quantity R(p FIT ) where p FIT are the values of the parameters in the existing FIT parameterization.
2. Generate N points p k , k ¼ 1, ., N, in the parameter space by means of a Sobol' sequence 61,62 within the dened bounds [p l ,p u ]. The advantages of lowdiscrepancy (such as Sobol') sequences over other methods (e.g. Monte Carlo sampling or uniform grids) for efficiently searching multivariable domains are well established. 63 3. For each point p k , evaluate the corresponding R(p k ) as described above; discard all points for which R(p k ) exceeds the reference value R(p FIT ) by more than a specied factor a (e.g. 5) as these are unlikely to yield parameter estimates that are better than the FIT one.
4. Starting from p k as the initial guess, perform the local minimization (eqn (12)) to obtain a (locally) optimal parameter estimate p* ,k with a corresponding objective function value R(p* ,k ). As described in the ESI, † a sequential quadratic programming (SQP) algorithm with a Broyden-Fletcher-Goldfarb-Shanno (BFGS) 64 update of the Hessian matrix is used for this purpose.
5. Choose the solution p* ,k with the lowest corresponding value R(p* ,k ) as the best parameter estimate p*.
A owchart of the overall parameter estimation methodology is shown in Fig. 1.

Cross-validation of optimal parameter estimates
To further assess the quality of the optimal parameter estimates p* obtained by the procedure described in the previous section, we examine its ability to reproduce experimental geometry and sublimation enthalpy data both for the structures used for the estimation, and for a separate set of structures reserved for cross-validation purposes. The quality of geometry reproduction is assessed based on the usual rmsd 15,i measure between an experimental and predicted crystal i calculated using the COMPACK 52 code. We also consider the rmsd 15 averaged over a set of crystal structures.
For assessing the quality of sublimation enthalpy prediction for each individual crystal i, we use the Absolute Deviation (AD i ) dened as: as well as the average and maximum values of AD over a set of N str crystal structures: and maxAD ¼ max We generally aim to derive parameter estimates for which the average values AAD and rmsd 15 do not exceed 5 kJ mol À1 and 1Å respectively. 22

Parameterization strategy
For a parameter estimation study involving N A distinct types of atoms, there are N A (N A + 1)/2 possible interaction pairs that need to be considered, and consequently 3N A (N A + 1)/2 repulsion-dispersion parameters that need to be estimated from experimental data. Overall, even for relatively small N A , this is a difficult and computationally expensive problem to solve.
An approach commonly used for reducing the parameter space is to use combining rules for the parameters describing interactions between atoms of different types. The use of such combining rules for the dispersion coefficient does have some physical justication at the level of the Unsöld approximation to the r ij À6 dispersion. However, other rules are known to be limited in their accuracy, 65 and their use may lead to lower quality of predictions in crystal structure calculations. 32,42,66 Therefore, in this study we prefer not to use combining rules,  relying instead on the estimation of the cross-interaction parameters directly from experimental data. On the other hand, and in common with the approach adopted by Pyzer-Knapp et al., 42 we do hold the exponential B ij parameters xed at the FIT values: the shape of the potential is very sensitive to even small changes in these parameters, which renders their numerical handling as part of the parameter estimation procedure problematic.
Finally, in order to make the problem computationally manageable and ensure compatibility and transferability of the different sets of parameters, the parameter estimation is carried out in a sequential manner as shown schematically in Fig. 2. Similarly to the work of Sun, 67 optimal parameters for carbon and hydrogen (H c ) are determined rst based on experimental data for hydrocarbons. These are then xed for the subsequent steps, and interaction parameters for other atom pairs are estimated from data on appropriate compounds.

Illustration of methodology
The methodology proposed in this paper is illustrated for the estimation of parameters for the C/C, C/H C and H C /H C interactions using experimental data for hydrocarbon crystals (cf. the top box in Fig. 2). As indicated in the owchart in Fig. 1, we start by collecting relevant experimental data based on the general criteria described in Section 3.1. The experimental crystallographic data and energetic data collected for hydrocarbons are listed in Table S1 of the ESI. † This consists of 19 compounds and exhibits a diversity of hybridizations not present in the training sets for the FIT parameterization.
We then perform an isolated-molecule gas-phase QM calculation for each molecule using Gaussian09 (ref. 68) with the M06 (ref. 69) functional and a 6-31G(d,p) basis set. The resulting wavefunctions are used to generate multipole moments via the distributed multipole analysis method 43 as implemented in the GDMA 2.2 program. 43 We now x the three B ij parameters to their FIT values, and aim to determine optimal estimates for the 6 A ij and C ij parameters; the latter constitute our parameter vector p. For the purposes of this estimation, we will allow p to vary within AE40% of the FIT values. As explained in Section 3.3, in order to increase the probability of obtaining globally optimal parameter estimates, we use a Sobol' sequence to generate 1000 starting points p k , k ¼ 1, ., 1000. Before proceeding with actually performing local minimizations of function R(p) (cf. eqn (12)) starting from some of the above points, it is instructive to gain some insight as to the appropriate choices of the weighting factors w G i and w E i . For simplicity, we use the same weight values (w G or w E ) for all 19 structures i. We consider 15 distinct combinations of these weights as shown in Table 1, and evaluate the objective function R(p k ) for each such combination. The 19 000 local lattice energy minimizations required for evaluating R(p k ), k ¼ 1, ., 1000 are performed with DMACRYS at 0 Pa pressure.
For each weight combination, having evaluated the objective function R(p k ) at the 1000 Sobol' points p k , we select the Sobol' point with the lowest value of the objective function. This serves as an approximation to the optimal parameter estimate p* that might be obtained using that particular combination of weights. The solutions arising from different weight combinations are compared in Fig. 3 based on the geometry and sublimation deviation functions, G(p*) and E(p*), averaged over the 19 crystal structures under consideration.
The maximum and minimum values of G i (p*) across all 19 crystal structures i are also depicted on the horizontal axis of Fig. 3 while the maximum and minimum E i (p*) are shown against the vertical axis. We note that several weight ratios lead to the same solution; in fact, only 4 distinct solutions are obtained, as depicted by the solid symbols in Fig. 3. Based on these results, the weights w G ¼ 1 and w E ¼ 1 appear to offer both the best trade-off between geometry and energy errors, and the smallest variation of these errors across the different structures being considered. Accordingly, these values are chosen to be used throughout our parameter estimation methodology in this paper.
The values of R(p k ) for the weights w G ¼ w E ¼ 1 at the 1000 points p k are plotted in Fig. 4. The corresponding reference value R(p FIT ) for the FIT parameter values p FIT is also shown. We note that the FIT parameter point, p FIT , lies among the lowest points in this gure, which suggests that the FIT parameter values are already good estimates, a result that was expected given the similarity between the training set used in this work for hydrocarbons and that used by Williams et al. 31 The FIT reference value R(p FIT ) allows us to prune out any Sobol' points that are unlikely to result in an improved t of the experimental data over the already available FIT parameters (cf. step 5 of the algorithm in Fig. 1). Here we apply a cutoff ratio a of approximately 5, which excludes all but 250 of the original  Sobol' points from further consideration. We then solve the optimization problem expressed by eqn (12) starting from each of the remaining 250 points. The initial objective function values R(p k ) and the nal objective function values aer local minimization, R(p* ,k ), are shown in Fig. 5. It can be seen that the minimization does result in a signicant reduction of the objective function R(p).
The results suggest a "at" objective function with many broad and shallow minima; however, it is also possible that some of the points are not true local minima, but simply arise due to premature termination of the challenging local minimization.
The parameter values p* resulting in the lowest objective function value are given in Table 2 where they are also compared with the original FIT parameters. A comparison of the corresponding objective function values is presented in Table 3. The objective function value for p* is 63% lower than the corresponding value for the FIT parameters. The new parameters also achieve 23% and 72% reductions in the geometry and energy deviation functions respectively. Overall, the new parameters yield a signicantly better reproduction of both the geometries and energies of the crystals in the training set even in the case of such   a relatively simple and well-studied parameter set. Further details of the quality of t for each individual structure are presented in Table S4 of the ESI. †

Results and discussion
In this section, the stagewise procedure depicted in Fig. 2 is applied to derive optimal estimates of the repulsion-dispersion interaction parameters for a range of atoms. The parameters obtained are consistent with a distributed multipole electrostatic model derived from charge densities calculated at the M06/6-31G(d,p) level of theory.

Choice of training set and cross-validation set
The crystal structures of 106 organic molecular crystals including hydrocarbons, azahydrocarbons, oxohydrocarbons, organosulphur compounds, chlorohydrocarbons, amines, imidazoles, carboxylic acids and alcohols were collected from the CSD 47 for the purpose of the parameter estimation reported in this work. The training set was selected to contain molecules with each element in different bonding situations. For example, the training set of hydrocarbons contains alkanes and cycloalkanes with sp 3 hybridized carbons that form single bonds, aromatics with sp 2 hybridized carbons forming sigma bonds and the delocalization of electrons between the ring carbons and hydrocarbons with an alkynyl group of sp hybridized carbons that form a triple bond. Similar considerations underpinned the selection of molecules containing N, O and S, as summarized in Table 4. The molecular diagrams, CSD refcodes, numbers of molecules in the unit cell Z, space groups, R-factors and experimental temperatures are summarized in Table S1 of the ESI. † Gas-phase molecular conformations are computed at the M06/6-31G(d,p) level of theory. The rigidity of the molecules is ascertained through rmsd 1 comparisons between computed and experimental conformations with COMPACK, giving an average rmsd 1 of only 0.05Å. In a previous study by Arnautova et al., 59 the C-H bond lengths of the experimental conformations were adjusted to the average experimental value from neutron diffraction data to overcome issues of unreliable hydrogen positions determined by X-ray diffraction. We found the C-H bond lengths in the gas-phase conformations to be approximately 1.087Å, which is very close to the average experimental value of 1.089Å derived from neutron data given in the CSD. Therefore, no modication was applied to the C-H bond lengths.
Experimentally determined enthalpies of sublimation were identied for 53 of the 106 molecules, which is signicantly more than the number used in the   parameterization of FIT. The sublimation enthalpies used in this work are given in Table S1 of the ESI. †

Optimized parameters
The optimal parameter estimates for the 24 atom-atom interactions occurring in the datasets are listed in Table 5, where they are also compared with the corresponding FIT parameters. The optimized parameters for sulphur interactions are further compared with two additional sets of parameters, namely those by No et al. 70 (denoted as S-No) and by Abraha et al. 71 (denoted as S-Ab). These alternative parameter sets are presented in Table 6. The optimized parameters differ signicantly from the original set. However, because of the high degree of correlation between the A ij and C ij parameters, large changes in their individual values do not necessarily result in a large change in the potential. Fig. 6 shows the differences in equilibrium distances and well depths of the potentials. For most pairwise interactions, these differences are relatively small. A striking exception is the O/O interaction for which we observe a signicant change in the potential well depth and location of the minimum (cf. Fig. 7a), with the new parameters indicating stronger and shorter-range interactions. On the other hand, the differences between our potentials for the interactions of oxygen with other atoms and the corresponding FIT potentials are not very pronounced. This is an indication of the inaccuracies that may be introduced by relying on combining rules for estimating cross-interaction parameters. Shorter equilibrium distances and deeper wells are also observed for all S interactions apart from the H C /S one (cf. Fig. 6 and 7b and c). We note that signicant differences also exist among the FIT, S-Ab and S-No parameter sets as shown in Fig. S3 of the ESI. † Both oxygen interactions and sulphur interactions were tted to a number of sublimation enthalpy data. We note that no such data were used in the derivation of the FIT set. The resulting improvement in sublimation enthalpy prediction demonstrates the impact of using energetic data to t the potentials.

Performance of new parameters: sublimation enthalpy
Overall, the new parameter values result in a good t for the experimental structures in the training set. As shown in Table 7, the sublimation enthalpy AAD is  4.1 kJ mol À1 , which is 53% smaller than the corresponding FIT value while the average rmsd 15 of 0.31Å is practically identical to that achieved by FIT. In fact, for hydrocarbons and azahydrocarbons, the achieved sublimation enthalpy AADs of less than 2 kJ mol À1 are similar to the errors observed with some dispersion-corrected DFT methods. 7 There is also a decrease in the spread of the sublimation enthalpy errors, with a maxAD value of 16.7 kJ mol À1 compared to 38.7 kJ mol À1 with FIT. In addition to the training set, we employ a cross-validation set consisting of 39 crystal structures for representative molecules for all the classes of compounds considered here (see Table S2 in the ESI †). For six of these structures, the sublimation enthalpy is also available. The criteria for the choice of the crossvalidation set were similar to those for the training set (cf. Section 3.1) although, due to the limited availability of such data for rigid molecules, the requirements regarding experimental error were less stringent in some cases. The sublimation enthalpy results reported in Table 8 indicate similar improvements to those for the training set (cf. Table 7). However, the sublimation enthalpy AAD of 7.87 kJ mol À1 is larger than the corresponding value for the training set. This may be due to the relative sparsity of energetic data, which leads to reduced statistical signicance for the parameters. We also note that some of the experimental sublimation enthalpy values in the cross-validation set are less reliable than the enthalpies reported for the compounds in the training set. Fig. 8a shows parity plots between experimental and computed sublimation enthalpies for both the training and cross-validation sets. A signicant improvement over the FIT set is evident across both sets. Fig. 8b focuses on the

Performance of new parameters: rmsd 15
As mentioned in previous sections, the rmsd 15 between the optimized unit cells and the experimentally observed ones is also used to assess the ability of our optimal parameter estimates to accurately predict unit cell geometries. An rmsd 15 below 0.5Å is generally considered to indicate a good degree of agreement between two structures. In our case, the average rmsd 15 over the entire training set was 0.31Å. The rmsd 15 averages for each separate set, shown in Table 7, are below 0.5Å. Overall, we can conclude that there is a good agreement with the experimental structures for each class although a few individual structures show rmsd 15 values nearer 1.
As indicated by the entries in the last rows of Tables 7 and 8, our new parameter set does not have a clear overall advantage over FIT in terms of rmsd 15 . This is partly explained by the fact that we used a different metric for geometry in our objective function. In our methodology, improved geometry reproduction could be achieved by setting w G [ w E ), thereby placing greater importance on the geometry terms than the energy terms. However, the results presented here demonstrate that using equal weights leads to a signicant increase in the accuracy of the computed sublimation enthalpies while maintaining good accuracy in geometry reproduction. In particular, as shown in Fig. 9, the new parameters lead to fewer points showing large errors in the sublimation enthalpy (>5 kJ mol À1 ) or/and in the rmsd 15 (>0.5Å).

Concluding remarks
The fundamental premise of the work reported in this paper is that the values of the repulsion-dispersion parameters estimated from experimental data depend on the model of electrostatic interactions{ used in this estimation. Accordingly,  this electrostatic model needs to be the same as that used in the subsequent CSP calculations. The approach proposed here allows the systematic reparameterization of the repulsion-dispersion model to derive transferable parameters that are consistent with the rest of the lattice energy model. This is particularly important in the context of current CSP methodologies which are relying on increasingly sophisticated representations of electrostatic and intramolecular contributions based on ab initio quantum mechanical calculations. The estimation of the repulsion-dispersion parameters in the lattice energy model has been formulated mathematically as a weighted least-squares minimization of the deviation between the computed and experimentally measured crystal geometries and sublimation enthalpies. Important aspects of our formulation are the weighting and scaling schemes used for balancing the contributions of different elements of the experimental dataset. The proposed solution algorithm aims to mitigate the large computational cost associated with this problem and to reduce the risk of the minimization being trapped in local minima.
The feasibility and effectiveness of the proposed approach were demonstrated by the estimation of a set of atom-atom interaction parameters for the isotropic Buckingham potential, consistent with distributed multipole electrostatics derived from charge densities computed via isolated-molecule quantum mechanical calculations at the M06/6-31G(d,p) level of theory.k Interaction parameters between unlike atom types were also included directly in the estimation instead of being approximated via combining rules. The study reported here focused on organic molecular crystals with N, O, S and Cl atoms, and made use of a training dataset containing 106 crystal geometries and the sublimation enthalpies for 53 of these structures. A separate set of 39 structures (including 6 sublimation enthalpy values) was used for cross-validation of the derived parameter estimates. Overall, compared to the commonly used FIT parameter set, the new parameter values were found to result in signicantly improved sublimation enthalpy predictions while maintaining a comparable quality of geometry reproduction.
In the work reported here, the B ij parameters of the Buckingham potential were kept xed at the corresponding FIT values. This was mainly done in order to avoid the numerical problems caused by the unphysical predictions of this potential at low interatomic separation distances. Including these parameters in the estimation set could result in improved predictive accuracy. However, with the availability of systematic ways of estimating reliable and consistent parameter values for any form of repulsion-dispersion potential, it may be worth considering alternative potential forms that do not exhibit such unphysical behavior in the rst place.

Data statement
All underlying data supporting this article are available in the ESI. †

Conflicts of interest
There are no conicts to declare.