Christopher R. Taylor,
Patrick W. V. Butler and
Graeme M. Day*
School of Chemistry, University of Southampton, Southampton, SO17 1BJ, UK. E-mail: g.m.day@soton.ac.uk
First published on 3rd June 2024
Computational crystal structure prediction (CSP) is an increasingly powerful technique in materials discovery, due to its ability to reveal trends and permit insight across the possibility space of crystal structures of a candidate molecule, beyond simply the observed structure(s). In this work, we demonstrate the reliability and scalability of CSP methods for small, rigid organic molecules by performing in-depth CSP investigations for over 1000 such compounds, the largest survey of its kind to-date. We show that this highly-efficient force-field-based CSP approach is superbly predictive, locating 99.4% of observed experimental structures, and ranking a large majority of these (74%) as among the most stable possible structures (to within uncertainty due to thermal effects). We present two examples of insights such large predicted datasets can permit, examining the space group preferences of organic molecular crystals and rationalising empirical rules concerning the spontaneous resolution of chiral molecules. Finally, we exploit this large and diverse dataset for developing transferable machine-learned energy potentials for the organic solid state, training a neural network lattice energy correction to force field energies that offers substantial improvements to the already impressive energy rankings, and a MACE equivariant message-passing neural network for crystal structure re-optimisation. We conclude that the excellent performance and reliability of the CSP workflow enables the creation of very large datasets of broad utility and explanatory power in materials design.
A major resource for such efforts has been the growth of accessible, curated databases of crystal structures and (some of) their properties. The most general and widely-used include the Cambridge Structural Database (CSD)1 for organic and organometallic systems, and the Inorganic Crystal Structure Database.2 These resources are unparalleled in their volume of experimental crystal structure information, but do not currently offer information about calculated or hypothetical structures. Historically, such data was limited to specialised areas (such as the Atlas of Prospective Zeolite Structures3), though recent developments have made remarkable progress in generalising this concept, including extensions to the Crystallography Open Database (COD)4–7 and the Materials Project.8 In the field of organic molecular crystals, however, much of our understanding of the rules of crystal packing derive from databases of experimentally observed structures.
Modern computational chemistry, employing both molecular and solid-state simulation techniques, can add significantly to the information that is available from experimentally determined crystal structures, and identify previously unverifiable trends. Computational studies of polymorphism9–13 have evaluated the typical lattice energy differences between crystalline polymorphs of organic molecules, studies of conformations of flexible molecules11,14 in their observed crystal structures have improved understanding of the limits of molecular strain in stable structures, and studies of the thermodynamics of co-crystallisation15–18 have aided in rationalising a complex phenomenon with ramifications for experimental design.
A more complete and, crucially, predictive view of organic crystal packing can be obtained from crystal structure prediction (CSP).19,20 A core concept in CSP is the crystal energy landscape (or CSP landscape) – the set of plausible crystal packings for a chemical species (or combination thereof), representing an exploration of the crystalline configuration space to identify candidate structures predicted to be stable (which ideally includes any observed structures), ranked in terms of their energetic stability. These CSP landscapes are of great utility in understanding and rationalising the thermodynamic and kinetic behaviour of crystal systems; multiple low-energy minima that are close-lying on a CSP landscape may be indicative of a significant risk of polymorphism, while dynamical simulations exploring these landscapes give insight into the kinetic trapping of metastable forms21 and the observed absence of other predicted forms.22,23
Moreover, the energy landscape forms the foundation of the energy–structure–function maps that have in recent years demonstrated great power in materials discovery.24–27 By associating computed properties (e.g. gas uptake, charge carrier mobility) with hypothetical crystal structures from the CSP landscape, it becomes possible to predict whether a molecule is a promising candidate for creating new functional materials, i.e. if it has one (or more) favourably-ranked crystal structures which are predicted to achieve the desired property.
The techniques and challenges in the field of organic CSP have been reviewed;19,20 we provide a brief overview for the sake of context. CSP is typically considered a combination of two broad challenges: efficient and thorough sampling of the configuration space of crystal packing, and accurate, cost-effective structural optimisation, ranking, and property calculation.
The sampling of hypothetical crystal packing arrangements is made extremely difficult due to the “curse of dimensionality”—the number of independent degrees of freedom to sample creates a vast configurational space. As a result, simple grid-based sampling approaches must be eschewed in favour of more sophisticated techniques, such as low-discrepancy quasi-random sampling28,29 and genetic algorithm approaches.30
Meanwhile, the optimisation and ranking methods must be accurate enough to describe the fine balance of different intermolecular interactions (electrostatics, dispersion, hydrogen-bonding, etc.), resolving lattice energy differences often smaller than a kJ mol−1, while sufficiently cost-effective to be applied to very large numbers (≫105) of trial crystal structures. Historically, this has entailed the use of simple empirical force fields, but modern developments often employ tailor-made force fields18,31 or machine-learned potentials derived from ab initio calculations. Still, empirical force fields retain their power even today due to their efficiency and broad transferability, often being the initial step of a hierarchy of increasingly accurate (and expensive) energy models employed in one CSP workflow.
Despite these two broad and ongoing challenges, organic CSP has demonstrated enormous success in diverse applications, crucially proving itself to be a truly predictive technique, guiding synthesis and discovery of novel forms of porous materials,24 highly-flexible pharmaceutical molecules,31 co-crystals,32 simple molecules previously thought to be monomorphic33 and templating of predicted metastable polymorphs.34 A recurrent landmark in the field is the series of Blind Tests of CSP, showcasing the diversity of methods (and success rates thereof) employed within different CSP techniques to predict experimental crystal structures without any knowledge beyond the molecular chemical diagram.35 Recent iterations of the Blind Test have demonstrated that CSP method development is successfully keeping apace with the complexity of molecules and crystal structures specifically selected to stress-test it.
Some of the most consequential recent developments in CSP have employed machine learning (ML) techniques in one form or another.36 Among the most intuitive applications is the use of ML to learn relationships between the structure and lattice energy, either by learning the difference (i.e. Δ-ML) between lattice energies computed at a lower level of theory to those at a higher level,37–40 or to learn the relationship between the lattice energy and the ML descriptors directly through the training of ML potentials.41,42 These approaches have achieved high accuracy predictions at a fraction of the cost of the full periodic density functional theory (DFT) reference calculations – particularly significant given the latter's ubiquity in recent Blind Test entries.
That said, ML techniques have further applications in CSP beyond improved optimisations and energy rankings. In particular, ML and related approaches applied to databases of experimental crystal structures and properties have demonstrated success in predicting NMR chemical shifts43 and in formulating models of molecular hydrogen bond propensities within crystals.44 Recent work by Cersonsky45 has demonstrated machine-learning of the relationship between crystal lattice energies and the relative contributions to these by different chemical functional groups, paving the way for data-driven insights into new crystal engineering techniques. ML has also been shown to have potential to enhance the analysis of crystal energy landscapes, by identifying structure–function relationships that might evade simple inspection but nevertheless offer explanatory and predictive power.46
Our aims in this work are threefold. Firstly, we seek to demonstrate the capability of our method for efficient, large-scale rigid-molecule organic CSP by presenting the results of the largest to-date CSP study, applying the methods to over 1000 molecules with observed crystal structures in the CSD. Secondly, we in turn assess the quality and reliability of our method by evaluating how often the experimental crystal structures of these molecules are reproduced in our CSPs, and how well they are ranked energetically compared to hypothetical structures on their CSP landscapes. Finally, we demonstrate example applications of this dataset, including assessing the distribution of CSP-derived space group preferences as a function of predicted lattice energy, spontaneous resolution of chiral molecules and the training of transferable machine-learned energetic models from a very large set of CSP landscapes.
Distributed multipole analysis (DMA,53,54 using the GDMA package) was performed on the resulting molecular conformers' charge densities to obtain atom-centred multipoles, as part of the model potential applied during lattice energy minimisation of trial crystal structures; multipoles up to hexadecapole were calculated for all atoms. The MULFIT55,56 software was used to fit atomic point charges for each molecule to best describe the molecular electrostatic potential. The resulting molecular conformers and their sets of multipoles are then used as the inputs to CSP, with each unique molecule represented by a single conformer and corresponding set of atomic multipoles and charges.
Space group symmetry is used to reduce the dimensionality of the search space, so that only the position and orientation of molecules in the asymmetric unit are sampled, with all other molecules in the unit cell generated by symmetry. In this study, we restrict ourselves to generating crystal structures with one independent molecule in the asymmetric unit (Z′ = 1). We sampled the 26 most commonly observed space groups for organic molecular crystals (listed in ESI†); these space groups cover over 99.4% of Z′ ≤ 1 structures in the CSD. These space groups were sampled equally, irrespective of their observed frequency in the CSD: quasi-random structures are generated and lattice energy minimised until 10000 successfully energy minimised crystal structures were found in each space group (260000 structures per molecule). The CSP process is highly parallelisable, as each crystal structure is independent.
The trial crystal that passed geometric checks were lattice energy-minimised in three stages. Non-electrostatic interactions (principally intermolecular dispersion and exchange-repulsion) were described by the FIT exp − 6 force field,57,58 supplemented by fluorine parameters from Williams and Houpt.59 At the final stage of optimisation, performed using DMACRYS,58 the FIT potential was applied with atomic multipole electrostatics. Full details are provided in ESI.†
It is commonplace that multiple unique initial configurations optimise to the same local energy minimum. We remove these duplicates by fast comparison of simulated powder X-ray diffraction patterns, followed by structural comparisons using the COMPACK60 algorithm as implemented in the CSD Python API.
The initial dataset (before the active learning iterations) was generated by randomly selecting up to 10 low energy predicted crystal structures for each compound, resulting in a dataset of 10249 structures approximately evenly distributed across the compounds. To evaluate transferability, the total dataset was partitioned into a training dataset consisting of the CSP structures for a randomly selected ca. 85% of the compounds and an extrapolation test set consisting of CSP structures for the remaining compounds. A further in-domain test set was formed by randomly extracting one structure per compound from the training set. The NNPs were then trained on the remaining training dataset to yield a Δ-ML model capable of predicting the lattice energy correction. The standard deviation between the ensemble of NNPs was used to estimate the uncertainty of predictions, and was exploited in the active learning iterations to add high uncertainty candidates from the remaining low energy predicted structures of the training compounds, which overall added a further 1000 structures to the training set. Corrected CSP landscapes were calculated using the final model by adding predicted lattice energy corrections to the FIT+DMA energies (FIT+DMA+Δ-ML). Additionally, for performing unconstrained geometry optimisations MACE equivariant message-passing neural network (MPNN) models were trained using a dataset derived from that of the NNP correction model. Further details of the datasets and machine learning models are provided in the ESI.†
Functional group | Count | Functional group | Count | Functional group | Count |
---|---|---|---|---|---|
Benzene | 418 | Ether | 572 | Ester | 134 |
Pyridine | 76 | Amide | 297 | Carbonyl | 702 |
Imidazole | 35 | Imide | 43 | Imine | 13 |
Furan | 22 | Lactone | 91 | Secondary amine | 405 |
Piperdine | 51 | Epoxide | 73 | Tertiary amine | 537 |
Urea | 22 | Ketone | 269 | Halogen (F) | 240 |
As a measure of the typical deviation in molecular geometry between the observed crystal structures and the DFT optimised molecules used in CSP, we present in Fig. 2 the histogram of all-atom root-mean-squared deviations (RMSD) in atomic positions between the gas-phase conformers used in CSP and their in-crystal initial conformations as extracted from the CSD.
As might be expected from such rigid molecules, the average RMSD after gas-phase optimisation is very small – approximately 0.11 Å, which corresponds to e.g. adjusting the C–H bond lengths in fluorobenzene by 0.16 Å. While the distribution is skewed towards small conformational changes, the outliers with larger RMSD values demonstrate the limitations of defining molecular flexibility in terms of rotatable bonds alone. The largest RMSD values correspond to systems where changes in ring conformation cause large overall molecular changes – the largest observed RMSD of 0.60 Å occurs in 7-oxa-1-azaspiro(4.4)non-1-en-6-one 1-oxide (CSD reference code: DOBYOJ), in which a 5-membered saturated ring can twist about a spiro carbon centre. The highest molecular RMSD values are largely associated with molecules containing spiro carbon atoms (denoted by the red portion of the bars in Fig. 2).
Regardless, the overall inflexibility displayed by this set serves as a strong indicator that our assumption of a single, near-crystalline conformer of each molecule is reasonable, and unlikely to be a systematic source of error in predicting many known structures of molecules in this set.
In contrast, our rigid molecule set displays a diminished frequency of hydrogen bonding (H-bonding) in the observed crystal structures compared to the CSD more generally, as might be expected by excluding rotatable bonds. 26.7% (277) of crystal structures of molecules in our subset contain at least one intermolecular H-bond. This compares to 62.3% of structures of molecules of similar size without rotatable bond restrictions, demonstrating the bias introduced by the omission of common H-bonding groups due to their flexibility, including alcohols (–OH), carboxylic acids (–COOH), and primary amines (–NH2). However, the proportion of H-bonded systems is still significant enough that meaningful H-bond chemistry is incorporated in our survey, albeit underrepresented. Consequently, our set in turn over-represents chemistry such as π–π stacking and “weak” (i.e. more isotropic, less localised) interactions, and assessments of our CSP energy model's performance must be made with these biases in mind.
The 26 space groups included in our standard search include those with an observed frequency above 0.05% in the CSD. Naturally, in a very large survey of molecules, we include some that crystallise in less frequent space groups; for these 8 cases, we added the space group of the observed crystal structure to the search. In 3 additional cases, the symmetry of the observed structure means that it could not have been sampled without performing CSP with multiple independent molecules (Z′ > 1); in these cases, we include the datasets in our study, knowing that the observed crystal structures could not have been located. Of those that could have been located, the searches find matches for 1034 of the 1040 observed crystal structures: reasons for the 6 missed matches are discussed below.
A main assumption in CSP is that the observed crystal structures of a molecule correspond to the lowest energy possible structures. We use this assumption to assess the quality of the calculated energies. Fig. 5 summarises the energetic ranking of experimentally observed crystal structures within the CSP landscapes via the distribution of ΔE, the energy difference between the prediction matching the experimental crystal structure and the global lattice energy minimum crystal structure on their parent molecule's CSP landscape; an example landscape is shown in Fig. 4. In the case of molecules with chiral centers and only one stereoisomer present, ΔE is calculated only among Sohncke space groups (those whose symmetry elements contain only translations, rotations and roto translations).
Of the 1034 experimentally determined crystal structures where a matching structure was identified on the CSP landscape, 424 (41%) correspond to the global lattice energy minimum from CSP. Those that do not correspond to global energy minima could be due to inaccuracy of the model potential (FIT+DMA), neglect of other contributions to the lattice free energy or where the kinetics of crystallisation favour a metastable structure. Nyman and Day found that, for crystals of rigid molecules, lattice vibrational contributions to room temperature free energy differences between polymorphs rarely exceed 2 kJ mol−1;9,10 767 (74%) of the observed crystal structures are found within 2 kJ mol−1 of the global energy minimum – the estimated error introduced by neglecting lattice vibrations and thermal expansion in the CSP calculations.
Furthermore, the known structure(s) almost always lie within 8 kJ mol−1 of the global lattice energy minimum (1011, 97.8%, of the observed crystal structures), consistent with the observation10 that known polymorphic pairs of small, rigid molecules are rarely separated by more than this energy difference. Thus, the energy model used here, combining empirically parameterised repulsion–dispersion with atomic multipole electrostatics, provides energy ranking of crystal structures that is consistent with observed polymorphism, within the limits of temperature-free lattice energy-based predictions.
We quantify the geometric quality of the predictions using an all-atom RMSD within 30-molecule clusters (RMSD30) from experimentally-determined crystal structures and their corresponding match within the CSP sets. A histogram of RMSD30 (Fig. 6) shows that geometric agreement is generally very good: RMSD30 is below 0.4 Å in 78.9% (816) of matches. As a visualisation of this level of agreement, Fig. 7 shows an overlay of the X-ray determined crystal structure of (1aR,2aS,5aS,5bS)-perhydro-4H-oxireno(3,4)cyclopenta(1,2-b)furan-4-one62 (CSD reference code SIBJIX) and the predicted global lattice energy minimum, with RMSD30 = 0.393 Å. As a reference for these values, consider the RMSD30 between structural determinations of the same crystal structure at ambient and low temperature: RMSD30 = 0.204 Å between neutron diffraction crystal structures of naphthalene at 5 K and 295 K, and RMSD30 = 0.160 Å between 20 K and 330 K crystal structures of form I paracetamol.63 327 matches have RMSD30 below 0.204 Å, i.e. have geometric deviations that are of a magnitude that can be explained by the temperature-free nature of structural optimisations used in CSP. While known crystal structures are reproduced very well by CSP in most cases, there are a small number where agreement is less satisfactory: in 32 cases (3% of structures), RMSD30 > 1 Å. Despite what might be assumed, we find no significant correlation between the RMSD30 of the experimental match and the molecular RMSD of the parent gas-phase conformer used for CSP. Assuming an experimental match is found, the geometric quality of the match is only weakly sensitive to the difference in the molecular conformation used for CSP; even the most extreme outlier in molecular conformational change (the aforementioned DOBYOJ) achieves a reasonable geometric match to the experimental crystal structure, with RMSD30 = 0.654 Å.
It is evident that an overwhelming proportion of such rigid molecules can successfully be treated with the CSP workflow implemented here. Our sampling procedure followed by a cost-effective, approximate minimisation method successfully locates the vast majority of observed crystal structures with excellent geometric agreement with the experimental structure, and routinely ranks them as among the most stable structures on the landscape.
One missed match was for the molecule 4,8b-dihydropyrrolo[3,4-b]indole-1,3(2H,3aH)-dione (CSD reference code BUVGAC), which displays a large deviation in molecular geometry between the observed crystal and the gas phase optimised geometry used in CSP. Flexibility in the fused ring system allows the molecule to “fold up” in the gas-phase optimisation, sufficient to change its packing behaviour. As a result, the experimental crystal structure is not a minimum on a CSP landscape derived from the gas-phase conformer.
In the case of DNNEPH10 (1,8-dinitroso-naphthalene), we fail to find the experimental structure despite an apparently rigid, planar molecule that changes very little in the gas-phase optimisation. However, even optimising the known crystal structure using our FIT+DMA energy model results in a final structure that does not match, i.e. the experimental structure appears to be unstable at our level of theory. This may indicate a failure of our energy model for a case of somewhat unusual chemistry.
Two of the missed matches are unusual cases in which a refcode “stem” (the initial six letters, typically shared by a “family” of multiple CSD entries of the same species) is used by crystal structures containing distinct chemical bonding arrangements. Our procedure takes a single representative of a CSD refcode family as the source for the molecular connectivity, which was assumed to stay fixed. In the case of refcodes XUGHUD/XUHGUD01, there is a tautomeric difference, which led to no match with XUHGUD01. For IHEPUG/IHEPG02, the subject molecule is a diastereomeric fused ring system, which exists in an anti configuration in IHEPUG, but a syn configuration in IHEPUG02. It is arguable that these crystal structures should not be considered part of the same “family”, as these distinct bonding arrangements are not interconvertible. CSP was performed with the isomer found in IHEPUG, so no crystal structure matching IHEPUG02 was located. These two systems demonstrate a shortcoming of our approach, in that we assumed that a given CSD refcode “stem” always denotes the same molecular connectivity, including protonation states. Fortunately, these are the only instances of this assumption failing in the entire set.
Our final missed match occurs for QIBCEK, benzo(f)phthalazin-4(3H)-one, another rigid planar molecule. However, upon inspection, we posit that there are flaws in the experimental determination of this structure as held in the CSD – there are extremely close hydrogen contacts (<1.4 Å) and an unsatisfied potential hydrogen-bonding arrangement despite 1:1 donor–acceptor availability. While we retain it as a missed match for the purposes of conservatively assessing our CSP method's performance, we also emphasise that such a large-scale, unbiased workflow has potentially identified an incorrect experimental structure simply through its absence from the CSP landscape.
We preface our analysis by emphasising that the space group frequencies presented in Fig. 8 are those for the CSP landscapes, i.e. only those of structures generated using an asymmetric unit containing a complete molecule (in these single-species systems, Z′ = 1) and without detecting and assigning additional symmetry after the minimisation. In contrast, those of the observed crystal structures in the CSD are the full, maximal-symmetry space group (allowing fractions of molecules in the asymmetric unit, i.e. Z′ ≤ 1). Hence the space groups enumerated for the CSP set can be thought of as “lower bounds” on the symmetry – higher-symmetry space groups may be assignable if the molecules present internal symmetry.
Fig. 8 The relative frequencies of space groups of crystal structures. Grey (top left) are those of rigid molecules in the CSD (as in Fig. 3), while the rest are obtained from CSP landscapes at the global density maximum (purple, top right), at the global energy minimum (blue, bottom left), and within 7.2 kJ mol−1 of the global energy minimum (orange, bottom right). The ordering of space groups on the x-axis is chosen to match that in Fig. 3. Only the 20 most common space groups for rigid molecules are presented for clarity. |
The global lattice energy minimum for each molecule is the energetically preferred packing; the distribution of space groups among the global minima structures are highly consistent with the observed statistics – perhaps as expected, as CSP has been demonstrated in this work to often identify observed structures of these molecules as global minima.
Space group frequencies are often explained by close packing arguments: the commonly observed space groups have combinations of symmetry elements that facilitate close packing of irregular shapes.64 Examining the space group distribution amongst the densest predicted crystal structure for each molecule gives a similar distribution to that from global energy minima; minimising energy and maximising density lead to similar space group preferences. However, there are differences after the three most popular space groups: the next few have almost equal frequencies among high density structures, suggesting that they are equally good at promoting close packing and that observed differences between these space groups relate to subtler influences of symmetry on lattice energy.
Considering hypothetical structures higher on the CSP energy landscape (up to the usual energy limit of polymorphism, 7.2 kJ mol−1, in the bottom right panel of Fig. 8), we see a further flattening of the distribution, and an over-representation of space group 15 (C2/c) compared to global energy minima structures. These changes in distribution with energy, which we do not examine in deeper detail here, are only available from access to complete energy landscapes and are relevant in the discovery of high energy, metastable materials, which have sometimes been observed to have attractive properties.24,65 Thus, we feel that large datasets of CSP landscapes hold potential for generalising our understanding of symmetry preferences in molecular crystals.
In Fig. 9, we show the difference in stability and density between the energy minimum across all Sohncke (i.e. enantiopure) space groups and the minimum across enantiogenic (i.e. racemic) space groups for the 356 molecules in our set containing at least one chiral centre. This energy difference represents the propensity for spontaneous resolution of a racemic solution compared to a racemic crystal of both enantiomers.
In general, there is a slight but consistent lattice energy penalty to enantiopurity – the average difference in lattice energy between the enantiopure global minimum and the racemic global minimum is 2.7 kJ mol−1, favouring the racemate. Of the 356 chiral molecules, racemic crystallisation is preferred in 86% (305) of molecules and spontaneous resolution is predicted to occur for approximately 14% (51) of molecules, although those with small lattice energy differences could be influenced by thermal contributions that are not included here. Moreover, this is accompanied (or perhaps driven) by improved packing in the racemate – on average, optimal enantiopure structures are 2.3% less dense than optimal racemic structures, consistent with the empirical Wallach's rule.67,68
The performance of the resulting model on the test set shows remarkably low error (Fig. 10), returning an MAE of just 0.93 kJ mol−1. Moreover, a similarly low MAE of 1.57 kJ mol−1 is achieved on the extrapolation test set, which contained crystal structures of compounds not included in the training of the correction. Compared to the errors for the baseline FIT+DMA, which returned MAEs of 7.80 and 7.95 kJ mol−1 on the test set and extrapolation set respectively, the correction offers a marked improvement in accuracy. The fact that the errors for the correction are slightly higher on the extrapolation set shows that there are limitations to the transferability of the correction. We expect that better transferability can be achieved as we increase the number of CSP landscapes available for training; the generation of additional landscapes can be targeted to weaknesses in the underlying force field and molecular types where the current machine learned correction has large errors.
Although the performance of the correction on the test sets is encouraging, an important question is whether the improved energies are significant enough to produce improved stability rankings of organic crystals. To evaluate the influence on ranking, we applied the correction to all of the CSP landscapes, re-ranking all the predicted structures based on the corrected energies. Thereafter, we compared the ranking of the matches to the experimental structures in terms of both their ranking and their energy above the global minimum with that found using FIT+DMA. Structures with predicted uncertainties greater than 25 kJ mol−1 were omitted from this analysis. This amounted to 257 structures (out of 3.9 million total) with examination indicating the structures were predominately high energy structures containing voids.
The resulting distributions (Fig. 11) over both the training and extrapolation compounds illustrate clearly that the corrected energies in general improve the rankings of the experimental structures. For instance, the correction results in an increase from 424 (FIT+DMA) to 501 (FIT+DMA+Δ-ML) of the observed structures ranked as global energy minima, and an increase from 767 (FIT+DMA) to 839 (FIT+DMA+Δ-ML) within 2 kJ mol−1 of the global energy minimum (the approximate limit of vibrational contributions to free energy differences). These improvements are observed in the CSP landscapes of the extrapolation and training molecules (see ESI†). Importantly, the correction also greatly improves the worst ranked experimental matches with the proportion ranked above 35 in energy ranking decreasing by over a third, from 69 (FIT+DMA) to 45 (FIT+DMA+Δ-ML) (see ESI†).
Fig. 11 Histogram of the relative energies of matches to the experimentally-determined crystal structures over all CSPs before (FIT+DMA) and after (FIT+DMA+Δ-ML) applying the data-derived B86bPBE+XDM lattice energy correction. Separate distributions for molecules from the training and extrapolation sets are shown in the ESI.† |
The FIT+DMA+Δ-ML model is not suitable for this task because it is only a correction to the intermolecular contribution to the lattice energy. Therefore, using a dataset derived from the CSP structures selected for the energy correction model, with perturbed atomic coordinates to sample conformational degrees of freedom, we trained a total energy MACE (higher order equivariant message passing neural network) model. Full details are provided in the ESI.† The trained MACE model was then applied to geometry optimise the predicted structures of 15 compounds with large differences in molecular geometry between the observed crystal structure and the DFT-optimised molecule (see Fig. 1, bottom three rows).
Re-optimisation with the trained MACE model yielded considerable improvements in the geometric agreement of predicted structures with experiment and of their energy ranking on the CSP landscapes (Table 2). RMSD30 between experimental and predicted structures decreased upon re-optimisation for 14 of the 15 compounds, by up to 1.4 Å, and moved the match to the observed structure closer to the global energy minimum in all but 2 cases, with 7 becoming the global energy minimum structure. Fig. 12 illustrates the improved geometric agreement for one of these molecules.
Crystal structure | FIT+DMA | MACE | ||
---|---|---|---|---|
RMSD30 (Å) | ΔE (kJ mol−1) | RMSD30 (Å) | ΔE (kJ mol−1) | |
ATCDEO | 0.341 | 7.26 | 0.125 (−0.216) | 6.67 (−0.59) |
BIRTUR | 0.396 | 6.06 | 0.391 (−0.005) | 0.02 (−6.03) |
BIXKUP | 0.813 | 16.32 | 0.208 (−0.605) | 0.00 (−16.32) |
BIXLOK | 1.048 | 4.95 | 0.265 (−0.783) | 0.00 (−4.95) |
DALBIC | 0.343 | 1.45 | 0.206 (−0.137) | 3.70 (+2.25) |
DEBFOI | 1.208 | 4.18 | 0.110 (−1.098) | 0.00 (−4.18) |
DOBYOJ | 0.654 | 1.27 | 0.640 (−0.014) | 3.55 (+2.28) |
JASGIT | 1.837 | 15.77 | 1.776 (−0.061) | 9.72 (−6.05) |
KOGFER | 0.247 | 0.37 | 0.248 (+0.001) | 0.00 (−0.37) |
QEKQIG | 0.824 | 4.84 | 0.139 (−0.684) | 0.00 (−4.84) |
TAVTUH | 1.751 | 12.03 | 0.325 (−1.426) | 1.87 (−10.16) |
TUNWUV | 1.103 | 9.55 | 0.399 (−0.704) | 0.00 (−9.55) |
UDEXUZ | 0.725 | 7.41 | 0.216 (−0.509) | 0.75 (−6.66) |
WACYAB | 0.721 | 9.81 | 0.216 (−0.505) | 0.92 (−8.89) |
WACYEF | 0.975 | 14.02 | 0.200 (−0.775) | 0.00 (−14.02) |
Re-optimisation of CSP structures was also run for 4,8b-dihydropyrrolo[3,4-b]indole-1,3(2H,3aH)-dione, where no match to the experimental crystal structure (CSD reference code BUVGAC) was identified in the CSP. However, no match was identified after re-optimisation with the MACE model; in this case, molecular flexibility would have been required during structure generation, rather than post-CSP re-optimisation. Now that a transferable MACE model has been trained, it could potentially be implemented earlier in the CSP workflow.
It should be noted though that the MACE model is a semi-local model and so neglects long-range interactions that can be important for highly accurate relative energies. Overall, the results of the unconstrained optimisations confirm that the high relative energies of these crystal structure matches estimated by FIT+DMA+Δ-ML are largely a result of limitations with the rigid-body lattice energy minimised geometries. It is worth reiterating that the limitations of the CSP geometries only affected a small number of structures; in the vast majority of cases, as shown earlier, the predicted structures at the FIT+DMA level achieved high quality matches to the experimentally-determined crystal structures. Undoubtedly, the success of the FIT+DMA+Δ-ML energy correction is a reflection of the performance of the baseline empirical force field.
We have assessed the quality of the dataset by evaluating its reliability at predicting the known crystal structures of these molecules, both in terms of the quality of the geometric match of the crystal structures and the resulting energy ranking of the experimental forms on their respective landscapes compared to other hypothetical structures. Our CSP approach is overwhelmingly successful at predicting crystal structures of these simple molecules – over 99% of all experimental structures have a match located in our CSP searches. 41% of experimental structures are predicted to be the global energy minimum on their landscapes, and 74% are found within 2 kJ mol−1 of this minimum, a margin equivalent to the estimated error introduced by ignoring thermal effects and ranking based solely on static, 0 K lattice energies. Geometrically, the typical discrepancy between experimental structures and their closest predicted matches is comparable to the thermal fluctuations in experimental crystal structure solutions of the same solid form obtained at low temperatures versus ambient conditions. Such remarkable performance demonstrates the consistency and accuracy of our chosen methods for optimising and ranking these structures.
Such a large dataset of many possible crystal packings should prove a valuable resource for identifying a variety of crystal packing trends, and we make this data available to the community as part of this work. Herein, we studied space group distributions among low-energy hypothetical structures compared to those observed in the CSD, and find substantial overlap, particularly among the most common space groups. We also demonstrated the potential of this varied dataset to explore chirality in the organic solid state, finding very good agreement with established empirical rules concerning the propensity of racemic mixtures to crystallise in racemic crystal structures as opposed to separate enantiopure crystals.
Additionally, we have shown the power of such large-scale CSP to train transferable machine learned potentials for organic solid-state systems. A committee neural network potential trained on single-point periodic DFT lattice energies achieved excellent accuracy in correcting our force field energy landscapes to the DFT level, reducing energy errors by approximately 8 fold. While the NNP performance is reduced on molecules reserved as an extrapolation set compared to those seen in training, the potential still demonstrates improvements to the quality of the resulting CSP rankings increasing the number of experimental structures ranked as the global energy minimum by 18% overall. We further demonstrated the development of a transferable MACE potential using structures derived from the CSP landscapes to allow re-optimisation of crystal structures, testing it successfully on those molecules in our set where the molecular geometry distortion between the conformation in the known crystal structure and the gas-phase-optimised one was largest. The results showed improved structural agreement with experimental structures in almost all cases and much improved energy rankings, moving several poorly-ranked observed structures to the global energy minimum.
Using these mature, well-tested CSP methods alongside modern machine learning approaches, we have demonstrated the ability of CSP to create very large, diverse datasets of hypothetical crystal structures, and the utility of this information in both understanding broad trends in organic crystal structures and in training more advanced energetic models for refinement and transferability. It is our hope that the variety and quantity of CSP data presented here, alongside our demonstrations of possible applications, enables the greater organic solid-state computational community to develop even more sophisticated models and techniques in pursuit of truly predictive, computational data-driven discovery of novel materials.
Footnotes |
† Electronic supplementary information (ESI) available: Computational details of CSP and ML calculations, summary of molecule set including CSD refcodes, and summary of CSP matches to experiment. See DOI: https://doi.org/10.1039/d4fd00105b |
‡ Complete CSP landscapes and ML models are available from https://doi.org/10.5258/SOTON/D3094. |
This journal is © The Royal Society of Chemistry 2024 |