Open Access Article
Andrea
Anelli
*a,
Hanno
Dietrich
b,
Philipp
Ectors
c,
Frank
Stowasser
a,
Tristan
Bereau
d,
Marcus
Neumann
b and
Joost
van den Ende
a
aRoche Pharma Research and Early Development, Therapeutic Modalities, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland. E-mail: andrea.anelli@roche.com
bAvant-garde Materials Simulation Deutschland GmbH, Merzhausen, Germany
cPharma Technical Development, F. Hoffmann-La Roche Ltd., Basel, Switzerland
dInstitute for Theoretical Physics, Heidelberg University, Heidelberg, Germany
First published on 4th October 2024
We accelerate a key step in crystal structure prediction (CSP) using machine learning and report its robustness on a wide array of pharmaceutical molecules. The speedup achieved by our scheme allows for a scale-up in both the number of candidate drug molecules studied and the level of theory employed in their treatment, paving the way for tackling more complex crystal energy landscapes.
Successful CSP requires accurate energies and a thorough sampling of the crystal energy landscape. Traditionally, this means striking a trade-off between the accuracy needed to capture the relevant physical interactions and efficiency in being able to scan large pools of candidate structures.5,6 For this reason, CSP is often carried out in a hierarchical fashion: starting from less accurate though fast force field measures of stabilities, one finds a promising subset of configurations on which more accurate calculations are performed. Since the seminal paper of Gavezzotti in 1994 (ref. 7) many developments around CSP have taken place as is, for example, reflected by the Sixth Blind Test on Organic Crystal Structure Prediction Methods8 and its predecessors. An essential step to enable robust and efficient1 generation of crystal packings has been the introduction of tailor-made force fields (TMFF).9 To gain precision in ranking the generated packings, the usage of dispersion corrected density functional theory (DFT-D) has been proven to be successful.10 A more recent advance has been the incorporation of hybrid functionals, higher order dispersion corrections and approximations to the vibrational free energy5 in order to estimate the effect of temperature on the relative stabilities of polymorphic forms. A correction of the conformational energy at the post-DFT level has been shown to be important in the highly polymorphic ROY case.11 All these improvements have recently been combined and assessed against a carefully selected experimental benchmark.12 While the accuracy of theoretical frameworks has increased, so has the computational footprint needed to leverage them. Atomistic machine learning has shown promise in accelerating the predictions of the stabilities of atomic13–15 and molecular16 materials at various levels of theory,17–19 though, to the best of our knowledge, its impact on the efficiency of CSP applied to pharmaceutical compounds has not yet been reported.
In this communication, we present a scheme to integrate an atomistic machine learning model into an archetypical CSP workflow which bridges the generative step and the ab initio ranking process. By constructing an ML-reranker we enable a more efficient and modular (e.g. by substitution of different acquisition functions or ML models) screening procedure. The core of our approach is a regression model mapping the generated force field minima structures to the energies they would obtain upon minimization on a reference ab initio potential energy surface. This minima predictor is built as a delta model baselined on force field energies and contains a measure of confidence indicating how likely the structures are to fall close to the target prediction.
We show that our approach accurately selects all the configurations which populate the low-energy DFT minima basins, and provides a reliable uncertainty estimate. In our approach, we focus on atom-centered representations. This characteristic allows our model to express confident predictions on a size extensive basis, making it reliable across the broad structural space of the search, and increasing Z′ numbers, irrespective of the cell sizes.
In order to produce the crystal structure landscapes for our case studies, we use the GRACE 2.7.49 (ref. 20) software package. Initially, a tailor-made force field (TMFF) is constructed and is used to perform a thorough scan of the structure's crystalline configurations, acting as our generative step. The parameterization of each TMFF is carried out independently for each drug molecule, so as to adapt to the different molecular conformations, bonded and non-bonded interactions governing the system of choice. The CSP takes place by using the most commonly occurring space groups populating Z′ = 1, 2. Following the generative step, we have access to a force field energy ranking of the different crystal structures. To refine the CSP results, we perform geometry optimizations on the number of structures needed to sample the population contained within a target energy window Ewindow. The reranking of low energy TMFF structures is obtained by performing PBE–DFT energy minimizations using the Neumann Perrin dispersion correction and the software package VASP21,22 for the calculation of DFT energies and forces. The ranking obtained by this last set of optimizations is considered as the reranked landscape. This procedure aims at capturing the relevant minima of the crystal energy landscape and can be followed by a finite temperature correction step to account for entropic contributions and further ab initio minimizations at higher levels of theory. The objective of this work focuses on the reranking of the zero kelvin landscape, and as a result, no finite temperature sampling is considered in this manuscript.
To construct a predictor of the lattice energies we train a ridge regression model on average SOAP23 power spectra per TMFF crystal packing. We chose this representation because of its wide success in the field of molecular crystal property predictions and transparency of its feature space.24,25 The advantage of such a simple scheme is the absence of any hyperparameter besides the Tikhonov regularization term, which is optimized with an 8-fold cross-validation split at each training step. We additionally use a delta learning scheme,26 predicting the differences between the TMFF and the DFT energies, to reduce the variance of the target property. Finally, by constructing a committee of models following Imbalzano et al.,27 we introduce a confidence interval into each prediction. For details on the implementation of the ML model, we refer to the ESI† in Sections S1 and S2. The machine learning model is used to guide the selection of the structures that exhibit promise of being low in energy upon an ab initio minimization. After completing the generation step in the CSP routine, we follow an active learning approach to iteratively train a minima-to-minima machine learning model mapping the energy of the TMFF minima to their corresponding DFT minimized counterparts. The scheme follows the workflow proposed in Fig. 1. To initialize our model, we first minimize the number of candidate structures, Ninit, covering the generated space uniformly using a farthest point sampling algorithm,28 and training a committee of ridge regression models on them Fig. 1a. We use this ensemble to predict the energy change the structures would undergo after a DFT minimization. The energy predicted is the mean of the ensemble distribution and the estimated uncertainty is its standard deviation.
The model is trained on the first 80% of the sampled structures, while the remaining 20% is used to evaluate its performance. The error distribution and accuracy of the trained models can be visualized independently as shown in Fig. 1b. To select a new configuration, we leverage an exploitation–exploration selection criteria following the expected improvement function proposed in ref. 29, and apply it to perform a batch selection of Nbatch points at every iteration from the generation pool. The details on the implementation of the batch selectors are provided in ESI† Sections S2 and S3. Finally, we sum all the probabilities of each configuration of having energy E and estimate the probability of having left in the generation set (i.e. not sampled yet) a configuration with an energy lower than Ebest + Ewindow, with Ebest being the current lowest energy DFT structures, and Ewindow the target energy window to sample. To estimate this probability, we calculate the integral of the not-yet-sampled generation set's energy cumulative probability distribution until the desired window, as shown in Fig. 1c. This number provides us with the expected number of configurations left which could be below the window: Nleft. We thus expect a total number of configurations defined as Nexpected = Nleft + Nfound, where Nfound is simply the number of structures found so far with DFT energies within the desired window. We introduce a convergence probability as pconv = Nfound/Nexpected. If the probability is higher than a desired confidence pthreshold, we consider our TMFF basin exhausted, and the corresponding DFT landscape satisfactorily surveyed. A more detailed description of the convergence criteria is provided in ESI† S3.
The first molecule of our study is fentanyl. The proposed compound has 54 atoms and consists of 6 rotatable bonds, four chemical species (H, C, N, and O), two rings, and a molecular weight of 336.5 g mol−1. The central plot of Fig. 2 shows the energy-density scatter plot for the reference CSP carried out with GRACE (in mint green). The low energy structure appearing as a ground state is competing with several other configurations within the target energy window, indicating a rich crystal landscape. The CSP workflow as implemented in GRACE found 10
342 low energy packings using a TMFF with a convergence threshold of 0.99 and 0.95 for structures having one (Z′ = 1) and two (Z′ = 2) independent molecules in their asymmetric units, respectively. The unpublished standard GRACE reranking algorithm converged the sampling of a window of 1 kcal mol−1 by optimizing a total of 2345 structures with PBE–DFT. To showcase the effectiveness of our minima to minima mapping, in the central plot in Fig. 2 we report the crystal structure landscape obtained by the 16th step of our approach, compared to the one obtained by a standard GRACE reranking. The comparison is performed by coloring each configuration found on the GRACE landscape depending on whether its originating force field configuration had been sampled by the ML-reranker or not. Circles in mint green indicate agreement between the two methods, while circles in light grey show a structure that only GRACE found. Notably, within this exercise, the ML-reranker did not find one configuration that GRACE did not sample – showcasing the robustness of the sampling implemented in GRACE. The two reranking datasets and the generation pool are available on Zenodo.30
The ML-reranker was initialized using Ninit = 100 structures extracted from the TMFF generation pool, and using Nbatch = 105 for each iteration. The exploitation energy window was set to Ewindow = 1 kcal mol−1 and target residual probability to pthreshold = 0.99. Both GRACE and the proposed approach capture the ground state crystal structure, and sample in a comparable way the target energy window. The improvement of the model and its bias in selecting configurations exhibiting low energy is shown in the right scatter plot in Fig. 2, where a correlation plot between the TMFF sampled energies and their ML predictions is shown. The correlation plot shows low errors for points below the energy window – mostly lying on the diagonal (black points showing a zero error correspond to structures that have been sampled, thus having zero uncertainty in their predicted value). Out of 174 structures in the target energy window, only one has not been found by the ML-reranker in agreement with the value of pthreshold configured to select 99% of the structures within the desired window contained in the initial generation set. When looking at the results from a data-efficiency perspective, this translates to 845 DFT minimization spared, corresponding to 34% savings on the reranking computational cost. From a wall-clock perspective, based on an estimated average optimization time per structure, this amounts to roughly 3000 node-hours using 96 cores.
To obtain a picture of the general applicability of our approach, we report the results on 17 compounds from Roche's portfolio. For each of these molecules, a complete reranking of TMFF structures using the standard GRACE reranker resulted in a subset of reranked configurations. These selections contain configurations spanning up to 10 kcal mol−1 above the ground state and constitute a smaller yet challenging dataset. Differently from the previous exercise, we will test the efficiency of our approach in extracting the lower energy structures with respect to their reranked sets. While this example is not directly predictive of the performance of the model on a realistic reranking scenario (as the reranked landscapes are smaller and more homogeneous), we show in Table 1 how this defines an upper bound for the performance gain that can be expected from this novel approach within this simplified context. The extended compound selection exercise serves the purpose of showing the applicability of our reranking approach to a wide variety of complex chemical spaces, with an average molecular weight of 500 g mol−1, 7 rotatable bonds, and 6 rings. The summary of pharmacological metrics is shown in Fig. S4 in the ESI.† The performance of the ML-reranker is consistent with what we have observed in the fentanyl case. We remark that this metric serves the purpose of showing the ML-reranker robustness in sparse data regimes but does not constitute a fair comparison with GRACE's standard algorithm. GRACE had to select from all the TMFF crystal structures whereas the ML algorithm only selects from a smaller set of structures that were initially chosen by GRACE in the standard reranking and then replaced by their corresponding TMFFs for the computational exercise presented here. The parameters used across the API benchmark are kept constant at Ninit = 100, Nbatch = 100, Ewindow = 1 kcal mol−1 and residual probability of pthreshold = 0.99. A summary of the results obtained across the whole set is shown in Table 1. This analysis shows how reliable the estimation of the convergence probability is throughout the set of different molecules. By targeting a 99% convergence across a total of 475 structures within all the windows, the ML-reranker fails to select 2 packings corresponding to an observed convergence probability of 99.6%. Further, our method reliably accelerates the sampling across the complex chemical space surveyed, as shown in the forward exercise – and the similar performances observed in the reranked dataset. Thirdly, we demonstrate how a sub 1 kcal mol−1 ML energy model can be trained efficiently to obtain reductions of minimization rounds at no additional cost.
| ID | N tot | N sampled | N below | N found | E min | RMSE |
|---|---|---|---|---|---|---|
| RO1 | 3149 | 1110 | 12 | 12 | — | 0.71 |
| RO2 | 905 | 706 | 26 | 26 | — | 0.83 |
| RO3 | 1186 | 403 | 3 | 3 | — | 0.93 |
| RO4 | 1606 | 504 | 3 | 3 | — | 0.85 |
| RO5 | 1796 | 1110 | 21 | 21 | — | 0.68 |
| RO6 | 1650 | 908 | 106 | 106 | — | 0.50 |
| RO7 | 2535 | 1413 | 8 | 8 | — | 0.88 |
| RO8 | 1678 | 908 | 5 | 5 | — | 0.66 |
| RO9 | 2307 | 1514 | 3 | 3 | — | 1.01 |
| RO10 | 856 | 807 | 7 | 7 | — | 0.62 |
| RO11 | 857 | 807 | 16 | 16 | — | 0.84 |
| RO12 | 1766 | 1009 | 9 | 9 | — | 0.85 |
| RO13 | 2852 | 1413 | 12 | 11 | 0.98 | 0.72 |
| RO14 | 1481 | 1110 | 5 | 5 | — | 0.86 |
| RO15 | 1405 | 1009 | 1 | 1 | — | 0.97 |
| RO16 | 1009 | 504 | 3 | 3 | — | 1.02 |
| RO17 | 5034 | 908 | 95 | 94 | 0.76 | 0.43 |
| FTN | 2523 | 1110 | 140 | 140 | — | 0.37 |
| Total | 34 595 |
17 253 |
475 | 473 | 0.76 | 0.76 |
Extending the range of CSP to larger and larger molecules requires constant improvements of the various components forming a complete CSP workflow. In this work, we have investigated the performance of an ML-reranker based on standard ML approaches as an alternative to the unpublished GRACE reranker based on statistical correlations and detailed domain knowledge. The results of this investigation prove that this strategy offers substantial improvements in the CSP reranking exercise without incurring any additional cost, due to a simple efficient re-use of the data explored during the landscape explorations. The ML-reranker is an appealing component for anyone who wants to construct a CSP workflow from scratch and offers potential improvements in already honed architectures. The ML-reranker (i) reduces the number of ab initio crystal structure optimizations by the on-the-fly improvement of an ML surrogate model and (ii) offers an iterative sampling scheme with robust convergence behavior determined by a single threshold parameter.
Footnote |
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4ce00752b |
| This journal is © The Royal Society of Chemistry 2024 |