Open Access Article
Chengxi Zhao†
*a,
Zhaojia Ma†bc,
Dingrui Fan†a,
Siyu Hubc,
Leping Wangbc,
Feng Huabc,
Weile Jiabc,
En Shao*bc,
Guangming Tan*bc,
Jun Jiang
*ad and
Linjiang Chen
*ae
aState Key Laboratory of Precision and Intelligent Chemistry, Hefei National Research Center for Physical Sciences at the Microscale, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, China. E-mail: Chengxi_zhao@ustc.edu.cn; jiangj1@ustc.edu.cn; linjiangchen@ustc.edu.cn
bState Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. E-mail: shaoen@ict.ac.cn; tgm@ict.ac.cn
cUniversity of Chinese Academy of Sciences, Beijing, China
dHefei National Laboratory, University of Science and Technology of China, Hefei, China
eSchool of Chemistry, School of Computer Science, University of Birmingham, Birmingham, UK
First published on 2nd April 2026
Molecular crystal structure prediction (CSP) faces a persistent computational bottleneck: it requires exhaustive sampling of vast packing landscapes while resolving energy differences of only a few kJ mol−1. We introduce BOMLIP-CSP, an open-source Python framework that integrates machine learning interatomic potentials (MLIPs) with a tailored batched optimization strategy, enabling rapid, unbiased structure prediction across the full crystal density range. By introducing tailored parallelism into modern MLIPs, BOMLIP-CSP achieves a ∼2.1–2.3× acceleration in large-scale CSP searches without compromising accuracy. In benchmarks covering 34 experimental structures from six CSP blind tests, more than 50% of the experimental crystal structures can be recovered using foundation MLIPs when the correct space group and Z′ are included in the search, with more than 70% of the experimental structures recovered by at least one of the tested MLIPs under the present benchmark conditions. Importantly, our results suggest that MLIPs with comparable equilibrium-energy accuracy can yield strikingly different CSP outcomes, indicating that predictive success may depend not only on local energy fidelity but also on how the MLIP energy surface is organised. Together, these results establish BOMLIP-CSP as a broadly accessible platform for accelerated CSP and provide new insight into the interplay between MLIP characteristics and crystal structure discovery.
CSP has attracted significant attention in materials science due to its importance and the challenges posed by its complex workflow and high accuracy requirements.9–12 It involves an extensive search for local minima on a vast crystal lattice energy landscape, followed by ranking candidate structures whose energy differences are often within only 1 to a few kJ mol−1.13–15 This extreme sensitivity imposes stringent accuracy requirements, while the vast configurational space demands high computational efficiency, thereby creating a persistent bottleneck that has long constrained the field. Furthermore, despite its potential and significance, organic molecular CSP remains dominated by proprietary software, with only a handful of open-source implementations available,9,11,16 limiting accessibility and reproducibility for many research groups worldwide.
The blind tests organized by the Cambridge Crystallographic Data Centre (CCDC) over the past two decades have witnessed remarkable progress in CSP capabilities.17–24 Recently, machine learning (ML) has emerged as a versatile tool for addressing high-dimensional chemistry problems.25,26 Beyond aiding decision-making across the CSP workflow,23 machine learning interatomic potentials (MLIPs), which capable of achieving near ab initio accuracy at computational costs comparable to classical force fields,27–29 offer a promising route for CSP tasks that demand both accuracy and efficiency. More recently, the emergence of foundation models30–32 has further transformed this landscape by reducing or even eliminating the need for time-consuming data generation, model training, and active learning cycles.
Several studies have begun exploring the use of MLIPs to perform prediction problems.33–35 Such research typically focuses on evaluating the accuracy of the overall CSP strategy, with the structures often first pre-optimized using other methods.16,36 While these tests are undoubtedly important, the community still lacks a clear assessment of how different MLIPs directly influence CSP outcomes when all other steps in the workflow are held constant. In particular, it remains unclear how these MLIPs perform when the initial trial structures—owing to the lack of prior knowledge of the experimental structure—are largely displaced from their local minima, a scenario that is more representative of practical structure prediction tasks. This also raises another critical issue: the need for a software framework that, tailored to the characteristics of MLIPs, can perform CSP calculations in a simple and fast manner.
Here, we have developed an open-source code, Batched Optimization with Machine Learning Interatomic Potentials Crystal Structure Prediction (BOMLIP-CSP). This code addresses the issue that CSP codes are largely dominated by commercial licenses, leaving researchers in many materials-focused groups with very few accessible options for performing CSP calculations. This code is designed to perform the crystal structure prediction workflow across the full range of crystal densities using MLIPs with ease. We developed a batched structural optimization module tailored to MLIP-based CSP calculations. In the present BOMLIP-CSP implementation, this module markedly accelerates computations and enhances efficiency in large-scale tasks, delivering an observed overall ∼2.1–2.3× speed-up. This strategy balances computational speed and accuracy, delivering excellent precision under fast calculations, with performance that can be flexibly tailored according to the chosen MLIPs. Two widely adopted foundation atomistic models have been preconfigured within the accelerated batch optimization module for immediate use, and the framework is readily extensible to other methods.
Furthermore, we conducted a benchmark study by applying different MLIPs to optimize crystal structures generated from the same trial structure database. The test set was based on 26 target systems from the first to the sixth CSP blind tests, along with their polymorphs sharing the same asymmetric unit found in the CCDC database yielding a total of 34 crystal structures. When the correct space group and Z′ are present in the search space, even relatively small MLIPs recover more than 50% of these experimental structures at least once, and in many cases the corresponding hits are ranked near the bottom of the energy list. Across the tested MLIPs, more than 70% of the experimental structures were recovered under the present benchmark conditions. Instead, the aim of this benchmark study is not to highlight the ultimate capability of BOMLIP-CSP or to serve as a definitive comparison of atomistic models, but to provide a systematic evaluation of the impact of different MLIPs on CSP under identical workflow conditions. Within this controlled setting, the results suggest that our framework can achieve a favourable balance between computational efficiency and predictive accuracy, and that judicious MLIP selection may help reduce the number of trial structures required in practice.
In our code, an initial coarse crystal structure with a large unit cell length was generated, with its unit cell parameters, the centroid of the asymmetric unit, and its orientation determined by a vector sampled from the Sobol sequence. Then, the space group symmetry operations were applied to the asymmetric unit to construct the complete crystal structure. After that, each unit cell length was gradually reduced until the intermolecular distances within the cell dropped below a predefined threshold. To ensure structural reasonableness, all generated structures must satisfy three criteria, which are enforced or evaluated at different stages of the generation process. Collectively, these criteria constitute what we refer to as the triple-radius validation scheme (Fig. 1b).
First, a covalent radius39 factor of 1.15× was applied to ensure that each molecule is complete, with all covalent bonds correct. Second, van der Waals (vdW)40 radii scaled by 0.9× were used to identify and exclude structures with excessively close intermolecular contacts. Third, to eliminate physical unreal structures, a spherical radius equal to 1.5× the vdW radius of the largest atom was applied around each atom; a valid trial structure must form a three-dimensional framework within this criterion. This ensures that unphysical slab-like arrangements (Fig. S1), which either waste optimization time or lead to meaningless results, are excluded, while preserving potentially porous yet realistic structures that may correspond to experimentally observed crystals or be essential to a complete energy landscape. This process ensures that each generated trial crystal is of high structural quality, enabling efficient and comprehensive exploration of the energy landscape without wasting computational resources. Moreover, it reduces the risk of missing packing motifs due to sampling bias like porous structure. For example the well-known molecule T2,7 exhibit porous structures that are challenging to predict using density-based close-packing generation methods.
Traditional parallelization approaches, such as multiprocessing, handling individual structures concurrently, suffer from inherent inefficiencies. In this paper, we present a high-performance CSP relaxation framework specifically designed to integrate multiple mainstream MLIP backends efficiently. The core innovation of our framework is performing fine-grained batch processing strategies by grouping multiple distinct candidate structures into a computing batch. Thus, the neighbour list updating, the quasi-Newton method optimizing, and the energy/force evaluation can be performed simultaneously on all candidates within a batch using tensor operations. This leverages the massive parallel processing capabilities of GPUs and other accelerators most effectively. Besides, our framework employs an intelligent continuous batching scheduling strategy to maintain a global compute pool dynamically. If structures within a batch finish the relaxation early, which means these structures successfully converge to their minimum-energy, the scheduler immediately replaces and pushes unprocessed candidates (until the target atom count threshold is reached) to saturate the utilization of computational resources and maximize the relaxation throughput (see supplementary methods algorithms in SI for details). Our framework demonstrates performance gains in practice, enabling the high-throughput relaxation process for large-scale CSP.
To address this limitation, the developed batched structural optimization method for MLIPs consists of three components: (1) a batch algorithm based on the quasi-Newton method; (2) a batched neighbour list update algorithm compatible with periodic boundary conditions; and (3) a structure optimization scheduling algorithm. The details of these methods are provided in the SI method.
(2), P21 (4), C2/c (15), Pna21 (33), Cc (9), Pca21 (29), C2 (5), P42 (77). The energy landscape, constructed from all optimized crystal structures, is shown in Fig. S2 and is consistent with previously reported energy landscapes. All experimentally observed polymorphs were successfully predicted in this test and are located at energy spikes, a characteristic commonly associated with experimentally accessible porous structures.5,7 No physically unrealistic structures were identified in the resulting landscape. This indicates that the strategy effectively filters out unphysical structures without excluding energetically reasonable porous structures.
The summary of the speedup effect is shown in Table 1. This CSP process across 11 space group ranging in size from 46 to 368 atoms, achieving speedups of 1.95–2.15×, with batch optimization delivering benefits across all problem sizes. Subsequently, we performed an end-to-end test by combining all six datasets in the table, achieving an average speedup of approximately 2.08×. We further evaluated more common space groups with different atom counts to monitor the speedup factor of the batched optimization method. This additional test included datasets ranging in size from 46 to 736 atoms, yielding speedups of 1.78–2.57× and an average speedup of 2.31× (Table S1).
| Number of atoms | Number of crystals | Baseline (s) | Batchopt (s) | Speed-up factor |
|---|---|---|---|---|
| 92 | 1000 | 550.71 | 259.01 | 2.13 |
| 184 | 3500 | 3786.80 | 1764.76 | 2.15 |
| 368 | 1000 | 3355.27 | 1723.71 | 1.95 |
| Mixed | 5500 | 7976.74 | 3835.32 | 2.08 |
Four fundamental challenges were selected to evaluate the performance of each MLIP in the CSP process of a given system. These challenges serve both as a benchmark to compare the effectiveness of different MLIPs within our workflow, and as a case study of applying MLIPs to the specific task of CSP. To the best of our knowledge, benchmarking studies involving a large number of structures far from equilibrium with the same chemical composition and similar energies generated during CSP have not been widely investigated.
These four evaluation criteria include the following: the first criterion is whether the experimental structure could remain stable during optimization. A reliable computational method should be able to preserve the packing mode of the experimental structure; otherwise, accurate crystal structure prediction would not be possible, regardless of how well other parts of the workflow perform. Second, the number of times the experimental structure is successfully predicted—referred to as the hit count—within a fixed number of trial structure optimizations. It directly reflects whether the CSP process has been successful. At the same time, a higher hit count indicates that the algorithm is more likely to locate the experimental structure at an early stage. In searches with a fixed total number of generated trial structures, achieving a higher hit count reduces the risk of missing the experimental structure,6,7,41,43 whereas monitoring search convergence can also be effective but is more complex. The third criterion is the relative lattice energy of the experimental structures, and the fourth is the energy ranking of the experimental structures after duplicate removal. Both criteria evaluate the accuracy of the potential energy calculations and reflect the likelihood of identifying plausible structures through theoretical prediction before experimental validation. Although exceptions exist,5,7 in most close-packed systems the experimental structures or polymorphs lie near the global minimum in lattice energy, typically ranking among the lowest in the energy ordering when evaluated with high-accuracy methods.15,24 Moreover, previous studies have shown that the energy differences between most polymorphs are usually within a few kJ mol−1.13,14
As shown in Fig. 3 (a detailed summary of all the results is provided in Table S2), most experimental crystal structures were successfully stabilized by the MLIPs during geometry optimization, except for two out of the 34 structures that could not be stabilized by any model, see Fig. 3a. These are system III, which contains a boron atom, and system XXII, which may be challenging due to its rare composition without any hydrogen atoms. Each foundation atomistic models was able to stabilize over 85% of the experimental structures (29/34 and 30/34, respectively), with certain systems being stabilized by only one of the two models.
The frequency with which the experimental structure is predicted is a critical criterion, as the energy ranking of candidate structures can be further refined in subsequent calculations using high-accuracy methods such as DFT. However, DFT is unlikely to recover a packing arrangement absent from the initial set of structures generated by the MLIP. In the test involving 2500 structures generated from up to four conformers within the experimental space group, both MLIPs achieved a success rate exceeding 50% in predicting the experimental structure (Fig. 3b). Thirteen experimental structures were predicted by both MLIPs, while twenty-four were predicted by at least one. Eleven experimental structures were predicted by only one of the two MLIPs; in certain cases, despite being stabilizable by both and optimized from exactly the same set of generated trial structures, an experimental structure was recovered by only one model. This demonstrates that the choice of MLIP can affect prediction outcomes. For smaller molecules, the experimental structure hit count is comparatively higher than that of larger molecules due to the significantly smaller packing space. The number of successful predictions for co-crystals as well as flexible molecules with more rotatable bonds is comparatively lower, which is consistent with conventional understanding in CSP. These systems typically require a larger number of trial structures to be adequately sampled.
The relative lattice energy and energy ranking of the experimental structures for each system are shown in Fig. 3c and d. Nearly all experimental structures lie within 25 kJ mol−1 of the global minimum. For MACE–OFF–small, the average relative energy is 8.40 kJ mol−1, with 20 structures below 10 kJ mol; the highest value is 60.11 kJ mol−1 (system XVII), followed by 18.33 kJ mol−1. For SevenNet-0-D3, the average relative energy is 9.28 kJ mol−1, with 15 structures below 10 kJ mol; the highest value is 23.53 kJ mol−1 (system XXIII). For the energy-ranking criterion, 24 and 22 experimental structures rank within the top 100 for MACE–OFF–small and SevenNet-0-D3, respectively. Moreover, 17 and 7 structures rank within the top 5 for the two models. Noted that experimentally observed structures are not guaranteed to appear in the top 5 of the potential-energy ranking, as polymorphism and free-energy corrections under specific external conditions can substantially alter their relative stability.
Overall, these performances are relatively solid in exploring the energy landscape of CSP, and they can also serve as a key component of the structure generation stage in a hierarchical strategy, which is widely used15,44,45 and typically follows with DFT or higher-accuracy calculations for pre-experimental prediction.
Guided by the Exp. Stability metric (Fig. 3a and Table S2), failures can be broadly attributed to two scenarios. First, when the experimental structure is stable under a given MLIP but no trial structures sufficiently close to the experimental packing are generated, the failure mainly reflects a limited number of trial structures. Second, when the experimental structure is not stable under that MLIP, the failure is primarily due to deficiencies in the MLIP potential-energy surface rather than the trial-structure generation stage. It should be noted that this does not represent the full capability of our CSP strategy, as more experimental structures could be predicted with a larger number of trial structures or by using other models. For example, in our additional tests, BOQQUT01 and HAMTIZ01 could be successfully predicted using MACE–OFF–small if a larger number of trial structures were generated, while XATMIP, XATMOV, SOXLEX01, OBEQET, XATJOT, and NACJAF could be recovered when using MACE–OFF–large.
System VI represents a relatively straightforward case in which the experimental structure is correctly predicted by both MLIPs, with its energy ranking reproduced within the top three. The SOAP-space distributions generated by the two models are largely consistent, with only minor differences in the coverage of certain regions. An additional UMAP plot, color-coded by crystal density, is provided in Fig. S3. Crystal packing similarity tests for several pairs of nearby points in the projection are shown in the inset of Fig. 4, where the experimental structures predicted by both models appear close to each other. Furthermore, we evaluated how many structures among these are commonly predicted by both models. To evaluate structural overlap, we examined the 200 lowest-energy predictions from both models and found 37 shared structures, each achieving a 15/15 match (see the Methods section for the definition of a match) in the crystal packing similarity test. Other nearby pairs may also exhibit similar packing modes, and the absence of a full 15/15 match in some cases is primarily attributable to slight differences in the predicted unit cell lengths and volumes, as illustrated in the plot.
In contrast, system VII presents a more complex case. Although the experimental structure is the global minimum for both atomistic models, one predicts it ten times while the other fails entirely, despite using the same initial trial structures. Dimensionality-reduction reveals that the experimental structure and several points from SevenNet-0-D3 overlap with the MACE–OFF–small cloud, but most SevenNet-0-D3 points are clearly separated—showing that the two models explore distinct regions of structural space. In SevenNet-0-D3, a packing mode close to the experimental form was repeatedly found, but propane conformers were flipped (Fig. S4), producing a local minimum that blocked further optimization. Among the 200 lowest-energy predictions, 111 were shared by both models, reflecting the smaller packing space of small molecules. The packing similarity test and the SOAP descriptor probe different aspects—the former global packing and the latter local atomic environments—making them complementary rather than contradictory in analyzing the results.
As illustrated in the UMAP embedding of the SOAP feature space for all models (Fig. 5b), distinct patterns emerge in the low-energy regions across models. Nevertheless, all MLIPs are capable of stabilizing the experimental structure, except for the MACE–OFF–small model, which failed to predict it. This failure may be due to its comparatively less accurate energy predictions for this system, as reflected by the higher energy ranking assigned to the experimental structure. Larger models such as Eqnorm, ORB,51 and MatRIS exhibit superior performance for this small system—not only in terms of the relative lattice energy of the experimental structure but also in the frequency of successful predictions—despite requiring substantially longer computation times to complete the CSP process compared to the other two models. The energy rankings of the experimental structures obtained by Eqnorm, ORB, and MatRIS are 1, 1, and 2, respectively, indicating excellent performance in crystal structure prediction tasks. Fig. S5 shows a packing similarity analysis of structures optimized by different MLIPs, originating from seed IDs for which at least one MLIP successfully recovers the experimental structure.
System XXI represents another challenging case. It contains both 3,4,5-trihydroxybenzoic acid monohydrate and an additional water molecule, creating numerous possibilities for diverse hydrogen-bonding networks. The high flexibility of the hydroxyl groups further increases the complexity of the intermolecular packing landscape. As a result, this system exhibits three known polymorphs with the same asymmetric unit composition (XXI-1, XXI-2, and XXI-3; CCDC refcodes: KONTIQ, KONTIQ01, KONTIQ03).
The dimensionality-reduction plots for the five foundation atomistic models also reveal distinct patterns. For MACE–OFF–small and MatRIS, more points are clustered around the experimentally observed structure (denoted by the star symbol) compared with the other models, and both exhibit superior performance in locating experimental structures, as shown in Table S3. The relative lattice energies and rankings of these three known polymorphs were evaluated using DFT as the reference in order to test whether the MLIPs can reproduce the DFT polymorph energy ordering. This level of theory provides a relatively reliable preliminary standard for energy ranking, although more accurate levels of electronic-structure calculation may lead to slight changes in the ordering.15 When we compared the MLIP results with the DFT rankings, we found that MACE–OFF–small—the only foundation atomistic model having predicted all experimental structures within 5000 trial structures—incorrectly identified XXI-3 as the most stable form, in complete contradiction to the DFT reference results. By contrast, the MatRIS model successfully predicted two of the experimental structures, and its relative energy predictions for all polymorphs are in good agreement with DFT. Surprisingly, some MLIPs larger than MACE–OFF–small failed to predict any, or only one or two, of the experimental structures. This may be because models with a larger number of parameters tend to produce a more rugged crystal lattice energy landscape, requiring a greater number of trial structures to adequately converge and sample the broader packing space.
Using the energy difference (ΔE) between two MLIPs as a proxy for energy accuracy (the closer the experimental structure's relative energy is to zero, the better), we conducted a sign-consistency test between energy advantage and hit advantage. Among systems allowing a fair comparison, 12 cases favored the lower-energy model with more hits, whereas 8 favored the higher-energy model—indicating only a weak association between energy accuracy and hit count. Surprisingly, some experimental structures were successfully predicted by one MLIP but completely missed by another (see Fig. S5), despite both being applied to the same initial trial structures and both being capable of stabilizing the experimental geometries. Moreover, the energy prediction accuracy of an MLIP does not necessarily translate into greater advantages in CSP, especially in the crystal-structure search stage. Considering that larger models can be an order of magnitude more time-consuming than smaller ones, model choice remains an important practical consideration, although developing a robust prospective selection criterion lies beyond the scope of the present work. This discrepancy may arise because, beyond accuracy near local minima, differences in the MLIP-predicted lattice energy surface for a given system can also play a complex role in determining the final CSP outcome. Although some foundation models are incompatible with certain elements, the three steps of BOMLIP-CSP—conformer search, trial structure generation, and optimization—are independent. This modular design allows the workflow to be adapted by replacing the foundation model when necessary, for example with a model that supports the relevant elements. Training a model for target systems of particular interest is also a feasible strategy, similar to the customized force fields used in traditional approaches, although the associated time cost needs to be carefully considered. In addition, proton-transfer events may occur when using MLIPs. However, this does not always act in a purely detrimental way: for some systems, the hydrogen positions in the crystal may indeed involve genuine proton transfer. A detailed investigation of such effects lies beyond the scope of the present work, and here we simply note their possible presence as an additional source of complexity. In addition, evaluating a larger number of systems would further strengthen this assessment, and this will be pursued in future work.
The results demonstrate that, when the correct space group and Z′ are included in the search, the framework can readily recover over 50% of the experimental crystal structures during structure generation. When MLIPs are chosen appropriately, the success rate exceeds 70% under the present benchmark conditions, with most of these experimental structures also appearing among the lower-energy candidates. This number could be even higher with more extensive sampling. Importantly, all these accuracies are obtained using relatively small MLIPs and they have been substantially accelerated.
The benchmark also shows that even with the same trial database and MLIPs stabilizing experimental structures, different MLIPs can yield very different CSP outcomes, underscoring that not only equilibrium accuracy but also the shape of the crystal lattice energy landscape and connectivity of energy basin are critical.
Looking ahead, our framework offers substantial flexibility for integrating MLIPs into CSP workflows—for example, enabling structure generation using different MLIPs while applying a unified energy ranking method. Such capabilities will accelerate the application of AI-driven approaches in chemical materials discovery, allowing computational chemistry to play a more autonomous role in solid materials, and addressing key challenges in the field of molecular materials science.
000 trial structures per target system. All structures were then subjected to relaxation using a batched structural optimization protocol consisting of two rounds: the first round was performed under an applied pressure of 0.1 GPa, followed by a second round of optimization with the pressure released. In our previous tests, this two-step strategy produced improved results without artificially eliminating porous structures due to the initial pressure. After optimization, a very small number of structures in certain systems exhibited clear molecular collapse and unphysically low energies (around −1 × 1015 to −1 × 1025 kJ mol−1), likely due to instabilities in the MLIP. Therefore, all structures with absolute energies greater than 1 × 1015 kJ mol−1 were removed. For the low-energy structures, a duplicate removal process was performed using the CCDC API. Starting from the structure with the lowest energy, each structure was sequentially examined and added to the unique group only if it was determined to be dissimilar to all existing members in unique group. To assess structural matching and identify duplicates, a cluster of 15 molecules was extracted from each structure, and two structures were considered duplicates if their clusters exhibited a 15/15 match in the CCDC packing similarity test with a distance tolerance of 0.2 and an angle tolerance of 20°, excluding hydrogen atoms from the comparison. The same CCDC packing-similarity criterion was used for matching predicted structures to the experimental reference and for duplicate identification.
A kinetic-energy cut-off of 600 eV was used to define the plane-wave basis set. Electronic Brillouin zone was integrated with the smallest allowed spacing between k-points (KSPACING) being 0.4 Å−1 and the generated grid is centred at the Γ-point. Convergence threshold of self-consistency was set to 10−6 eV during total energy and force calculation and the Hellmann–Feynman force convergence criterion on each atom was set to smaller than 0.01 eV Å−1 when geometry optimizations are needed.
Footnote |
| † These authors contributed equally: C. Z., Z. M., & D. F. |
| This journal is © The Royal Society of Chemistry 2026 |