Open Access Article
Inhyo Lee,a Hyeokjae Chae,a Jongwon Park,a Jihye Shin,a Hugon Lee*a and Seunghwa Ryu*ab
aDepartment of Mechanical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea. E-mail: hoogon99@kaist.ac.kr; ryush@kaist.ac.kr
bKAIST InnoCORE PRISM-AI Center, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
First published on 20th March 2026
Machine learning (ML) surrogate models are increasingly employed to accelerate materials discovery, yet their transferability across heterogeneous databases remains unclear. In this study, we benchmark the accuracy and transferability of composition-based and structure-based surrogate models using three widely adopted two-dimensional (2D) materials databases: Computational 2D Materials Database (C2DB), 2D Materials Encyclopedia (2DMatPedia), and Joint Automated Repository for Various Integrated Simulations (JARVIS-2D). We evaluate predictive performance for energy per atom and bandgap under database-to-database transfer, probe the effects of dataset size and coverage through down-sampling, and analyze error correlations across surrogate models used in this study. Energy-related properties were predicted robustly, whereas bandgap prediction proved to be substantially more difficult due to data imbalance and inconsistencies in DFT parameters across databases. Composition-based models generally exhibited more stable cross-database performance than structure-based models, underscoring that incorporating structural features does not necessarily lead to better generalization. Through down-sampling and error correlation analyses, we demonstrate that cross-database performance is primarily determined by the coverage of the training dataset and that distinct error patterns emerge from differences in the models’ feature representation and architecture. Together, this study provides a systematic characterization of surrogate model robustness across 2D materials databases, offering insights into the factors that determine their transferability in materials discovery.
Despite these advances, a central challenge is generalization under distribution shifts. Surrogate models often achieve high accuracy within the training distribution (i.e., in-distribution (ID)) but deteriorate sharply when applied to out-of-distribution (OOD) data.15 This limitation is critical for materials discovery, as new materials of interest are typically within the OOD domain. Miscalibrated OOD predictions can both waste experimental resources through false positives and obscure important candidates through false negatives.
Accordingly, evaluating surrogate-model performance in genuine OOD settings has become an important topic. Community benchmarks—Matbench,16 Open Catalyst Project,17 MatSciML,18 and JARVIS-Leaderboard19—provide valuable testbeds for assessing model accuracy across diverse material properties. However, these benchmarks generally rely on random train/test splits within the same database, which tend to overestimate OOD performance. Structural and chemical redundancy in materials databases often leads to substantial overlap between the training and test sets, making the resulting evaluation insufficiently sensitive to true distribution shifts.20,21
To counter this issue, many studies introduce artificial OOD splits. Common strategies include element-exclusion schemes (leave-X-out), sparsity-based splits that hold out low-density regions of the data space (sparse-X or Y-single), and property-range-based splits.22–25 While these procedures provide controlled stress tests, they still do not fully capture realistic deployment scenarios.26–28
Building on these observations, the limitations of existing OOD evaluation approaches can be addressed in two complementary directions. The first is to establish more realistic OOD scenarios. In this regard, evaluating surrogate models across heterogeneous materials databases is practically meaningful for two reasons. First, such cross-database settings directly mirror situations encountered in practical materials discovery, where a surrogate model trained on one database may need to be applied to another for the same property, enabling rapid screening and ranking with only minimal calibration.29–31 Second, given their inherent differences in compositional distributions and methodological protocols, these databases provide a realistic setting for evaluating the out-of-distribution performance of surrogate models through cross-database testing.28
The second direction is to expand OOD evaluation to emerging material families. Most existing benchmarks primarily focus on bulk materials, which limits their relevance for emerging material families. A representative example of such an emerging class is 2D materials, which possess atomic-scale thickness and highly tunable properties and have rapidly become a particularly important focus in materials research.32–34 Their reduced dimensionality gives rise to unique electronic, optical, and mechanical behaviors—such as high electrical conductivity in 2D–Cu2Si35 and exceptional Li storage capacity in CrB4 and MoB4 monolayers36—making them attractive for applications in energy storage, sensing, and semiconductors.37–39 The rapid growth of 2D materials research has been accompanied by the increasing use of ML methods for stability, bandgap, and property prediction.40–42 Consequently, a growing number of 2D materials databases have been compiled to train these ML models. However, the heterogeneity of the existing 2D materials databases raises fundamental questions about how reliably surrogate models transfer across them.
For these reasons, we systematically evaluate surrogate models in a database-to-database transfer setting, where models trained on one database are directly applied to another without fine-tuning. To ensure a meaningful assessment of OOD generalization, we select three 2D materials databases with distinct construction protocols—including their DFT settings, material-generation methods, and compositional coverage (see details in the Methods section). Although many other 2D materials databases exist, such as Alexandria,43 MC2D44 and others,45,46 selected databases offer heterogeneous workflows and coverage to serve as a practical testbed for cross-database transferability. Specifically, we consider the Computational 2D Materials Database (C2DB),47 2D Materials Encyclopedia (2DMatPedia),48 and Joint Automated Repository for Various Integrated Simulations (JARVIS-2D).49 By comparing representative composition-based and structure-based surrogate models in OOD settings, we address the following questions:
1. How do the dataset size and coverage influence the extrapolation performance of a surrogate model?
2. How do composition- and structure-based approaches differ in their robustness and generalization?
To this end, we evaluate surrogate models on energy per atom and bandgap. Beyond reporting accuracy and transferability, we employ down-sampling and error correlation analyses to examine how training data size, distributional coverage, and model architecture influence extrapolation performance.
Predictive performance was evaluated using two complementary metrics: mean absolute error (MAE) and Spearman correlation. The MAE quantifies absolute predictive accuracy, while the Spearman correlation coefficient measures the degree to which predicted and true values preserve a monotonic ranking. In addition, we performed down-sampling experiments to probe dataset size and coverage effects and error correlation analyses to examine whether different surrogate classes capture complementary aspects of the data. Together, these analyses provide a systematic framework for assessing model robustness under cross-database transfer.
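The two metrics can be computed in a few lines; `evaluate` is an illustrative helper (not from the paper's code), with scipy providing the Spearman coefficient:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(y_true, y_pred):
    """Return (MAE, Spearman rho) for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))   # absolute accuracy
    rho, _ = spearmanr(y_true, y_pred)       # monotonic-ranking agreement
    return mae, rho

# A monotonic but systematically shifted prediction: perfect ranking
# (rho = 1) despite a nonzero absolute error.
y_true = [0.0, 0.5, 1.0, 2.0]
y_pred = [0.2, 0.7, 1.2, 2.2]
mae, rho = evaluate(y_true, y_pred)
```

The example illustrates why both metrics are reported: a constant offset between databases leaves the ranking (and thus screening utility) intact while inflating the MAE.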
C2DB contains more than 14 000 materials and has expanded through multiple releases (C2DB-2018,50 C2DB-2021,51 and C2DB-202247). Earlier versions combined experimentally synthesized materials with hypothetical structures generated via lattice decoration, while the most recent release further incorporates crystal diffusion variational autoencoder (CDVAE)-generated materials.52 Each entry includes a wide set of properties, from stability-related properties and elasticity to magnetism and band structure.
2DMatPedia comprises more than 6000 materials derived from both exfoliation of layered bulk materials (top-down) and elemental substitution (bottom-up).48 Reported properties include decomposition energy, exfoliation energy and bandgap, all calculated using DFT.
JARVIS-2D contains about 1100 monolayers generated from bulk compounds using standardized DFT workflows.49 Alongside formation and exfoliation energies, it reports electronic descriptors such as bandgap, work function, and electron affinity, as well as application-oriented properties such as dielectric properties, Seebeck coefficient, and theoretical solar cell efficiency.
In this study, we consider two targets: energy per atom and bandgap. Energy per atom refers to the ground-state total energy normalized by the number of atoms, serving as the base quantity for deriving thermodynamic stability metrics such as formation energy and energy above the convex hull. The bandgap, defined as the energy difference between the valence and conduction bands, governs a material's conductivity and is therefore a key descriptor for applications in electronics, optoelectronics, and energy devices.
The three databases differ in their exchange-correlation functionals, plane-wave cutoffs, k-point sampling densities and others (see details in Section S1 and Table S1, SI), which are known to introduce systematic deviations even for identical materials.53,54 Such deviations may confound the interpretation of surrogate-model performance under cross-database transfer. To clarify this effect, we compared the target properties for identical materials that appear in multiple databases. While discrepancies arising from methodological differences do shift absolute values, the deviations are generally small, and the overall trends remain consistent (Fig. S1, Tables S2 and S3, SI). For this reason, both MAE and Spearman correlation were employed to evaluate predictive performance.
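The cross-database consistency check above can be sketched by joining tables on composition and comparing values for the shared entries; the formulas and bandgap values below are purely illustrative, not entries from the actual databases:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-database tables; real entries would come from the
# C2DB / 2DMatPedia / JARVIS-2D downloads.
db_a = pd.DataFrame({"formula": ["MoS2", "WSe2", "BN", "WS2"],
                     "bandgap": [1.6, 1.2, 4.7, 1.9]})
db_b = pd.DataFrame({"formula": ["BN", "MoS2", "GaS", "WS2"],
                     "bandgap": [4.5, 1.7, 2.4, 2.0]})

# Keep only materials present in both databases, then compare values.
overlap = db_a.merge(db_b, on="formula", suffixes=("_a", "_b"))
mae = (overlap["bandgap_a"] - overlap["bandgap_b"]).abs().mean()
rho, _ = spearmanr(overlap["bandgap_a"], overlap["bandgap_b"])
```

A small MAE with a high Spearman coefficient on the overlap, as found in the study, indicates that methodological differences shift absolute values without scrambling the ranking.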
1. Entries corresponding to elemental compounds were excluded.
2. For compositions with multiple polymorphs, only the lowest-energy structure was retained.
3. If the same composition appeared in both the training and test sets, the duplicates were removed from the test set.
Although structure-based surrogate models (which will be detailed in Section 2.4) are capable of distinguishing polymorphs with identical compositions, the same preprocessing rules were applied to both composition- and structure-based models to ensure a fair comparison. After preprocessing, the dataset sizes were 14 131 (C2DB), 5328 (2DMatPedia), and 893 (JARVIS-2D) for energy per atom, and 6579 (C2DB), 5328 (2DMatPedia), and 893 (JARVIS-2D) for bandgap (Table S4, SI). The distribution of the target properties in the preprocessed datasets is shown in Fig. S2, SI.
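The three preprocessing rules can be sketched with pandas; the column names (`formula`, `n_elements`, `energy_per_atom`) and toy entries are illustrative assumptions, not the databases' actual schema:

```python
import pandas as pd

def preprocess(df):
    """Apply rules 1 and 2 to one database table."""
    df = df[df["n_elements"] > 1]              # rule 1: drop elemental compounds
    df = df.sort_values("energy_per_atom")     # rule 2: keep only the
    df = df.drop_duplicates("formula", keep="first")  # lowest-energy polymorph
    return df.reset_index(drop=True)

def remove_train_overlap(train, test):
    """Rule 3: drop test compositions that also appear in the training set."""
    return test[~test["formula"].isin(train["formula"])].reset_index(drop=True)

raw = pd.DataFrame({
    "formula": ["MoS2", "MoS2", "C", "WSe2"],
    "n_elements": [2, 2, 1, 2],
    "energy_per_atom": [-5.1, -4.8, -9.0, -4.6],
})
clean = preprocess(raw)   # elemental C removed; lowest-energy MoS2 kept
test_clean = remove_train_overlap(clean,
                                  pd.DataFrame({"formula": ["WSe2", "MoTe2"]}))
```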
Structure-based models incorporate structural information such as atomic connectivity and bonding environments,62–64 typically via graph neural networks (GNNs).65,66 While these models often achieve better accuracy, they typically demand larger training datasets. We employed five widely used architectures: Graph Convolutional Network (GCN),67 Materials Graph Network (MEGNet),68 SchNet,69 Crystal Graph Convolutional Neural Network (CGCNN),70 and DeeperGATGNN.71 A brief explanation of each model is provided in Section S3, SI, and hyperparameters are listed in Table S5, SI.
Within-database experiments (training and testing on the same source) served as in-distribution (ID) references. ID evaluation of representative models from each class—feature-based models, composition-based neural networks, and structure-based neural networks—was benchmarked, as summarized in Table 1 for Spearman correlation, showing generally similar performance across the models. Corresponding MAE results are provided in Table S6, SI.
Table 1 Spearman correlations for in-distribution (ID) evaluation of representative surrogate models

| Database | Property | RF | CrabNet | DeeperGATGNN |
|---|---|---|---|---|
| C2DB | Energy per atom (eV atom⁻¹) | 0.99 | 1.00 | 1.00 |
| C2DB | Bandgap (eV) | 0.81 | 0.84 | 0.82 |
| 2DMatPedia | Energy per atom (eV atom⁻¹) | 0.97 | 0.98 | 0.99 |
| 2DMatPedia | Bandgap (eV) | 0.76 | 0.75 | 0.73 |
| JARVIS-2D | Energy per atom (eV atom⁻¹) | 0.95 | 0.97 | 0.96 |
| JARVIS-2D | Bandgap (eV) | 0.77 | 0.74 | 0.74 |
In addition to the database-to-database transfer experiments, we conducted two complementary analyses. First, down-sampling experiments reduced large training sets to the size of the smaller databases, enabling separate assessment of dataset size and distributional coverage. Second, error-correlation analyses quantified similarities and differences in prediction-error patterns across model classes, providing insight into whether different surrogate models capture complementary aspects of the data. Together, these analyses extend beyond conventional benchmarking and provide a deeper understanding of the factors governing model generalization across heterogeneous databases.
For energy per atom, C2DB covers the broadest compositional space (Fig. 2(a)), consistent with its larger dataset size (14 131 materials) and inclusion of multi-component compounds generated by the CDVAE model.52 Many of these compositions occupy regions not represented in 2DMatPedia (5328 materials; Fig. 2(b)) or JARVIS-2D (893 materials; Fig. 2(c)), both of which are dominated by materials obtained through bulk-derived or bottom-up approaches. This compositional diversity was quantitatively confirmed by calculating Jensen–Shannon (JS) divergence values in the UMAP-projected space. The JS divergence quantifies the similarity between two distributions, with values toward zero indicating greater similarity. Consistent with the visual observations, the calculated JS divergences show that 2DMatPedia and JARVIS-2D are most similar (0.05), while C2DB shows a slightly higher divergence from 2DMatPedia (0.11) and JARVIS-2D (0.09).
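A JS divergence between two projected point clouds can be sketched via histogram density estimates; `js_divergence_2d`, the bin count, and the Gaussian point clouds are illustrative assumptions, since the paper's exact binning and UMAP settings are not reproduced here:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence_2d(points_a, points_b, bins=16):
    """JS divergence between two 2D point clouds (e.g. UMAP projections),
    estimated from normalized histograms on a shared grid."""
    both = np.vstack([points_a, points_b])
    edges = [np.linspace(both[:, i].min(), both[:, i].max(), bins + 1)
             for i in range(2)]
    h_a, _, _ = np.histogram2d(points_a[:, 0], points_a[:, 1], bins=edges)
    h_b, _, _ = np.histogram2d(points_b[:, 0], points_b[:, 1], bins=edges)
    p = h_a.ravel() / h_a.sum()
    q = h_b.ravel() / h_b.sum()
    # scipy returns the JS *distance*; squaring gives the divergence.
    return jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(2000, 2))
b = rng.normal(0.0, 1.0, size=(2000, 2))   # same distribution: low divergence
c = rng.normal(3.0, 1.0, size=(2000, 2))   # shifted distribution: high divergence
```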
For bandgap, the coverage of C2DB narrows because bandgaps were calculated only for thermodynamically and dynamically stable materials in the database construction workflow. Nevertheless, C2DB maintains the widest range (Fig. 2(d)), whereas JARVIS-2D exhibits the sparsest coverage (Fig. 2(f)). The corresponding JS divergence values are slightly smaller than those for energy per atom, indicating greater similarity across databases: 0.10 for C2DB–2DMatPedia, 0.06 for C2DB–JARVIS-2D, and 0.05 for 2DMatPedia–JARVIS-2D.
Distribution analyses based on UMAP-projected Magpie structural features showed similar trends, with C2DB exhibiting greater structural diversity and JARVIS-2D exhibiting the most restricted coverage (Fig. S3, SI). These differences provide the basis for evaluating cross-database transfer in the following sections.
Fig. 3 Database-to-database performance of surrogate models for (a)–(c) energy per atom and (d)–(f) bandgap.
For energy per atom, Spearman correlations typically range from 0.84 to 0.98, indicating generally consistent results among models and high transferability. The best performance was observed for 2DMatPedia–JARVIS-2D (with averages of 0.95 across composition-based models and 0.96 across structure-based models, Fig. 3(c)). By contrast, models trained on C2DB achieved lower transferability (≈0.85–0.87, Fig. 3(b and c)), despite the larger dataset size. This outcome reflects a distributional mismatch: C2DB contains numerous CDVAE-generated multi-component compositions, which expand the chemical space but limit overlap with 2DMatPedia and JARVIS-2D.
For bandgap prediction, average Spearman correlations ranged from 0.47 to 0.66, indicating substantially lower transferability than for energy per atom. This reduction stems from the well-known, material-dependent errors in DFT bandgap calculations. Self-interaction and the incomplete treatment of electron correlation lead to systematic underestimation, and the magnitude of this error varies across materials due to differences in orbital localization and local bonding environments.73,74 Additional inconsistencies arise from the empirical choices of Hubbard U parameters used in different databases, further introducing non-uniform biases that surrogate models cannot easily learn.75 Moreover, the bandgap distribution is highly imbalanced, with many materials clustered near 0 eV and relatively few wide-gap cases (Fig. S2, SI), further complicating training.
Nevertheless, bandgaps from lower-cost DFT approximations (e.g., GGA-PBE76) often show strong linear correlations with values obtained from higher-level methods such as HSE0677 or GW,78 particularly within specific chemical families. Thus, predicting bandgaps from lower-cost DFT approximations remains valuable, offering an efficient pathway toward more accurate bandgap estimates.79,80
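Such a PBE-to-higher-level mapping reduces to a least-squares linear calibration within a chemical family; the bandgap values below are synthetic, chosen only to illustrate the idea, not results from the cited methods:

```python
import numpy as np

# Illustrative (not from the paper): PBE bandgaps and corresponding
# higher-level (e.g. HSE06) values with an assumed linear relation.
pbe = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
hse = np.array([1.1, 1.8, 2.5, 3.2, 3.9])   # constructed as 1.4 * pbe + 0.4

# Least-squares fit: hse ~ slope * pbe + intercept.
slope, intercept = np.polyfit(pbe, hse, 1)

def calibrate(pbe_gap):
    """Map a cheap PBE bandgap to the higher-level scale."""
    return slope * pbe_gap + intercept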
Within this overall trend, the highest transferability was observed for 2DMatPedia–C2DB (0.63 for composition-based and 0.60 for structure-based models, Fig. 3(d)), while the lowest was found for JARVIS-2D–2DMatPedia (0.55 and 0.48, respectively, Fig. 3(e)). Unlike energy per atom, C2DB-trained models showed relatively better transferability for bandgap, which can be explained by greater local similarity between C2DB training samples and test samples from other databases (Table S7, SI). The roles of global and local similarity in shaping predictive performance are further discussed in Section 3.3.
At the model level, feature-based models served as stable baselines, with GBR achieving correlations up to 0.95 (Fig. 3(c)). Among neural composition-based models, CrabNet consistently achieved the highest performance, while ElemNet showed the weakest results (e.g., 0.77 for C2DB–JARVIS-2D, Fig. 3(c)). For energy per atom, structure-based models showed broadly similar performance, with DeeperGATGNN and CGCNN typically achieving the highest correlations and GCN the lowest. Nevertheless, composition-based models outperformed structure-based models in terms of overall predictive performance. Higher Spearman correlations were generally associated with lower MAE values. Full MAE and Spearman correlation results are presented in Tables S8 and S9, SI, respectively, for composition-based models, with a visual summary of MAE in Fig. S4, SI. The corresponding results for structure-based models are provided in Tables S10 and S11 and Fig. S5, SI, respectively.
These results demonstrate that (i) predictive performance in database-to-database transfer depends not only on the size of the training data but also on their coverage, and (ii) unlike in ID evaluations, cross-database transfer revealed substantial differences in predictive performance among surrogate models trained on the same dataset.
For energy per atom, sensitivity to data size depended on the train–test pairing. For CrabNet (Fig. 4(a)), C2DB–2DMatPedia dropped from 0.88 (full) to 0.72 ± 0.03 when down-sampled to the JARVIS-2D size. By contrast, in C2DB–JARVIS-2D, transferability remained nearly unchanged (0.92 full vs. 0.90 ± 0.01 down-sampled). When trained on 2DMatPedia, both models transferred relatively well to the other databases (Fig. 4(b)). Similar tendencies were also observed for DeeperGATGNN. For bandgap, down-sampling effects were more pronounced. In C2DB–JARVIS-2D, Spearman correlation decreased from 0.66 to 0.50 ± 0.04 for CrabNet and from 0.60 to 0.50 ± 0.03 for DeeperGATGNN (Fig. 4(c)). In 2DMatPedia–C2DB, performance dropped by approximately 0.08–0.11 for both CrabNet and DeeperGATGNN (Fig. 4(d)). Variance across the ten down-sampled subsets increased as the training size decreased.
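A down-sampling experiment of this kind can be sketched as follows; `downsample_transfer` and the toy least-squares surrogate are illustrative stand-ins for the actual CrabNet/DeeperGATGNN training loop:

```python
import numpy as np
from scipy.stats import spearmanr

def downsample_transfer(train_X, train_y, test_X, test_y, fit_predict,
                        subset_size, n_repeats=10, seed=0):
    """Repeatedly down-sample the training set to `subset_size`, retrain,
    and report mean/std of the cross-database Spearman correlation."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(len(train_y), size=subset_size, replace=False)
        preds = fit_predict(train_X[idx], train_y[idx], test_X)
        rho, _ = spearmanr(test_y, preds)
        scores.append(rho)
    return float(np.mean(scores)), float(np.std(scores))

def linear_fit_predict(X, y, X_test):
    """Toy surrogate: ordinary least squares on the raw features."""
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
    return np.c_[X_test, np.ones(len(X_test))] @ w

rng = np.random.default_rng(1)
X_tr = rng.normal(size=(500, 3))
y_tr = X_tr @ [1.0, -2.0, 0.5] + 0.1 * rng.normal(size=500)
X_te = rng.normal(size=(200, 3))
y_te = X_te @ [1.0, -2.0, 0.5]
mean_rho, std_rho = downsample_transfer(X_tr, y_tr, X_te, y_te,
                                        linear_fit_predict, subset_size=50)
```

Repeating over ten random subsets, as in the study, separates the genuine size effect from subset-to-subset coverage variation.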
To isolate the role of coverage independent of dataset size, we compared equal-sized down-sampled subsets. Fig. 5 illustrates this comparison by showing the compositional feature distributions in the UMAP space, in the case of 2DMatPedia–JARVIS-2D bandgap prediction using CrabNet. It contrasts high- and low-performance training subsets, each down-sampled to match the size of JARVIS-2D, with the corresponding test data. Nearest-neighbor distances measure the closeness of each test sample to the training data in the UMAP space.
The global distributional similarity in the UMAP space was nearly identical between the two subsets. Both showed comparable JS divergence (0.09 vs. 0.10) and average nearest-neighbor distances (0.12 vs. 0.13) (Fig. 5(a and b)). In contrast, the local similarity revealed clear differences. Specifically, when examining the ten test samples with the largest nearest-neighbor distances to the training data, the average distance was 0.31 for the high-performance set and 0.38 for the low-performance set (Fig. 5(c)).
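The nearest-neighbor coverage statistics can be sketched with a k-d tree; `coverage_stats` and the synthetic point clouds are illustrative, not the paper's UMAP embeddings:

```python
import numpy as np
from scipy.spatial import cKDTree

def coverage_stats(train_pts, test_pts, k_worst=10):
    """Distance from each test point to its nearest training point
    (e.g. in a UMAP space): mean over all test points, and mean over
    the k_worst least-covered ones."""
    tree = cKDTree(train_pts)
    d, _ = tree.query(test_pts, k=1)   # nearest-neighbor distances
    worst = np.sort(d)[-k_worst:]      # the least-covered test samples
    return d.mean(), worst.mean()

rng = np.random.default_rng(0)
train = rng.uniform(0, 1, size=(1000, 2))
test_in = rng.uniform(0, 1, size=(100, 2))   # inside the training region
test_out = test_in + 2.0                     # shifted outside it
mean_in, worst_in = coverage_stats(train, test_in)
mean_out, worst_out = coverage_stats(train, test_out)
```

Comparing the global mean against the mean over the worst-covered samples mirrors the distinction drawn above between global and local similarity.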
These results demonstrate that predictive reliability depends not only on global distributional similarity but also on how well local regions of the test space are covered by the training data. Consequently, reducing the size of the training set diminishes its coverage and leads to degraded transfer performance. Consistently, larger nearest-neighbor distances in the latent space are associated with lower predictive accuracy.81
Therefore, variations in coverage across databases naturally give rise to differences in predictive performance. A larger training dataset does not necessarily translate into stronger transferability, since size alone does not ensure distributional similarity with the target data. For instance, although C2DB contains far more entries for energy per atom, its transfer performance to JARVIS-2D or 2DMatPedia is not consistently superior to that of models trained on the smaller databases.
For energy per atom, both models achieved high Spearman correlation (0.84–0.98, Fig. 6(a)). CrabNet slightly outperformed DeeperGATGNN when trained on C2DB (0.92 vs. 0.85 for C2DB–JARVIS-2D), whereas the performance was comparable when trained on 2DMatPedia or JARVIS-2D.
For bandgap, the differences were more pronounced (Fig. 6(b)). CrabNet consistently exhibited stronger transferability, outperforming DeeperGATGNN by 0.06–0.10 in C2DB–2DMatPedia and C2DB–JARVIS-2D. The only case favoring DeeperGATGNN was when trained on JARVIS-2D, where it reached 0.61 compared to CrabNet's 0.55.
Overall, these results suggest that simpler composition-based models can be more robust in OOD scenarios, whereas both model classes show comparable predictive performance in ID settings (Table 1), with structure-based models sometimes showing an advantage for certain properties. This trend arises because structure-based models, while benefiting from richer structural information and higher expressiveness in ID tasks, also introduce higher feature dimensionality and greater sensitivity to representation bias, which can exacerbate overfitting under distributional shift. This reflects a broader trade-off, where improving model expressiveness typically enhances ID accuracy but may degrade OOD performance.22,82 Consequently, in scenarios with distributional shift or limited access to structural information, composition-based models represent a more robust choice.
Error correlation analysis using Pearson correlation matrices (Fig. 6(c and d)) further clarified these systematic differences across model classes. A higher error correlation indicates that two models tend to make similar predictions for the same materials. For energy per atom, feature-based models clustered together with correlations of 0.79–0.85. Composition-based neural networks formed a separate cluster (0.70–0.76), with structure-based models ranging from 0.70 to 0.86. By contrast, cross-family correlations were substantially lower (e.g., RF–SchNet: 0.43), highlighting that different model classes capture complementary aspects of the data.
For bandgap, error correlations were more dispersed. Feature-based models still showed relatively high correlation (0.84–0.93), composition-based neural networks ranged from 0.76 to 0.79, and structure-based neural networks from 0.72 to 0.80, except for MEGNet. Correlations between composition- and structure-based models typically fell in the range of 0.60 to 0.70. The tendency of models within the same class to form distinct error-correlation clusters was even more pronounced in other database transfer scenarios (Fig. S6 and S7, SI).
Feature-based models, which rely on a fixed set of predefined features, naturally exhibited high error correlations. By contrast, neural network models learn internal feature representations during training, leading to lower correlations. In addition, error correlations tended to be stronger within composition- or structure-based classes than across them, reflecting the distinct input information used by each class of models.
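An error-correlation analysis of this form reduces to correlating signed residuals across models; the three toy "models" below either share or do not share an error component purely for illustration:

```python
import numpy as np

def error_correlation_matrix(y_true, predictions):
    """Pearson correlation between the signed errors of each model pair.
    `predictions` maps model name -> array of predicted values."""
    names = list(predictions)
    errors = np.array([predictions[m] - y_true for m in names])
    return names, np.corrcoef(errors)

rng = np.random.default_rng(0)
y = rng.normal(size=300)
shared = rng.normal(size=300)              # error component shared by A and B
preds = {
    "A": y + shared + 0.1 * rng.normal(size=300),
    "B": y + shared + 0.1 * rng.normal(size=300),   # same class as A
    "C": y + rng.normal(size=300),                  # independent error pattern
}
names, corr = error_correlation_matrix(y, preds)
```

Models A and B, which share an error source (analogous to models using the same input representation), correlate strongly, while model C does not.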
Taken together, our analysis shows that predictive performance of surrogate models in cross-database transfer is shaped by two main factors. First, the overall level of predictive performance is largely determined by the distributional similarity between the training and test datasets. Second, even with the same training data, the pattern of predictive error varies with the input representation and architecture of the models. This explains why models that perform similarly in ID evaluations can diverge substantially in OOD settings: each model class relies on different input information and processing strategies—using fixed descriptors, stoichiometric embeddings, or graph-based structural representations—resulting in distinct strengths and limitations in OOD transfer.
Through down-sampling experiments, we confirmed that predictive performance depends not only on dataset size but also on its coverage, including both global and local distributional similarity between training and test sets. Error correlation analyses revealed that models within the same class shared similar error patterns. Even for the same material, differences in how each model family represents the input information led to different error behaviors. Additionally, composition-based models often show more stable performance in out-of-distribution settings than structure-based models.
Overall, this study provides a systematic benchmark for surrogate models by introducing database-to-database transfer as a practical OOD evaluation setting in 2D materials discovery. The findings highlight the importance of dataset coverage, property-specific challenges, and model diversity, offering guidance for building more reliable and transferable ML frameworks in materials science. Future work could extend this framework by incorporating additional databases and target properties, as well as by evaluating a broader class of surrogate models, potentially with tailored architectures.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5cp04814a.
This journal is © the Owner Societies 2026