Open Access Article
Panagiotis Krokidas*a, Vassilis Gkatsis†ab, John Theocharis†c and George Giannakopoulosad
aInstitute of Informatics and Telecommunications, National Centre for Scientific Research “Demokritos”, Agia Paraskevi, 15310, Greece. E-mail: p.krokidas@iit.demokritos.gr
bDepartment of Informatics and Telecommunications, National and Kapodistrian University, Athens, Greece
cPhysics Department, National and Kapodistrian University, Athens, Greece
dSciFY PNPC, Athens, Greece
First published on 3rd November 2025
Machine learning (ML) has the potential to accelerate the discovery of high-performance materials by learning complex structure–property relationships and prioritizing candidates for costly experiments or simulations. However, ML efficiency is often offset by the need for large, high-quality training datasets, motivating strategies that intelligently select the most informative samples. Here, we formulate the search for top-performing functionalized nanoporous materials (metal–organic and covalent–organic frameworks) as a global optimization problem and apply Bayesian Optimization (BO) to identify regions of interest and rank candidates with minimal evaluations. We highlight the importance of a proper and efficient initialization scheme for the BO process, and we demonstrate how BO-acquired samples can also be used to train an XGBoost regression model that further enriches the efficient mapping of the high-performing region of the design space. Across multiple literature-derived adsorption and diffusion datasets containing thousands of structures, our BO framework identifies 2–3× more materials within a top-100 or top-10 ranking list than random-sampling-based ML pipelines, and it achieves significantly higher ranking quality. Moreover, the surrogate enrichment strategy further boosts top-N recovery while maintaining high ranking fidelity. By shifting the evaluation focus from average predictive metrics (e.g., R², MSE) to task-specific criteria (e.g., recall@N and nDCG), our approach offers a practical, data-efficient, and computationally accessible route to guide experimental and computational campaigns toward the most promising materials.
Despite these advances, ML-driven discovery remains constrained by several challenges. Chief among them is the paradox between the promise of ML to reduce experimental costs and the substantial data requirements it imposes. Generating sufficiently large, high-quality datasets—whether through experiments or simulations—can be prohibitively expensive, undermining the very efficiency ML aims to deliver.7–9 To address this, significant effort has been invested in smart sampling strategies, broadly referred to as Active Learning (AL),10,11 which aim to minimize the number of required samples while maximizing predictive accuracy.
However, despite their conceptual appeal, many AL strategies struggle to consistently outperform passive learning approaches,12 in which ML models are trained on randomly selected samples. In fact, random sampling remains a surprisingly strong benchmark.11 Moreover, maximizing predictive performance (e.g., via R², mean squared error) may not always align with the practical goals of materials scientists. In many cases, the primary objective is not to model the entire design space, but rather to identify regions containing top-performing materials. As we demonstrate in this work, standard ML metrics often fail to reflect performance in this specific task.
Identifying high-performing sub-regions can be formulated as an optimization problem in which a sampling algorithm iteratively selects new points, not to reduce uncertainty as in active learning, but to maximize an acquisition function. An acquisition function is a heuristic that quantifies the utility of evaluating a candidate point, balancing exploration of uncertain regions with exploitation of high-predicted-value regions. Bayesian Optimization (BO) provides a principled solution in this context, serving as a global optimizer over complex design spaces.13–16 In this work, we adapt BO not only to identify the single best-performing instance, but also to recover an ensemble of the top-N performers (e.g., top-10 or top-100), reflecting the practical needs of materials scientists who often require multiple candidates rather than a solitary optimum. We address the following core research questions:
(1) How many samples are needed to identify regions within large design spaces (containing thousands to hundreds of thousands of materials) that contain top-performing candidates? We note that this number depends strongly on the task at hand, the complexity of the underlying structure–property relationships, and the choice of feature representation.
(2) How many samples are required to identify the single best-performing material in such spaces?
(3) How does our approach compare to an ML model trained on an equal number of randomly selected samples, particularly in terms of ranking the top-performing materials and identifying the global optimum?
While BO is a powerful framework, it can incur substantial computational expense.17 To mitigate this, we introduce frugality-oriented elements. Here, frugality refers primarily to minimizing the number of costly experimental or simulation evaluations required to identify high-performing materials, which is the main bottleneck, but we also consider simple strategies to reduce computational overhead. First, we quantify how the choice of initial samples influences BO's convergence and overall performance. Unlike our baseline method (Random Sampling ML), BO is always initialized with a simple yet effective, informed strategy that combines one central point and two diverse points, ensuring both representativeness and diversity in the initial sampling. This primarily supports experimental efficiency by ensuring informative early evaluations. Next, we evaluate batch sampling strategies—selecting multiple candidates per iteration—to strike an optimal balance between predictive accuracy and runtime efficiency. Batching reduces computational cost by limiting the number of surrogate retrainings, while also enabling parallel experiments in principle. Finally, we show that, by training a machine-learning surrogate (e.g., XGBoost) on the BO-acquired samples after the campaign, we can predict and rank the remainder of the design space. This enrichment step mainly reduces experimental effort by identifying additional top-N candidates without further evaluations, while also providing a lightweight ranking at low computational cost. Fig. 1 summarizes our approach.
We evaluate our method across a diverse collection of literature-based datasets involving gas adsorption and diffusion in functionalized nanoporous materials, including metal–organic frameworks (MOFs) and covalent–organic frameworks (COFs). In all cases, our BO framework outperforms random-sampling-based ML pipelines in both identifying and ranking top-performing candidates. Notably, we not only measure success in terms of top-performer recovery but also assess the quality of the ranking (nDCG; see Section 2.4).
Inspired by recent advancements in Bayesian Optimization (BO) applied to MOFs and related materials, we investigate the potential of BO to guide researchers in allocating a budget of N experiments. The goal is to iteratively direct experiments toward regions of the design space with higher performance, progressively converging on areas of interest and improving efficiency in identifying exceptional materials.
Let us consider a design space where each point corresponds to the design of one material, and each material may differ slightly or greatly from the others in terms of a target property value. So:

| X = {x1, x2, …, xK} | (1) |

where each xi is the feature representation of one of the K candidate materials.
We define a machine learning model as the process p that learns the mapping from the general material space X to this target property y, so:

| y = p(x), x ∈ X | (2) |
Let the dataset used to train this machine learning model be denoted as Dtrain, which consists of N data points, matching the budget for experiments. This can be defined as:

| Dtrain = {(x1, y1), (x2, y2), …, (xN, yN)} | (3) |
Similarly, we can define the test dataset Dtest and the evaluation dataset Deval. Let us now consider a method for evaluating this model, denoted as E(p, Deval), which takes the ML model p and the evaluation dataset and provides a measure of the performance of the model. The logic and metrics used for the evaluation are described in detail in the Evaluation metrics section.
Our goal is to use the least amount of data, less than or equal to the available budget, in order to achieve the best performance score. In practical terms, this problem is inherently multi-objective: we aim to minimize the number of samples required for training while simultaneously maximizing the predictive performance of the model. The trade-off between these two goals is the central question addressed in this work.
BO is an iterative process: at each iteration we train an ML model (called a surrogate model) on the currently acquired data and use this model to predict the target property across the whole design space. An acquisition function is then used to select the most informative data point and add it to the dataset. The surrogate's uncertainty quantification is central to this process, since the acquisition function balances exploration (sampling uncertain regions) with exploitation (sampling high-predicted-value regions). Practically, the surrogate model represents our current beliefs about the target property that we are trying to maximize (or minimize), and the acquisition function selects data points from areas of the data space of which we lack knowledge. Transferring this scheme to our problem, BO determines which experiments should be performed by designating the most promising candidate materials in terms of target property value maximization.
We adopt the open-source implementation of Gantzler et al.,19 which is built on the BoTorch library20 for Gaussian process–based Bayesian optimization, as the foundation for our framework; in Section 2.5, we detail the extensions we introduce on top of this BO implementation.
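To make this loop concrete, the sketch below runs a minimal single-point BO campaign over a finite candidate library using standard BoTorch primitives (SingleTaskGP, analytic ExpectedImprovement). The tensor X, the evaluate oracle, and the helper name bo_loop are illustrative assumptions for exposition, not the authors' released code.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from gpytorch.mlls import ExactMarginalLogLikelihood

def bo_loop(X, evaluate, init_idx, n_iters=100):
    """Single-point BO over a finite library of featurized materials.

    X        : (n, d) torch.double tensor of material features
    evaluate : callable idx -> float; the costly simulation or experiment
    init_idx : indices of the initial samples (e.g., central + diverse points)
    """
    acquired = list(init_idx)
    y = torch.tensor([evaluate(i) for i in acquired], dtype=torch.double)
    for _ in range(n_iters):
        gp = SingleTaskGP(X[acquired], y.unsqueeze(-1))   # GP surrogate, see eqn (4)-(6) below
        fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
        ei = ExpectedImprovement(gp, best_f=y.max())      # acquisition, see eqn (7) below
        mask = torch.ones(len(X), dtype=torch.bool)       # hide already-acquired points
        mask[acquired] = False
        cand = torch.nonzero(mask).squeeze(-1)
        with torch.no_grad():
            scores = ei(X[cand].unsqueeze(1))             # EI of each remaining candidate
        nxt = cand[scores.argmax()].item()                # most promising material
        acquired.append(nxt)
        y = torch.cat([y, torch.tensor([evaluate(nxt)], dtype=torch.double)])
    return acquired, y
```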
In the following paragraphs we clarify the details of our method.
As our surrogate model we selected a Gaussian process (GP) due to its efficiency in representing the uncertainty of our knowledge. The model consists of two parts, a mean function and a kernel (covariance function):
| Y(x) ∼ GP(μ(x), K(x, x′)) | (4) |
| μ(x) = C | (5) |
| K(x, x′) = σ² exp(−‖x − x′‖²/(2ℓ²)) | (6) |

with signal variance σ² and length scale ℓ.
The acquisition function that we have selected is Expected Improvement (EI):
| EI(x) = (μ(x) − y*)Φ(Z) + σ(x)φ(Z), Z = (μ(x) − y*)/σ(x) | (7) |

where y* is the best target value observed so far, and Φ and φ are the standard normal CDF and PDF, respectively (EI(x) = 0 when σ(x) = 0).
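For completeness, the closed form in eqn (7) can be evaluated directly from the GP posterior; a minimal NumPy sketch follows (the function name and the zero-variance guard are our own choices):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for maximization, eqn (7).

    mu, sigma : arrays of GP posterior means and standard deviations
    y_best    : best target value observed so far
    """
    sigma = np.maximum(sigma, 1e-12)   # EI is defined as 0 where sigma = 0
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```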
At the conclusion of the BO process, users gain access to a curated set of high-performing materials from the design space. As we will demonstrate later, these selected points form an information-rich dataset containing instances of optimal performance. This dataset can then be used to train the same predictive model employed in the random sampling approach (XGBoost), enabling it to make predictions across the entire design space and further expand the list of high-performing materials with additional suggested candidates. Consequently, the final selection of top-performing materials is derived from a combined dataset consisting of BO-acquired samples and XGBoost predictions trained exclusively on these samples. As we will show in later sections, this strategy proves highly effective, as the trained model excels at distinguishing and identifying high-performing instances, further enhancing the optimization process. A graphical representation of our pipeline is depicted in Fig. 1.
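A minimal sketch of this enrichment step is shown below, assuming a feature matrix X over the full design space and the indices and values acquired during the BO campaign; the XGBoost hyperparameters shown are illustrative, not the settings used in this work (those are detailed in the SI).

```python
import numpy as np
from xgboost import XGBRegressor

def enrich_top_n(X, acquired_idx, y_acquired, n_top=100):
    """Train XGBoost on BO-acquired samples only, then rank the full space.

    X            : (n, d) feature matrix of the entire design space
    acquired_idx : indices selected during the BO campaign
    y_acquired   : target values measured for those samples
    """
    acquired_idx = np.asarray(acquired_idx)
    model = XGBRegressor(n_estimators=500, learning_rate=0.05)  # illustrative settings
    model.fit(X[acquired_idx], y_acquired)
    rest = np.setdiff1d(np.arange(len(X)), acquired_idx)
    preds = model.predict(X[rest])                # predictions for unseen materials
    # combine measured values with predictions and keep the overall top-N
    values = np.concatenate([y_acquired, preds])
    indices = np.concatenate([acquired_idx, rest])
    return indices[np.argsort(values)[::-1][:n_top]]
```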
Recent works have applied BO directly to nanoporous materials design. Deshwal et al.22 demonstrated that BO can efficiently navigate a database of 70 000 COFs to identify those with the highest methane deliverable capacity, outperforming random search, evolutionary algorithms, and one-shot ML baselines, while also acquiring a significant fraction of the top-performing structures after relatively few evaluations. Gantzler et al.19 extended this idea by employing multi-fidelity BO for COFs in Xe/Kr separations, showing that combining low-cost approximate evaluations with high-fidelity simulations accelerates the search. Together, these studies established BO as a powerful framework for adsorption and diffusion problems in porous materials. In this work, we demonstrate how three complementary elements—diversity-preserving initialization, batch-mode acquisitions, and surrogate enrichment with XGBoost—can be combined into a coherent framework, whose integration provides a practical and effective workflow for materials discovery.
In our experiments, where the design space is finite and the target property values for all candidate materials are known, we can easily rank the materials in descending order and extract the top-N (where N is either 100 or 10, in this work). Ideally, our model's predictions should rank the same materials within the top-N while closely approximating their actual target property values. To evaluate our model's performance in these tasks, we employed the following metrics.
| R² = 1 − Σi(yi − ŷi)²/Σi(yi − ȳ)² | (8) |

| MSE = (1/n)Σi(yi − ŷi)² | (9) |

| recall@N = #(PN ∩ TN)/N | (10) |

| DCG@N = Σi=1…N reli/log2(i + 1) | (11) |

| nDCG@N = DCG@N/IDCG@N | (12) |

where yi and ŷi are the actual and predicted target values, ȳ is the mean of the actual values, PN and TN denote the predicted and true top-N sets, reli is the relevance (true target value) of the material at rank i of the predicted list, and IDCG@N is the DCG of the ideal (true) ranking, so that nDCG@N = 1 corresponds to a perfect ordering.
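The sketch below shows how these ranking metrics can be computed over a finite design space. Following eqn (11), the true target values serve as relevances; this choice of relevance is our assumption for illustration.

```python
import numpy as np

def recall_at_n(y_true, y_pred, n=100):
    """Fraction of the true top-N recovered in the predicted top-N, eqn (10)."""
    top_true = set(np.argsort(y_true)[::-1][:n])
    top_pred = set(np.argsort(y_pred)[::-1][:n])
    return len(top_true & top_pred) / n

def ndcg_at_n(y_true, y_pred, n=100):
    """nDCG@N with true property values as relevances, eqn (11) and (12)."""
    pred_rank = np.argsort(y_pred)[::-1][:n]      # materials as ranked by the model
    true_rank = np.argsort(y_true)[::-1][:n]      # ideal (true) ranking
    discounts = 1.0 / np.log2(np.arange(2, n + 2))
    dcg = np.sum(y_true[pred_rank] * discounts)
    idcg = np.sum(y_true[true_rank] * discounts)  # IDCG@N
    return dcg / idcg
```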
However, purely random initialization can introduce statistical variability, potentially leading to inconsistencies in performance when applying BO to real-case scenarios. To mitigate this, in our work we employ an informed initialization strategy rather than random selection, following the approach of Gantzler et al.19 Specifically, we first determine a central sample by computing the mean of all feature values and selecting the candidate whose features are closest to this mean, which serves as a representative point of the design space. Next, to ensure diversity in the initial training set, we apply a diverse-set selection procedure that, starting with the central sample, iteratively identifies additional samples that maximize the minimum Euclidean distance from the already selected points.
This procedure guarantees that the three initial samples are simultaneously representative and diverse, providing the Gaussian process surrogate model with a robust starting dataset for BO. As shown in the SI (Table S1), this approach yields performance comparable to the average of 20 BO runs with different random initializations (three points each, 100 steps) in identifying the top-100 instances for all datasets considered in this work. We emphasize that this comparison was performed deliberately to confirm that our initialization scheme does not bias performance upward relative to random initialization, but rather offers a practical and robust one-shot alternative in settings where repeated BO restarts are not feasible.
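A minimal sketch of this central-plus-diverse initialization, assuming a NumPy feature matrix X (the function name is ours):

```python
import numpy as np

def informed_init(X, n_init=3):
    """Select one central sample, then max-min-distance diverse samples.

    X : (n, d) feature matrix of the design space
    """
    center = X.mean(axis=0)
    chosen = [int(np.argmin(np.linalg.norm(X - center, axis=1)))]  # central sample
    while len(chosen) < n_init:
        # distance of every candidate to its nearest already-chosen sample
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=-1).min(axis=1)
        d[chosen] = -np.inf                 # never re-select a chosen sample
        chosen.append(int(np.argmax(d)))    # farthest-from-set candidate
    return chosen
```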
We note that the notion of ‘diversity’ depends on the chosen feature representation; different fingerprints (e.g., chemical vs. geometric) can yield different diverse sets.24 The present work adopts the feature sets provided in the literature datasets considered here, but in general, the effectiveness of a diversity-based initialization strategy depends on the availability of a feature representation that meaningfully captures structural and chemical differences.
Fig. S1 in the SI compares single-sample BO with batch sizes of 5 and 10 samples per iteration, evaluating their performance in terms of recall@100 and best-sample identification as functions of sample size and computational time. The test case involves the dataset by Mercado et al.,25 comprising 70 000 COFs evaluated for methane deliverable capacity. Based on this analysis, we adopt a batch size of 5 samples per BO iteration throughout this work, as it provides an effective balance between computational efficiency and performance. Notably, this configuration achieves the same recall@100 and identifies the best-performing COF using 700 samples at just one-tenth of the computational time compared to single sampling. We note that batch BO itself is well established in the literature, particularly through methods such as q-EI.26,27 Here, we adopt a simpler strategy: selecting the top-k EI points per iteration. This makes batching straightforward to implement in similar workflows while retaining the benefits of parallelism and reduced runtime.
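Since we take the k highest-EI candidates instead of optimizing a joint q-EI objective, the batch step reduces to a top-k selection over the EI scores already computed in the single-point loop above; a sketch with illustrative names:

```python
import torch

def select_batch(scores: torch.Tensor, cand: torch.Tensor, k: int = 5):
    """Pick the k unacquired candidates with the highest EI scores.

    scores : EI values for the candidates in `cand`
    cand   : indices of not-yet-acquired materials
    """
    top = torch.topk(scores, k).indices
    return cand[top].tolist()   # evaluate these k materials before retraining the GP
```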
Fig. 1 summarizes our approach, as described in Sections 2.5.1–2.5.3.
Mercado et al.25 reported a database of 70 000 COFs, where they provide the uptake and deliverable capacity of CH4, computed through Monte Carlo simulations. The same COF database was used by Deshwal et al.22 (methane uptake values) for the development of a BO routine that identifies the best candidate material, and by Aksu and Keskin2 in their 2023 work, where they report a high-throughput and ML scheme for the identification of COFs with high CH4/H2 separation performance. Here, we use CH4 uptake and deliverable capacity as target values. Orhan et al.28 reported a database of 5600 MOFs in their high-throughput screening work for O2/N2 separation materials. The target properties we considered were the diffusivity and uptake of O2, and the diffusion selectivity of O2/N2. Another database we considered was the one developed by Majumdar et al.,29 which includes more than 20 000 hypothetical MOFs, along with various gas properties, of which we kept H2 uptake capacity, CO2 uptake, N2 uptake, CO2 working capacity, and CO2/N2 selectivity. This database was also employed by Daoo et al.30 in their work on Active Learning methods for high-performing MOFs for the separation of C2H2/C2H4 and C3H6/C3H8; we kept the C2H2 and C2H4 uptakes as target properties. Villajos et al. in their 2023 work31 reported an extended dataset for H2 adsorption at cryogenic temperatures, providing 3600 MOFs with crystallographic and porous properties, along with volumetric and gravimetric capacities; here we consider the gravimetric capacity as the target property. Finally, Aksu and Keskin2 reported a high-throughput computational screening combined with ML for the identification of high-performing COFs as adsorbents for CH4/H2 separations in pressure-swing and vacuum-swing adsorption (PSA and VSA, respectively); from this work we considered the CH4 and H2 uptakes at 1 bar as target properties.
A ranking of the top-100 predicted values compared with the actual top-100 reveals that the conventional ML (Random Sampling ML) approach identifies only 2 of the true top-100 performing COFs. In contrast, the BO approach successfully selects 20 COFs within the top-100 tier. Moreover, when the XGBoost model is trained on the BO-acquired samples, it identifies an additional 20 top-performing COFs, boosting the overall count to 40.
This result highlights the value of using Bayesian Optimization (BO) to acquire high-interest samples, as it complements and enhances subsequent ML-based ranking. Although the XGBoost model trained on BO-acquired samples exhibits lower overall predictive performance—achieving an R² of 0.70 compared to 0.85 for the model trained on randomly selected samples, along with a higher MSE (see Fig. S2 in the SI)—its focused training on a promising subregion of the design space makes it particularly effective at accurately identifying and ranking the top-performing COFs.
This becomes evident when evaluating the models specifically on the top-100 region: the XGBoost model trained on BO-selected samples achieves a better R² and lower MSE than the one trained on random samples (Fig. S2), despite its lower global metrics. This illustrates that common evaluation metrics such as R² and MSE, when applied over the entire dataset or random subsets, can be misleading in assessing a model's true utility. In scenarios where the goal is to discover rare but high-value regions in the design space, average performance across the whole dataset does not reflect the model's effectiveness in those critical areas.
Thus, our approach uses BO for targeted sample acquisition in a large design space and then employs ML to enrich the top-100 findings, through predictions (Fig. 2).
First, we evaluate the methane uptake dataset for COFs from Mercado et al.25 As shown in Fig. 3(a), although the XGBoost model trained on randomly selected samples (Random Sampling ML) gradually improves its recall@100 with increasing sample size, even 1000 samples yield only marginal gains (recall@100 ≈ 50). In stark contrast, our BO framework achieves a recall@100 of 93 from the very first iterations. Remarkably, BO pinpoints the single best-performing COF with just 50 samples, whereas the random sampling strategy fails to identify the top candidate even after 1000 evaluations.
Moreover, it is worth mentioning that the nDCG values are considerably higher for BO, highlighting the ability of our approach not only to find more of the top-100 instances, but also to position them closer to the actual performance ranks of the materials (see Fig. S3).
Fig. 4(a) demonstrates that for the more challenging methane deliverable capacity target, BO still vastly outperforms the conventional Random Sampling ML approach in terms of recall@100. The BO approach identifies up to 80 of the top-100 COFs—slightly lower than its performance for methane uptake—creating a clear performance gap with Random Sampling ML, which identifies only 19 of the top-100 at 1500 samples. This highlights BO's superior ability to target high-interest regions in complex design spaces.
Fig. 4(b) further emphasizes this advantage when it comes to identifying the single best-performing COF. For methane deliverable capacity, BO requires approximately 300 samples to reliably pinpoint the best COF, compared to just 50 samples for methane uptake. In contrast, the Random Sampling ML approach shows a steady but limited improvement in the best COF value as more samples are added, indicating its difficulty in effectively exploiting additional data to locate the optimum. Moreover, nDCG is consistently higher for BO, reaching almost 1, while Random Sampling ML maxes out below 0.9 (Fig. S3). These results confirm that BO is a highly effective sampling strategy, particularly in challenging scenarios where the design space is vast and the optimal regions are hard to exploit using conventional methods.
Fig. 5–7 summarize the results for the remaining datasets considered in this work. The first two columns of Fig. 5 illustrate the number of samples required by both Bayesian Optimization (BO) and Random Sampling ML to identify the top-100 and top-1 performing MOFs for ethylene and ethane uptake, respectively, based on the dataset from Daoo and Singh.30 It is evident that, in both cases, BO successfully identifies significantly more of the top-100 performing materials with the same number of samples. Furthermore, BO is able to identify the single best-performing MOF within the very first steps, whereas Random Sampling ML exhibits only marginal improvements throughout the search (up to 1600 samples). Even in the case of H2 uptake (Fig. 5(g) and (h)), based on the dataset reported by Majumdar et al.,29 where random sampling shows a comparable ability to BO in identifying top-100 materials after approximately 1000 samples, it fails to identify the best-performing MOF—even after 2000 samples—highlighting the greater efficiency of the BO strategy. Again, the nDCG values of BO remain consistently higher than those of random sampling across all three cases (Fig. S3), highlighting BO's superior ability not only to identify the region containing the top-performing materials, but also to rank them in a manner that more closely reflects their true performance.
The same strong performance in identifying top-performing MOFs is observed when our BO method is applied to the three datasets reported by Majumdar et al.,29 targeting CO2 working capacity, CO2/N2 selectivity, and CO2 uptake (Fig. 6). We draw the reader's attention particularly to the case of CO2/N2 selectivity, where the underlying distribution illustrates the difficulty of the task. Despite this challenge, BO achieves significantly higher identification performance and successfully discovers the best-performing MOF early in the search process. Once again, the nDCG values for BO are considerably higher than those for random sampling (Fig. S3), further demonstrating its superior ranking capabilities.
Finally, Fig. 7 presents the results for three additional datasets: O2/N2 diffusion selectivity in MOFs (from Orhan et al.28), H2 uptake in MOFs (from Villajos et al.31), and CH4 uptake in COFs (from Aksu and Keskin2). Due to the relatively smaller size of these datasets, we focused on the identification of the top-10 performing materials, reducing the evaluation threshold by an order of magnitude compared to previous cases. Even under this more stringent setting, our BO approach consistently outperforms random sampling, both in terms of identifying top performers and in ranking them effectively. BO successfully identifies a greater portion of the top-10 candidates with fewer samples, and—as confirmed by the nDCG scores—produces rankings that more closely reflect the true order of performance.
We envisage several concrete extensions to further enhance our framework's practical utility. First, integrating multi-objective BO methods—such as those by Kim et al.32 and Hoang et al.33—would enable simultaneous optimization of multiple performance criteria (e.g., selectivity vs. capacity). Second, replacing the Gaussian process surrogate with alternative probabilistic models (e.g., Bayesian Neural Networks34 or Gradient Boosting models with uncertainty estimation35) could alleviate the computational and scaling limitations of GPs. Third, human-in-the-loop strategies, as in HypBO,36 would allow domain experts to steer the sampling process in real time, potentially accelerating convergence in difficult regions. Finally, minimizing the effective design space—following the ZoMBI algorithm of Siemenn et al.17—offers a promising route to reduce memory overhead, runtime, and the number of required samples, thereby addressing both the computational and the sampling costs. We are actively exploring these directions, though detailed implementation lies beyond the scope of this work.
Supplementary information: details of the XGBoost and Gaussian process models, comparison of efficient vs. random initialization, batch sampling performance, additional R², MSE and nDCG analyses, and computational setup used in this work. See DOI: https://doi.org/10.1039/d5dd00237k.
Footnote
† These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2025