Adedire D. Adesiji,†a Jiashuo Wang,†b Cheng-Shu Kuoc and Keith A. Brown*abd
aDepartment of Mechanical Engineering, Boston University, Boston, MA, USA. E-mail: brownka@bu.edu
bDivision of Materials Science & Engineering, Boston University, Boston, MA, USA
cDepartment of Electrical and Computer Engineering, Boston University, Boston, MA, USA
dPhysics Department, Boston University, Boston, MA, USA
First published on 9th October 2025
A key goal of modern materials science is accelerating the pace of materials discovery. Self-driving labs (SDLs), or systems that select experiments using machine learning and then execute them using automation, are designed to fulfil this promise by performing experiments faster, more intelligently, more reliably, and with richer metadata than conventional means. This review summarizes progress in understanding the degree to which SDLs accelerate learning by quantifying how much they reduce the number of experiments required for a given goal. The review begins by summarizing the theory underlying two key metrics, namely the acceleration factor AF and the enhancement factor EF, which quantify how much faster and better an algorithm is relative to a reference strategy. Next, we provide a comprehensive review of the literature, which reveals a wide range of AF values with a median of 6 that tends to increase with the dimensionality of the space, reflecting an interesting blessing of dimensionality. In contrast, reported EF values vary by over two orders of magnitude, although they consistently peak at 10–20 experiments per dimension. To understand these results, we perform a series of simulated Bayesian optimization campaigns that reveal how EF depends upon the statistical properties of the parameter space while AF depends on its complexity. Collectively, these results reinforce the motivation for using SDLs by revealing their value across a wide range of material parameter spaces and provide a common language for quantifying and understanding this acceleration.
While SDLs are increasingly common, their value proposition has yet to be fully articulated, and different definitions and metrics have been proposed. Several of their virtues can be easily quantified and appreciated, such as how automation allows additional experiments to be performed per unit time.17,18 A more subtle metric is how much they accelerate research, with reports ranging from 2× to 1000×.17 One reason for this challenge is that quantifying the acceleration of research progress requires comparing the advanced strategy to some reference strategy, often necessitating additional experiments that do not directly contribute to the domain science being explored. Nevertheless, studies have established and explored different metrics that quantify the degree to which autonomous experimentation (AE) improves research outcomes. Two metrics that stand out are the acceleration factor AF and the enhancement factor EF, which describe how much faster or better one process is relative to another (Fig. 1B).19,20 These metrics are compatible with experimental campaigns as they do not require the parameter space to be fully explored or the optimum to be known. However, comparisons are not always possible because the values of these metrics are not always reported, they depend on the benchmark approach, and they depend sensitively on the details of the space being explored in a manner that has not been systematically examined for materials.
In this paper, we review the existing experimental results that benchmark the acceleration inherent to SDLs and provide insight into how to interpret these metrics. We begin by defining EF and AF while providing the theoretical foundation for how these should behave in a typical active learning campaign. Next, we summarize the efforts in the community to provide experimental benchmarking. Finally, we perform basic simulations that provide context for interpreting EF in different parameter spaces. This review should help readers interpret reported acceleration values, provide guidance on the circumstances in which active learning is most impactful, and suggest future work in curating high-quality materials datasets for refining algorithms with direct application to materials science.
Here, y can be a scalar or a vector, with the latter being the purview of multi-objective optimization. Like the majority of benchmarking, we consider scalar objectives for simplicity and adopt the language of maximization, although the same logic applies to minimization tasks. The input parameter space X has a finite dimensionality d, and the variables can represent compositions, processing conditions, other conditions of the experiment, or even latent variables found using unsupervised learning. With these definitions, the goal of the campaign is to identify the conditions x* that maximize the objective, i.e. x* = argmax y(x) over x ∈ X, with corresponding optimum performance y* = y(x*).
After experiment number n in the campaign, the progress towards this goal can be quantified by considering how close the current observed maximum ŷ(n), the largest value measured in the first n experiments, is to the true maximum y*. Interestingly, if the campaign proceeds by selecting experiments uniformly at random across X, this average progress has a closed-form solution that depends upon the cumulative distribution function Fy(y).19 Specifically, the average performance after n experiments corresponds to the performance at which there is a 50% chance that no larger value has been observed, or
| ỹ(n) = Fy^−1(0.5^(1/n)), | (1) |
where ỹ(n) is the expected maximum performance from random sampling (i.e. we take an ensemble-median of the maximum of n samples drawn uniformly at random). At n = 1, ỹ(1) = median(y), and ỹ(n) approaches y* as n → ∞. This simple analysis illustrates that the convergence of even a simple decision-making policy depends intimately on the details of the parameter space.
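To make eqn (1) concrete, the sketch below (a minimal illustration written for this discussion, not code from any cited study) estimates the random-sampling baseline ỹ(n) from an empirical pool of property values by evaluating the 0.5^(1/n) quantile; the lognormal pool is a hypothetical stand-in for measured data.

```python
import numpy as np

def random_sampling_baseline(y, n_max):
    """Median best-so-far value expected from uniform random sampling.

    Eqn (1): after n random experiments, the expected maximum is the value
    with a 50% chance that no larger value has been seen, i.e. the
    0.5**(1/n) quantile of the property distribution.
    """
    y = np.asarray(y, dtype=float)
    levels = 0.5 ** (1.0 / np.arange(1, n_max + 1))
    return np.quantile(y, levels)

# Hypothetical, skewed property pool standing in for measured data.
rng = np.random.default_rng(0)
y_pool = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)
baseline = random_sampling_baseline(y_pool, n_max=50)
print(baseline[0], np.median(y_pool))  # n = 1 recovers median(y)
print(baseline[-1], y_pool.max())      # approaches y* as n grows
```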
While it is reasonable to derive closed-form solutions for expected convergence when the property space is known, for real materials systems, y is unknowable except through experiment. The nature of continuous variables and the presence of noise in measurements mean that ground truth will never be completely known. This makes it impossible to predict how fast convergence is expected or even when the process has fully converged. Thus, benchmarking learning using an SDL involves completing two campaigns: an active learning (AL) campaign designed to test the learning algorithm along with a reference campaign guided by a standard method. From a benchmarking perspective, the most relevant data available are the best performance observed in the first n experiments, defined as ŷAL(n) for the AL campaign and ŷref(n) for the reference campaign. There are two main ways of comparing these sets of data.19,20 The first metric is the acceleration factor (AF), defined as the ratio of the number of experiments n needed to achieve a given performance yAF, namely,
| AF = nref/nAL, | (2) |
where nAL is the smallest n for which ŷAL(n) ≥ yAF, while nref satisfies the same condition for the reference campaign. Larger values of AF indicate a more efficient AL process. The second metric is the enhancement factor (EF), defined as the improvement in performance after a given number of experiments, namely
| EF(n) = ŷAL(n)/ŷref(n). | (3) |
The largest possible EF occurs when the AL campaign immediately identifies the global maximum y* while the reference campaign has only reached its expected value median(y) at n = 1. This leads us to define the contrast C of the property space as,
| C = y*/median(y), | (4) |
which bounds the maximum attainable EF.
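As a concrete reading of eqns (2) and (3), the following sketch (our own minimal implementation, assuming only that each campaign is recorded as a list of raw measured values in acquisition order) computes the best-so-far trajectories ŷ(n), EF(n), and the AF at a chosen threshold.

```python
import numpy as np

def best_so_far(y_trace):
    """Running maximum, i.e. the best performance seen in the first n experiments."""
    return np.maximum.accumulate(np.asarray(y_trace, dtype=float))

def enhancement_factor(y_al, y_ref):
    """EF(n) = best_AL(n) / best_ref(n), eqn (3), over the shared budget."""
    n = min(len(y_al), len(y_ref))
    return best_so_far(y_al)[:n] / best_so_far(y_ref)[:n]

def acceleration_factor(y_al, y_ref, y_target):
    """AF = n_ref / n_AL, eqn (2): ratio of experiments needed to reach y_target."""
    def first_hit(trace):
        hits = np.nonzero(best_so_far(trace) >= y_target)[0]
        return hits[0] + 1 if hits.size else None  # experiments are 1-indexed
    n_al, n_ref = first_hit(y_al), first_hit(y_ref)
    if n_al is None or n_ref is None:
        return None  # threshold not reached within one of the budgets
    return n_ref / n_al

# Hypothetical traces: the AL campaign climbs faster than the reference.
y_al = [0.4, 0.7, 0.9, 0.95, 0.96]
y_ref = [0.3, 0.35, 0.5, 0.6, 0.9]
print(enhancement_factor(y_al, y_ref))        # peaks early, then decays
print(acceleration_factor(y_al, y_ref, 0.9))  # 5 / 3 ≈ 1.7
```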
The choice to define progress in terms of the maximum experimentally observed value rather than the maximum value predicted by a surrogate model deserves further discussion. Ultimately, these values will converge as a campaign progresses due to most optimization algorithms naturally including exploitative steps. However, the surrogate model may differ greatly from experiment especially early in the campaign. Thus, in order to confidently assess progress using the surrogate model, one would have to perform an experiment using the parameters that correspond to the maximum value predicted by the surrogate model. This is readily accessible for analytical functions but would double the experimental budget for experimental campaigns. If such validations are desired, an algorithm can mix in purely exploitative steps, but these should count as experiments in the experimental budget.
While “Bayesian optimization” was used as a keyword in the initial Scopus search, this literature analysis includes all SDL papers that report benchmarking including those that do not use BO. That said, almost all SDL studies used BO. Reinforcement learning21 and genetic algorithms22,23 have also been used and their results are included in this analysis.
Having narrowed down the field to a targeted set of papers considering experimental materials data, we set out to more fully compare this subset of the literature. The reviewed literature spans a diverse range of material domains, including electrochemistry,19,24–28 bulk materials discovery,23,24,29–35 spectroscopy and imaging,24,36 mechanics,37–41 nanoparticle and quantum dot synthesis,21,42–46 and solar cell or device optimization.22,24,32–34,47 This diversity underscores the breadth of SDL applications and highlights the variety of experimental contexts in which AF and EF are reported. This breadth of methods makes establishing reproducible results very important. Indeed, reproducibility is a key challenge in SDLs, and efforts to promote reproducibility have employed computer science abstractions,48 novel programming languages,49 and knowledge graph-based approaches.50 That said, the stochastic nature of active learning can lead to campaigns having different values of AF or EF even when performed in the same lab. Thus, while important, experimental benchmarking may not be the ideal method for evaluating reproducibility, at least on the basis of individual campaigns.
Retrospective analysis is the most common type of SDL benchmarking (Fig. 2B). For instance, Rohr et al. used a dataset of 2121 catalyst compositions collected using high-throughput experimentation spanning a six-dimensional electrocatalytic metal oxide space to benchmark various sequential learning models such as Gaussian process (GP), random forest (RF), and least-squares estimation (LE).19 The analysis, which was conducted over 1000 learning cycles, revealed up to a 20-fold reduction in the number of experiments required to find top-performing oxygen evolution reaction catalysts when comparing GP to random sampling. The study also evaluated the effect of exploration–exploitation tuning and dataset type on model performance. Similarly, Liu et al. developed an SDL to optimize the open-air perovskite solar cell manufacturing process and benchmarked its BO framework using a regression model trained on experimental data.34 They ran 300 iteration steps comparing standard BO and BO with a knowledge constraint against Latin hypercube sampling (LHS), factorial sampling with progressive grid subdivision (FS-PGS), and one-variable-at-a-time sampling (OVATS). The BO methods consistently outperformed the others, showing up to a 10-fold enhancement in power conversion efficiency relative to LHS and FS-PGS.
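The retrospective protocol described above can be summarized in a short sketch. This is a generic illustration, assuming a fixed pool of already-measured rows (X_pool, y_pool) and using a scikit-learn Gaussian process with an upper-confidence-bound score; it is not the code used by Rohr et al. or Liu et al.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def retrospective_campaign(X_pool, y_pool, n_init=5, n_total=50, beta=2.0, seed=0):
    """Sequential learning replayed over a fixed, fully measured dataset.

    A surrogate is refit to the rows 'measured' so far, and the next row is the
    unmeasured candidate with the highest upper-confidence-bound score;
    looking up y_pool stands in for running the experiment.
    """
    rng = np.random.default_rng(seed)
    measured = list(rng.choice(len(X_pool), size=n_init, replace=False))
    for _ in range(n_total - n_init):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X_pool[measured], y_pool[measured])
        mu, sigma = gp.predict(X_pool, return_std=True)
        score = mu + beta * sigma
        score[measured] = -np.inf              # never re-select a measured row
        measured.append(int(np.argmax(score)))
    return np.maximum.accumulate(y_pool[measured])  # best-so-far trajectory

def random_campaign(y_pool, n_total=50, seed=0):
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(y_pool), size=n_total, replace=False)
    return np.maximum.accumulate(y_pool[picks])

# Hypothetical stand-in for a measured dataset (e.g. compositions -> activity).
X_pool = np.random.default_rng(1).uniform(size=(500, 6))
y_pool = 1.0 - np.sum((X_pool - 0.3) ** 2, axis=1)
print(retrospective_campaign(X_pool, y_pool)[-1], random_campaign(y_pool)[-1])
```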
Experimental benchmarks, while less common, are the most representative of real-world variability and experimental constraints. For example, Liu et al. had a budget of fewer than 100 process conditions, which restricted experimental benchmarking to standard BO vs. LHS.34 Within 85 process conditions, BO identified four times as many high-performing perovskite films as LHS. As a separate example, Wu et al. benchmarked the efficiency of a BO-guided gold tetrapod nanoparticle synthesis against random search over an experimental run of 30 iterations. The BO algorithm utilized in this work, Gryffin, uses a Bayesian neural network to construct a kernel regression surrogate model. The algorithm was benchmarked based on four hierarchical objectives related to the plasmonic response of the particles. While random sampling occasionally satisfied three of the objectives, it failed to meet the final objective within the experimental budget. One note about experimental benchmarking is that the campaigns used for benchmarking often do not count the experiments performed to establish the bounds of the parameter space or to develop the SDL more generally. This overhead may be substantial, which encourages researchers to use SDLs for prolonged campaigns to amortize it.
Computational analyses, although sampled more selectively in this review due to our focus on benchmarking strategies that use experimental data, remain a valuable tool for comparing algorithmic strategies. Jiang et al. developed a chemical synthesis robot, AI-EDISON, for gold and silver nanoparticle synthesis with the goal of optimizing their optical properties.44 As part of their workflow, they benchmarked AI-EDISON against random search in a simulated chemical space using PyDScat-GPU, a simulation tool based on the discrete dipole approximation. During a campaign with 200 steps, the algorithm outperformed random search by the 27th step, identifying samples from nine of ten spectral classes, and completed all ten by the 78th step. In terms of mean fitness, which measures the similarity of a sample's spectrum to the target, AI-EDISON reached the performance achieved by 200 random steps in just 25 algorithm-guided iterations. Annevelink et al. likewise developed a framework for electrochemical systems, AutoMAT, with input generation from atomic descriptors to continuum device simulations such as PyBaMM.25 Compared to random search, AutoMAT found top-performing Li-metal electrolytes and nitrogen reduction reaction catalysts in 3 and 15 times fewer iterations, respectively.
Across the reviewed SDL papers, which include 42 unique studies and 63 reported benchmarks, the most fundamental and widely adopted baseline is random sampling. MacLeod et al. evaluated their SDL, Ada, for multi-objective optimization of palladium film synthesis, balancing conductivity and annealing temperature.22 In a simulated campaign using a model built from experimental data, Ada's q-expected hypervolume improvement (q-EHVI) strategy achieved twice the hypervolume of random sampling within 25 steps and reached a hypervolume achieved by 10 000 random samples in just 100 steps. Similarly, Bai et al. developed a platform to explore the copper antimony sulfide (Cu–Sb–S) compositional space for photo-electrocatalytic hydrogen evolution. In this experimental benchmarking study, the Bayesian optimizer revealed a Cu–Sb–S composition that exhibited 2.3 times greater catalytic activity than results from random sampling.
Many SDL studies compare performance between algorithms, which frequently includes variants of BO (e.g., differing surrogate models, acquisition functions, or kernels),52 as well as hybridized approaches involving evolutionary algorithms,22,23 or reinforcement learning.21 For instance, Ziomek et al. proposed a length scale balancing GP-UCB (LB-GP-UCB), a BO variant with an upper confidence bound (UCB) acquisition function that aggregates multiple GPs with different length scales to address the challenge of unknown kernel hyperparameters.41 They retrospectively benchmarked the performance of LB-GP-UCB against adaptive GP-UCB (A-GP-UCB),62 maximum likelihood estimation (MLE),63 and Markov chain Monte Carlo (MCMC)64 using the crossed barrel37 and silver nanoparticle65 datasets. For both datasets, LB-GP-UCB consistently found the optimal solution with fewer experiments, specifically requiring 40% fewer trials than MLE and MCMC.
A relatively small number of studies reported performance relative to LHS and grid-based sampling. Gongora et al. developed the Bayesian experimental autonomous researcher (BEAR) to optimize the toughness of crossed barrel structures.37,60 They benchmarked its performance against grid sampling, where the 4D design space was discretized into 600 points, each tested in triplicate. The BEAR, running on a BO framework with an expected improvement (EI) acquisition function, discovered higher-performing structures with 18 times fewer experiments. Also, Bateni et al. developed an SDL, Smart Dope, for the exploration and optimization of the synthesis space of lead halide perovskite (LHP) quantum dots (QDs).42 Using LHS, 150 initial experiments were conducted across the nine-dimensional space to generate training data for closed-loop optimization. Smart Dope, also running BO with an expected improvement acquisition function, achieved a photoluminescence quantum yield (PLQY) of 158% after just four closed-loop iterations, exceeding the 151% maximum obtained by LHS. This suggests that the fixed intervals of LHS and grid-based sampling may over-represent flat regions of a space while missing sharp transitions.
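For reference, the sketch below contrasts how the two baseline designs discussed here are generated, using SciPy's quasi-Monte Carlo module; the four-dimensional unit-cube example and the 81-point budget are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import qmc

d = 4                      # design dimensionality (e.g. the crossed-barrel space)
levels = np.linspace(0.0, 1.0, 3)

# Full-factorial grid: 3 levels per variable -> 3**4 = 81 fixed points.
grid = np.stack(np.meshgrid(*([levels] * d), indexing="ij"), axis=-1).reshape(-1, d)

# Latin hypercube with the same budget: one sample per stratum of each variable,
# so no two points share a coordinate level and the space is filled more evenly.
lhs = qmc.LatinHypercube(d=d, seed=0).random(n=len(grid))

# Either design is rescaled to physical bounds before being run.
lower, upper = [0.0] * d, [1.0] * d
print(grid.shape, qmc.scale(lhs, lower, upper).shape)   # (81, 4) (81, 4)
```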
Human-directed sampling, where expert researchers select experimental conditions based on intuition and domain knowledge, also appears in the reviewed SDL literature, and it provides a useful comparison between SDLs and conventional experimentation. Nakayama et al. benchmarked BO against human-directed sampling using a one-dimensional model of synthesis temperature optimization.51 Human experts required 13–14 trials to find the global optimum, while BO required only ten steps with the appropriate acquisition function and hyperparameters. The search efficiency of BO demonstrated in this simple 1D case is expected to grow in higher-dimensional spaces where human intuition is more limited. Shields et al. benchmarked the performance of BO against 50 expert chemists using high-throughput experimental data covering a ten-dimensional parameter space for optimizing the yield of direct arylation of imidazoles.59 To reduce bias, the performance was averaged across the 50 human participants and 50 runs of the Bayesian optimizer, each conducted over 100 steps. While humans achieved 15% higher yield in the first five experiments, by the 15th experiment, the average performance of the optimizer surpassed that of the humans. BO consistently achieved >99% yield within the experimental budget, and within the first 50 experiments, it discovered the global optimum that none of the experts found.
It should be noted that, as the field matures, future work may focus more on comparing advanced strategies to one another rather than comparing advanced algorithms to comparatively inefficient benchmarking approaches such as random sampling. While this is a valuable pursuit and highly relevant to accelerating materials discovery, it may make it challenging to compare the values reported by different studies. Fortunately, metrics such as AF and EF can be applied in a multiplicative fashion if compared at specific y or n, respectively. Thus, it may be possible to relate such advanced comparisons back to random sampling, which has the advantage of being a deterministic function of the cumulative distribution function of a property space.
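A hypothetical worked example of this multiplicative chaining, with invented experiment counts: if two comparisons share the same performance threshold y, their AFs compose.

```python
# Hypothetical experiment counts, all measured at the same threshold y:
n_random = 120   # experiments random sampling needs to reach y
n_algo_b = 12    # experiments a first algorithm (B) needs to reach y
n_algo_a = 4     # experiments a newer algorithm (A) needs to reach y

af_b_vs_random = n_random / n_algo_b    # 10
af_a_vs_b = n_algo_b / n_algo_a         # 3
# Because all counts refer to the same y, the factors chain multiplicatively,
# relating the advanced comparison back to the random-sampling baseline.
print(af_a_vs_b * af_b_vs_random, n_random / n_algo_a)   # both give 30
```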
Fig. 3 Acceleration factor (AF) vs. input parameter space dimensionality d across benchmarking SDL studies, with corresponding AF frequency.
| Case | Source | AF | Type | Dimension | Comparison | Objective |
|---|---|---|---|---|---|---|
| 1 | Bateni et al.42 | 37.5 | Experimental | 9 | GP-EI vs. LHS | Photoluminescence quantum yield |
| 2 | Cakan et al.32 | 2.5 | Experimental | 3 | GP-EI vs. grid | Film photothermal stability |
| 3 | Fatehi et al.28 | 20 | Experimental | 4 | GP-EI & GP-UCB vs. random search | Catalyst activity |
| 4 | Gongora et al.37 | 18 | Experimental | 4 | GP-EI vs. grid (best grid performance as reference) | Structure toughness |
| 5 | Gongora et al.37 | 56.25 | Experimental | 4 | GP-EI vs. grid (best BO performance within a time budget as reference) | Structure toughness |
| 6 | Gongora et al.38 | 10 | Experimental | 4 | GP-EI (FEA informed) vs. GP-EI (uninformative prior) | Structure toughness |
| 7 | Wu et al.45 | 10 | Experimental | 7 | Gryffin algorithm (BO based on kernel density estimation) vs. random search | Nanoparticle plasmonic response |
| 8 | Borg et al.29 | 2 | Retrospective | 3 | RF-EI & RF-EV (expected value) vs. random search (identifying single target material) | Band gap of inorganics |
| 9 | Borg et al.29 | 4 | Retrospective | 3 | RF-EI & RF-EV vs. random search (identifying five target materials) | Band gap of inorganics |
| 10 | Dave et al.26 | 1.3 | Retrospective | 3 | Random search vs. human | Electrolyte ionic conductivity |
| 11 | Dave et al.26 | 6 | Retrospective | 3 | GP-MLE vs. random search | Electrolyte ionic conductivity |
| 12 | Guay-Hottin et al.52 | 1.42 | Retrospective | 4 | α-πBO (GP-EI with dynamic hyperparameter tuning) vs. standard GP-EI | Structure toughness |
| 13 | Langner et al.33 | 33 | Retrospective | 4 | Bayesian neural network (BNN) vs. grid | Film photostability |
| 14 | Liang et al.20 | 2 | Retrospective | 4 | GP-ARD (automatic relevance detection)-LCB vs. random search | Structure toughness |
| 15 | Liang et al.20 | 8 | Retrospective | 4 | RF-LCB (lower confidence bound) vs. random search | Structure toughness |
| 16 | Liang et al.20 | 4 | Retrospective | 4 | GP-LCB (lower confidence bound) vs. random search | Structure toughness |
| 17 | Liu et al.34 | 61 | Retrospective | 6 | Standard BO & knowledge-constrained BO vs. LHS | Film power conversion efficiency |
| 18 | Lookman et al.31 | 3 | Retrospective | 7 | GP-EI vs. random search | Material electrostrain |
| 19 | Low et al.23 | 5 | Retrospective | 8 | qNEHVI (q-noisy expected hypervolume improvement) vs. U-NSGA-III (unified non-dominated sorting genetic algorithm III) | Concrete slump & compressive strength |
| 20 | Low et al.23 | 20 | Retrospective | 4 | qNEHVI vs. U-NSGA-III | Film conductivity & annealing temperature |
| 21 | MacLeod et al.22 | 100 | Retrospective | 4 | qEHVI (q-expected hypervolume improvement) vs. random search | Film conductivity & annealing temperature |
| 22 | Rohr et al.19 | 10 | Retrospective | 6 | RF-UCB & GP-UCB vs. random search | Catalyst activity |
| 23 | Rohr et al.19 | 5 | Retrospective | 6 | LE (linear ensemble) vs. random search | Catalyst activity |
| 24 | Ros et al.35 | 5 | Retrospective | 6 | GP-EI & Thompson sampling vs. random search | Drug solubility |
| 25 | Thelen et al.27 | 5 | Retrospective | 4 | GP-EI & GP-PI (probability of improvement) vs. random search | Battery cycle life |
| 26 | Thelen et al.27 | 2 | Retrospective | 4 | GP-UCB vs. random search | Battery cycle life |
| 27 | Ament et al.24 | 25 | Computational | 3 | GP-IGU (integrated gradient uncertainty) vs. random search | Phase boundary mapping |
| 28 | Annevelink et al.25 | 3 | Computational | 5 | AutoMat-FUELS (forests with uncertainty estimates for learning sequentially) vs. random search | Catalyst activity |
| 29 | Annevelink et al.25 | 15 | Computational | 10 | AutoMat-FUELS vs. random search | Battery cycle life |
| 30 | Jiang et al.44 | 7.41 | Computational | 5 | Quality diversity (QD) algorithm vs. random search | Nanoparticle extinction spectra |
| 31 | Lei et al.60 | 8 | Computational | 10 | BART (Bayesian additive regression trees) & BMARS (Bayesian multivariate adaptive regression splines) vs. standard BO | Crystal stacking fault energy |
| 32 | Lookman et al.31 | 2 | Computational | 6 | GP-EI vs. RF + EI | LED quantum efficiency |
| 33 | Nakayama et al.51 | 1.3 | Computational | 1 | GP-EI vs. human | Synthesis temperature |
While AF is simple to report, it is subtle to interpret as it depends on the chosen performance threshold. Typically, this threshold corresponds either to a value defined by the researcher or the highest performance achieved during the campaign.22,32 In contrast, EF is easy to calculate at each experiment, and it does not rely on a performance value, making it useful for tracking learning progress.
In order to visualize EF progression over the course of SDL campaigns, we extracted EF from reported performance trajectories (Fig. 4). We limited this analysis to studies that benchmarked against random sampling since this can serve as a common baseline. To enable comparison across studies with different d, we divided experiment number n by d. We focused specifically on experimental and retrospective benchmarking studies, as these are grounded in real experimental data. Examining the computed EF values, a consistent pattern emerges in which EF initially grows with n/d, reaches a peak, and then gradually declines. This indicates that the benefit from active learning is most important early in a campaign, where the algorithm can make rapid progress towards the chosen goal. At higher numbers of experiments, the diminishing marginal gains of active learning combined with the continual progress of random sampling mean that the benefit of active learning becomes less important. In other words, if enough of the parameter space will be sampled, the order in which it is sampled is not important. Interestingly, this peak in EF occurs at ∼10 to 20 experiments per dimension, which provides a useful reference point for the SDL community when planning campaigns. It is worth emphasizing that while EF measured for the SDLs did peak, it was not reported to be worse than random sampling at large numbers of experiments. In order to fully capture the acceleration inherent to SDLs, it would be useful to use multi-fidelity learning or early stopping criteria chosen using simulations.66,67
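The normalization used in Fig. 4 can be reproduced with a few lines; this is a minimal sketch assuming a pair of best-so-far curves (digitized from a reported plot or recorded directly) and a known dimensionality d.

```python
import numpy as np

def ef_peak(best_al, best_ref, d):
    """Return EF(n), the per-dimension experiment axis n/d, and the peak.

    best_al and best_ref are best-so-far (running maximum) curves for the
    active-learning and reference (random-sampling) campaigns, respectively.
    """
    best_al = np.asarray(best_al, dtype=float)
    best_ref = np.asarray(best_ref, dtype=float)
    n = min(len(best_al), len(best_ref))
    ef = best_al[:n] / best_ref[:n]
    n_over_d = np.arange(1, n + 1) / d
    k = int(np.argmax(ef))
    return ef, n_over_d, ef[k], n_over_d[k]

# Hypothetical curves for a d = 4 space: EF rises, peaks, then decays toward 1.
best_ref = np.linspace(0.5, 1.0, 120)                      # random sampling creeps upward
best_al = np.minimum(1.0, 0.5 + 0.012 * np.arange(120))    # AL converges sooner
ef, n_over_d, ef_max, peak = ef_peak(best_al, best_ref, d=4)
print(ef_max, peak)   # peak EF and the n/d at which it occurs
```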
While the number of experiments at which EF peaked was relatively consistent, the peak value of EF varied substantially between studies. However, the difference in magnitude is largely due to two separate metrics both being described as EF. Studies shown as solid lines define EF based upon the enhancement in the property, as we define in Section 2. These studies all have magnitudes in the range of 1 to 2. The analysis in Section 2 reveals that the maximum attainable value for EF computed in this way is C, which depends on the property space. For instance, Zhu et al.,40 using the experimental design via Bayesian optimization (EDBO) package,59 and Li et al.,39 using graph-based Bayesian optimization with pseudo labeling (GBOPL), both benchmarked their algorithms on the crossed barrel dataset and found modest maximum EF values of 1.2 and 1.1, respectively, reflecting the narrower performance gap in this property space. This is similar to the EF of 1.2 observed in the experimental benchmarking study by Gongora et al.,37 the source of the dataset. In contrast, studies shown as dashed lines define EF as the enhancement in the number of high-performing combinations of parameters that have been found. These studies have much larger magnitudes. For example, the largest EF observed in our analysis was 23, reported by Fatehi et al.,28 who applied a Bayesian optimization framework with a UCB acquisition function to quantify the proportion of top-performing oxygen evolution reaction (OER) catalysts identified relative to random sampling, using the dataset by Rohr et al.19 While these variations on EF quantify different things, choosing between them ultimately reflects the priority of the campaign.
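To make the distinction explicit, the sketch below (our own illustration; the threshold y_top is a free choice) computes both EF variants from the same pair of campaigns: the property-ratio definition used in Section 2 and the count-of-top-performers definition.

```python
import numpy as np

def ef_property(y_al, y_ref):
    """EF as the ratio of best observed property values (bounded above by C)."""
    y_al, y_ref = np.asarray(y_al, float), np.asarray(y_ref, float)
    n = min(len(y_al), len(y_ref))
    return np.maximum.accumulate(y_al[:n]) / np.maximum.accumulate(y_ref[:n])

def ef_hit_count(y_al, y_ref, y_top):
    """EF as the ratio of 'top performer' hits found so far, where y_top defines
    what counts as a high-performing result; this variant can reach much larger
    values because it is not bounded by the contrast of the property space."""
    y_al, y_ref = np.asarray(y_al, float), np.asarray(y_ref, float)
    n = min(len(y_al), len(y_ref))
    hits_al = np.cumsum(y_al[:n] >= y_top)
    hits_ref = np.cumsum(y_ref[:n] >= y_top)
    return hits_al / np.maximum(hits_ref, 1)   # guard against zero hits early on
```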
Fig. 5 Simulated Bayesian optimization (BO) campaigns to explore how the property space dictates convergence. (A) Five two-dimensional functions f under consideration that differ only in their contrast C = max(f)/median(f). While all are two-dimensional, they depend on x1 and x2 in the same way and x2 = 0.5 is shown. (B) Simulated horse race plot showing the convergence of BO and random sampling for function fC2. Theory corresponds to eqn (1). The shaded regions show interquartile ranges. (C) EF vs. n for the five functions shown in (A). (D) max(EF) relating BO and random sampling vs. C. Dashed line shows a fit to max(EF) = (αC + 1 − α)/(βC + 1 − β). (E) AF vs. y for the five functions shown in (A) showing that they stop at commensurate values. For all functions, AF is plotted until the best observed value is within 0.01 of y* (i.e. has surpassed the 99.94th percentile of the function).
In a first round of simulations to explore the magnitude of max(EF), we performed optimization campaigns using five functions that differed only in their contrast C (Fig. 5A). As expected, all campaigns achieved a max(EF) at similar n but exhibited very different magnitudes depending on the function (Fig. 5C). Indeed, the theoretical and computed max(EF) followed identical trends and monotonically increased with C (Fig. 5D). These points are fit to max(EF) = (αC + 1 − α)/(βC + 1 − β), which reflects the expected EF comparing two campaigns whose rate of convergence does not depend directly on C. This analysis confirms that while the complexity of the function dictates how many samples are needed to find the optima, its C bounds EF, partially explaining why the literature features such a wide range in reported max(EF).
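The functional form quoted above can be fit directly; the sketch below uses SciPy's curve_fit with invented (C, max(EF)) pairs as placeholders for the simulated values in Fig. 5D.

```python
import numpy as np
from scipy.optimize import curve_fit

def max_ef_model(C, alpha, beta):
    """max(EF) = (alpha*C + 1 - alpha) / (beta*C + 1 - beta): the expected form
    when the convergence rates of the two campaigns do not themselves depend on C.
    Note the model returns 1 at C = 1, i.e. no enhancement without contrast."""
    return (alpha * C + 1.0 - alpha) / (beta * C + 1.0 - beta)

# Hypothetical (contrast, max EF) pairs standing in for simulated campaigns.
C_vals = np.array([1.2, 1.5, 2.0, 3.0, 5.0])
max_ef = np.array([1.05, 1.15, 1.30, 1.50, 1.70])

(alpha, beta), _ = curve_fit(max_ef_model, C_vals, max_ef, p0=[0.5, 0.3])
print(alpha, beta)
print(max_ef_model(C_vals, alpha, beta))   # fitted curve at the sampled contrasts
```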
While EF clearly depends on C, the progression of AF throughout a campaign, along with its maximum, does not (Fig. 5E). In particular, AF is found to monotonically increase throughout a campaign and reach a maximum when the learning algorithm has found y*. Two facets of this trajectory make AF more suitable than EF as a metric for broad comparison. First, AF is insensitive to shifts of the output space, which is congruent with our expectation that adding a constant to the property should not affect the assessed quality of a learning algorithm. Second, its monotonicity makes it easier to compare campaigns with a single value, namely max(AF).
While the functions explored in Fig. 5 exhibited the same complexity, we sought to explore whether one can use simple statistics of a function to gain insight into how many experiments are needed to achieve optimum performance. In particular, we explore the Lipschitz complexity L, which is defined as,68
| L = max|∇f|, | (5) |
the maximum magnitude of the gradient of f across the parameter space. Simulated campaigns on functions that differ only in L (Fig. 6A and B) show that the experiment number corresponding to max(EF) grows approximately linearly with L (Fig. 6C). For comparison, the peak in EF extracted from the literature appears to occur at ∼15 experiments per dimension, which amounts to 30 experiments in the present two-dimensional example. This suggests that the functions explored here share statistical features with the materials spaces previously studied. Importantly, max(EF) increases with L, highlighting that it is more impactful to use active learning in parameter spaces that are more difficult to learn.
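In practice, L can be estimated numerically; the sketch below is a minimal finite-difference estimate for a two-dimensional function, and the two test landscapes are arbitrary examples rather than the functions used in Fig. 6.

```python
import numpy as np

def lipschitz_estimate(f, bounds, n_grid=201):
    """Numerically estimate L = max|grad f| (eqn (5)) for a 2D function by
    finite differences on a dense grid."""
    (a1, b1), (a2, b2) = bounds
    x1 = np.linspace(a1, b1, n_grid)
    x2 = np.linspace(a2, b2, n_grid)
    X1, X2 = np.meshgrid(x1, x2, indexing="ij")
    F = f(X1, X2)
    dF_dx1, dF_dx2 = np.gradient(F, x1, x2)
    return float(np.max(np.hypot(dF_dx1, dF_dx2)))

# Arbitrary example: a sharper landscape has a larger L and, per Fig. 6C,
# needs proportionally more experiments before EF peaks.
smooth = lambda x1, x2: np.sin(np.pi * x1) * np.sin(np.pi * x2)
sharp = lambda x1, x2: np.sin(4 * np.pi * x1) * np.sin(4 * np.pi * x2)
bounds = [(0.0, 1.0), (0.0, 1.0)]
print(lipschitz_estimate(smooth, bounds), lipschitz_estimate(sharp, bounds))
```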
Fig. 6 Simulated BO campaigns to explore how property space complexity impacts learning. (A) Five two-dimensional parameter spaces f under consideration that differ only in their Lipschitz complexity L, as defined in eqn (5). While all are two-dimensional, they depend on x1 and x2 in the same manner and x2 = 0.5 is shown. (B) EF vs. n for the five functions shown in (A). (C) Optimum experiment number corresponding to max(EF) vs. L. The dashed line shows a linear fit. (D) max(EF) vs. noise standard deviation σ normalized by median(y). (E) Number of experiments required to converge vs. σ normalized by median(y).
The analytical spaces considered here are deterministic, while experimental parameter spaces will necessarily feature noise. In an effort to understand how the presence of noise will impact convergence, simulated BO campaigns were repeated for the functions shown in Fig. 6A with homoscedastic Gaussian noise of standard deviation σ added. Both max(EF) and the number of experiments required to converge had a smooth dependence on σ (Fig. 6D), with the most complex functions exhibiting drastic increases in the number of experiments required as noise grew (Fig. 6E). This result indicates that reducing noise becomes more important the more complex the parameter space. The observation that noise slows convergence is consistent with prior analytical work.69 The range of noise was chosen to be analogous to the range of noise typically found in experimental systems. It is not expected that larger values of noise would change the trends observed, only that they would make the simulations take longer to converge.
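A minimal sketch of how such noise might be injected and absorbed by the surrogate, assuming a scikit-learn Gaussian process; the test function, noise level, and WhiteKernel choice are illustrative rather than those used in the simulations above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(0)
f = lambda X: np.sin(np.pi * X[:, 0]) * np.sin(np.pi * X[:, 1])

# Homoscedastic Gaussian noise scaled relative to median(y), as in Fig. 6D/E.
X = rng.uniform(size=(30, 2))
y_true = f(X)
sigma = 0.05 * np.median(y_true)
y_noisy = y_true + rng.normal(0.0, sigma, size=y_true.shape)

# A WhiteKernel term lets the surrogate learn the noise level from the data
# (alternatively, a known variance can be passed via the `alpha` argument).
kernel = Matern(nu=2.5) + WhiteKernel(noise_level=sigma**2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y_noisy)
print(gp.kernel_)   # the fitted noise level appears in the optimized kernel
```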
While these heuristic simulations have focused on single objectives, many recent SDLs focus on multiple objectives simultaneously. That said, the most widely-used approach for multi-objective optimization is hypervolume optimization wherein the algorithm seeks to maximally improve the Pareto front balancing all objectives.22,23 In many ways, once this type of problem has been transitioned into a scalar optimization (i.e. maximizing hypervolume), the same types of benchmarking could be done to compare performance of an active learning algorithm and a reference process. Simulations of such processes reveal similar non-monotonic behavior of EF and monotonically increasing AF,22 suggesting that the principles studied here apply to multi-objective cases as well.
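For completeness, the sketch below shows the scalarization step for two maximized objectives: a hand-rolled dominated-hypervolume calculation (our own minimal implementation, not the q-EHVI machinery used in the cited studies).

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Dominated hypervolume for two maximised objectives, measured against a
    reference point `ref` that lower-bounds both objectives."""
    pts = np.asarray(points, dtype=float)
    pts = pts[(pts[:, 0] > ref[0]) & (pts[:, 1] > ref[1])]
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]      # descending in objective 1
    front, best_y2 = [], -np.inf
    for p in pts:                          # keep only non-dominated points
        if p[1] > best_y2:
            front.append(p)
            best_y2 = p[1]
    hv, prev_y2 = 0.0, ref[1]
    for y1, y2 in front:                   # sum the staircase of rectangles
        hv += (y1 - ref[0]) * (y2 - prev_y2)
        prev_y2 = y2
    return hv

# Hypothetical observations, e.g. (conductivity, -annealing temperature) pairs.
observed = [(0.9, 0.2), (0.7, 0.5), (0.6, 0.4), (0.4, 0.8)]
print(hypervolume_2d(observed, ref=(0.0, 0.0)))   # 0.51 for this toy front
```

Once a campaign is tracked by this single scalar, the EF and AF machinery described above applies unchanged.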
These heuristic simulations have provided context for how to interpret AF and EF values generated by SDL campaigns and guidance for how reparameterizing the input or output space may affect convergence. One reason why EF is an imperfect metric is that shifting the output space by a constant will change EF but not impact the actual learning rate. In contrast, applying a non-linear transform to the property space that reduces L is likely to accelerate convergence. Analogously, narrowing parameter space to focus on regions of interest will similarly reduce L, which provides a mechanism for understanding how approaches such as ZoMBI improve learning.70 AF is likely a more useful metric for comparing algorithms, but it still depends on the length of campaigns and, being monotonic, it does not help experimentalists determine the point of diminishing returns for additional experiments.
(i) SDLs achieve top-performing results on average six times faster than random sampling, and this acceleration improves with the dimensionality of the parameter space.
(ii) The enhancement inherent to SDLs is reported to peak at 10–20 experiments per dimension of parameter space, with enhancement factors that vary tremendously depending on the space.
It is important to highlight that both of these outcomes depend intimately on the nature of the property spaces, but the fact that they all represent actual experimental materials datasets suggests that they are useful guidelines for the field. Further, simulated campaigns in analytical spaces reveal key features of how to interpret these metrics, namely that EF can be related simply to statistics of the parameter space such as its contrast, that the complexity of the space determines the speed with which convergence can be expected, and that noise affects AF more than EF. Despite the simplicity of the heuristic simulations presented here, the fact that they required numbers of experiments to converge similar to those seen in the SDL literature suggests that these functions share statistical features with studied materials systems. With the growing confidence and expertise present in the SDL field, researchers will undoubtedly explore much more complex spaces going forward. While the specific values in this study will hopefully be improved upon in the coming years as more advanced algorithms are employed, they nevertheless provide a valuable snapshot of the field and a useful tool to align progress. While there are many ways to parameterize a function that might be useful to contextualize benchmarking, we have focused on the contrast as defined in eqn (4) and the Lipschitz constant. The former directly bounds EF and is a very straightforward property to compute, while the latter is widely used in machine learning to evaluate models.68,71–74 Other factors can play an important role in optimization, such as multimodality (having multiple local optima) or anisotropy (having very different gradients in different directions). The presence of these and other factors emphasizes that the simulations shown here are heuristic and that more in-depth study is needed. Addressing the materials challenges facing our society demands rapid progress, and a thorough analysis of methods to accelerate this progress is necessary to move the field forward.
Footnote: † These authors contributed equally.