Kelsey L. Snapp a and Keith A. Brown *ab
aDepartment of Mechanical Engineering, Boston University, Boston, MA, USA. E-mail: brownka@bu.edu
bDivision of Materials Science & Engineering and Physics Department, Boston University, Boston, MA, USA
First published on 19th September 2023
Self-driving labs (SDLs) have emerged as a strategy for accelerating materials and chemical research. While such systems autonomously select and perform physical experiments, this does not mean that the human experimenter has no role to play. Instead, the experimenter must monitor progress of the SDL and make adjustments to ensure that the SDL progresses towards the chosen goal. Unfortunately, researchers rarely receive training specific to the unique challenges inherent to operating SDLs. Here, we provide a heuristic framework for SDL operators. In analogy with how a human might operate a car or other complex system, this framework defines the knobs and the gauges, or the SDL settings that can be modified and the information that an experimenter can consider to change these settings. These lessons are discussed in the context of a common optimization strategy (Bayesian optimization using Gaussian process regression) but can be generalized to other models. Crucially, these adjustments constitute fine tunings that can occur at any point during an SDL campaign, allowing the experimenter to participate in this closed-loop process without being in the loop. As the framework introduced here is material-system agnostic, it can form a resource for researchers developing or using SDLs in any discipline.
The fact that SDLs are self-driving does not mean that they cannot have input from people. Indeed, the construction and programming of SDLs is inherently a human process in which people have provided input, codifying their priorities and goals. This process itself can be complex and important, as it involves turning human domain knowledge into an algorithmic format that can be utilized by the SDL.27 Nevertheless, for most SDL campaigns published to date, this initialization is the end of the meaningful human interaction outside of restocking reagents or other feedstock materials. In particular, the SDL iteratively selects and performs experiments based on the programming until the allotted number of experiments has been reached. While this “hands-off” approach is reasonable when considering SDL campaigns that collect data over a few days, it is increasingly unrealistic as the timeframe of campaigns stretches into weeks and months. Under these conditions, it is very natural for the human experimenters to monitor the SDL and make adjustments. Indeed, there have been recent reports highlighting the importance of human–machine collaboration and how this collaboration can be used as part of otherwise autonomous systems.28,29 However, given that we are only now seeing the widespread adoption of SDLs, there do not yet exist resources to help experimenters know what to monitor and what to adjust.
Here, we present a take on driving school for SDLs based on our experience running the Bayesian experimental autonomous researcher (BEAR). In particular, we have recently concluded a campaign in which we performed >25 000 experiments using this SDL over the span of two years.26 This campaign was a highly dynamic process in which many settings were adjusted based on feedback obtained by monitoring progress made by the SDL. In this manuscript, we detail our key lessons from this process and codify two important aspects, namely the choices an experimenter needs to make and the information that they should monitor to make these choices. We narrow these items down into six settings that must be chosen and four plots that should be monitored periodically throughout the campaign to adjust these settings. Despite our insights coming from a campaign based on studying the mechanical properties of additively manufactured structures, the lessons learned are largely system agnostic. We discuss these lessons using the commonly adopted framework of Bayesian optimization to aid in their adoption by others. While the heuristics presented herein cannot account for every situation faced when operating SDLs, we hope that the language, principles, and processes can provide experimenters with greater intuition and increase the adoption and effectiveness of SDLs across the materials space.
The objective or ground truth function y = f(x) is unknown and heteroskedastic (its variance is not constant across parameter space). The first step of BO is to use the available knowledge to condition a surrogate model f̃(x; θ), or a probabilistic approximation of the objective function that depends on hyperparameters θ. Available knowledge can include, for example, experimental data, physical knowledge about the system such as expected symmetries, or related prior knowledge such as simulation results. Second, the surrogate model is used as an input to an acquisition function a(x; θ) which quantifies the expected benefit of performing an experiment at any location in X. The maximum of a across X is generally selected as the next experiment.
Each step of BO requires the experimenter to set a number of parameters whose values may not be obvious or physically intuitive. Crucially, the BO process, and the operation of the SDL more generally, need not be static, and the experimenter should continually revisit settings based on recent results to improve the chances of converging towards the best outcome. Below, we highlight the actions that the experimenter can take (Section 3) and then detail the feedback that forms the basis for how to choose which actions to perform (Section 4).
The region of X that is considered as the parameter space is a non-trivial choice that can be crucial to the success of an SDL campaign. In physical experiments, each component of x corresponds to a tangible physical quantity such as the identity of a compound, a processing temperature, or the length of a geometric feature. As such, not all permutations of x correspond to valid experiments and there is no guarantee that the space of valid experiments is a convex region. In the simplest case, when each component of x is independent, the full parameter space can be considered a hypercube. Even in this case, determining the maximum and minimum values for each parameter is often a judgement call based on experimenter intuition. More generally, there exists some geometrically complicated domain in X that corresponds to valid experiments. If the boundary of valid experiments is known at the beginning of the campaign, invalid experiments can be avoided. In contrast, if the boundary is unknown but the system is such that invalid experiments can be attempted, the boundary of this domain can itself be learned during the campaign.32 In addition to the range of parameters considered, it can be useful to transform the parameter space such that correlations are easier to learn. For instance, monotonic transformations such as taking the logarithm of a parameter can facilitate learning if the parameter is positive definite (always greater than zero) and varies across several orders of magnitude.
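As an illustrative sketch of this kind of transformation, the snippet below maps two hypothetical parameters, a positive print speed spanning several orders of magnitude and a temperature, onto a normalized model space, with the speed handled on a logarithmic scale. The variable names and bounds are placeholders for illustration, not settings from any specific campaign.

```python
import numpy as np

# Hypothetical bounds for two parameters: a print speed (mm/s) that is
# positive and spans orders of magnitude, and a nozzle temperature (deg C).
bounds = {"speed": (0.1, 100.0), "temperature": (190.0, 260.0)}

def to_model_space(speed, temperature):
    """Map raw parameters to a normalized [0, 1]^2 space for the surrogate.

    The speed is log-transformed first so that equal distances in model
    space correspond to equal ratios, which is often easier to learn when
    a positive parameter varies across several orders of magnitude.
    """
    lo_s, hi_s = np.log10(bounds["speed"][0]), np.log10(bounds["speed"][1])
    lo_t, hi_t = bounds["temperature"]
    x1 = (np.log10(speed) - lo_s) / (hi_s - lo_s)
    x2 = (temperature - lo_t) / (hi_t - lo_t)
    return np.column_stack([x1, x2])

# Example: three candidate experiments expressed in physical units.
X = to_model_space(np.array([0.5, 5.0, 50.0]), np.array([200.0, 220.0, 240.0]))
print(X)
```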
When multiple metrics are to be optimized simultaneously, multi-objective acquisition functions such as expected hypervolume improvement convert the model predictions ỹ(x) into the predicted amount of additional hypervolume in the output space that is expected to be enclosed from that experiment.
Even when only a single metric is important, there are impactful choices to make in how this variable is quantified. While linear transformations such as normalization should not affect SDL progress, provided that the numbers are not made so large or small that they lead to numerical precision errors, non-linear transformations can also be used to compress or expand important domains in y. For example, when y has a finite range (say an efficiency that varies from 0 to 1), the model will often predict non-physical values (i.e. ỹ > 1 or ỹ < 0). Thus, performing transformations that enforce these physical constraints can be helpful. For example, building a model to predict y′ = arctanh(2y − 1) both preserves the physical constraints on y and expands the domains associated with high and low values.26
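A minimal sketch of this transformation and its inverse is shown below; the clipping tolerance is an assumption added to guard against values of exactly 0 or 1 and is not prescribed by the text.

```python
import numpy as np

def to_model_output(y, eps=1e-6):
    """Map a bounded metric y in (0, 1), e.g. an efficiency, to the real line.

    y' = arctanh(2y - 1) expands the regions near 0 and 1 and guarantees
    that back-transformed predictions respect the physical bounds.
    """
    y = np.clip(y, eps, 1.0 - eps)  # guard against exactly 0 or 1
    return np.arctanh(2.0 * y - 1.0)

def from_model_output(y_prime):
    """Invert the transform so predictions are reported on the original scale."""
    return 0.5 * (np.tanh(y_prime) + 1.0)

y = np.array([0.02, 0.50, 0.97])
y_prime = to_model_output(y)
print(y_prime, from_model_output(y_prime))  # round trip recovers y
```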
A central choice is the surrogate model itself. The most commonly applied model is Gaussian process regression (GPR). This model is highly applicable in the low- to medium-data regimes and is relatively free of bias, despite only being able to model surfaces that are infinitely differentiable.
To gain intuition about the function of a GPR, it can be thought of as a tool for predicting the expectation ỹ(x) at any x as the weighted average of the prior belief μ0(x) and the results of all prior experiments yi(xi). Commonly, the prior is assumed to be constant and therefore usually considered uninformative, although more advanced methods can be used, such as setting this prior using low-fidelity measurements including simulation-based approximations of the experimental function.19 The weighting used for this prior is based on a hyperparameter θ0, which represents the total range of experimentally realizable values. More interesting are the weights chosen for each experimental measurement. The underlying assumption is that points closer to the sampling point are more highly correlated, although each dimension of x could feature correlations with different length scales. A common strategy is to define a kernel function that models how correlations vary in space, with common choices being squared exponential or Matérn functions.39 Such functions feature a set of hyperparameters θ that represent the distance in each dimension over which the output is expected to be highly correlated. This kernel function, together with the physical measurement uncertainty σy, determines the weighting of each previously measured experiment. A similar process can be used to predict the uncertainty in ỹ(x), termed σ̃2(x). The hyperparameters θ, θ0, and σy are crucial to GPR performance and are typically set using maximum likelihood estimation. It is important to note that σy is an estimate of the measurement uncertainty used by the GPR; it can either be independently estimated by performing control experiments or chosen using statistical methods such as maximum likelihood estimation. Either way, the decision to use a single value of σy across the whole parameter space is an approximation that deserves careful consideration. For a more nuanced look at GPRs, their mathematical underpinnings have been outlined in prior publications.31,40
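For concreteness, the following sketch conditions a GPR with a Matérn kernel, a constant signal variance, and a fitted noise term using scikit-learn. The toy data, kernel settings, and the choice of this particular library are illustrative assumptions rather than a description of the BEAR implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

rng = np.random.default_rng(0)
X = rng.random((50, 3))                                   # 50 experiments in a normalized 3D space
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)   # toy responses

# ConstantKernel plays the role of theta_0 (overall signal variance),
# Matern carries the per-dimension length scales theta, and WhiteKernel
# estimates the measurement noise sigma_y by maximum likelihood.
kernel = (ConstantKernel(1.0) *
          Matern(length_scale=[0.2, 0.2, 0.2], nu=2.5) +
          WhiteKernel(noise_level=1e-2))
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

x_query = rng.random((5, 3))
y_mean, y_std = gpr.predict(x_query, return_std=True)     # y~(x) and its uncertainty
print(gpr.kernel_)                                        # fitted hyperparameters after MLE
print(y_mean, y_std)
```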
It should be noted that there are vastly more sophisticated approaches than the simple GPR-based approach described above. For instance, rather than assuming that correlations between points are translationally invariant (i.e. that correlations depend solely on the distance between points and not on their location in parameter space), one can define non-stationary kernel functions that capture nuances in the parameter space itself.41–44 Additionally, there are hybrid methods such as deep kernel learning (DKL) in which a deep neural network is trained to transform the input data to a GPR.45,46 This concatenation of models allows the system to accommodate variable length-scale correlations and discontinuities in the input space.
In addition to data-driven methods to modify the expected correlations in space, there are considerable opportunities for the experimenter to bring physical intuition into the modeling effort. In particular, it is possible to transform the input space based on compound variables that are expected to more closely connect to the physical output. Simple examples of this process could include dividing the masses of reagents by their volumes to learn on concentrations or taking the ratio of geometric parameters to learn on aspect ratios.26 While such anecdotes appeal to our intuition, the choice of how to represent the variables that define the parameter space is an underappreciated facet of SDL operation and deserves concerted study.47
The acquisition function converts the surrogate model predictions ỹ(x) and σ̃2(x) into a scalar quantity that relates to the predicted benefit of sampling at that location. A key consideration here is balancing the needs of exploration and exploitation.31 Exploration prioritizes sampling in regions with little prior information while exploitation prioritizes sampling in regions believed to be high performing as a means of optimizing this performance. One acquisition function that displays this dichotomy well is the upper confidence bound (UCB), in which aUCB = ỹ(x) + λσ̃2(x), where λ reflects a weighting in which λ = 0 would be considered pure exploitation while λ → ∞ would be considered pure exploration. Another commonly used acquisition function is expected improvement (EI), in which aEI is proportional to the amount by which the sampling point is expected to exceed the highest previously observed point. Like surrogate modeling, there have been innovations in the development of advanced acquisition policies, such as those that incorporate simulation data19,32 or expert knowledge.48
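The sketch below implements UCB, using the variance-weighted form written above, and a standard EI expression evaluated at a set of candidate predictions; it is a generic illustration rather than the policy of any specific campaign.

```python
import numpy as np
from scipy.stats import norm

def ucb(y_mean, y_std, lam=1.0):
    """Upper confidence bound, a_UCB = y~(x) + lambda * sigma~^2(x).

    lam = 0 is pure exploitation; large lam approaches pure exploration.
    """
    return y_mean + lam * y_std**2

def expected_improvement(y_mean, y_std, y_best, xi=0.0):
    """Expected improvement over the best observed value y_best."""
    y_std = np.maximum(y_std, 1e-12)       # avoid division by zero
    z = (y_mean - y_best - xi) / y_std
    return (y_mean - y_best - xi) * norm.cdf(z) + y_std * norm.pdf(z)

# Given predictions at candidate points (e.g. from the GPR sketched above):
y_mean = np.array([0.4, 0.7, 0.6])
y_std = np.array([0.30, 0.05, 0.15])
print(ucb(y_mean, y_std, lam=5.0))
print(expected_improvement(y_mean, y_std, y_best=0.65))
```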
Another consideration when choosing an acquisition function is whether experiments are performed individually or in batches.49–52 When experiments are performed in batches, it is necessary to take action to make sure that each point will target a different region in X. One approach, for instance, is to select the first experiment in the batch, then recondition the GPR assuming that the point was sampled and either returned the GPR mean (Kriging believer) or a constant value (constant liar), and use this updated GPR to predict the next point, iterating until all batch points have been selected.50 Alternatively, a simple penalty can be applied after selecting each experiment to force subsequent experiments to be further away in X.
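A minimal sketch of the Kriging believer strategy is given below, assuming a fitted scikit-learn-style GPR such as the one sketched earlier and a UCB acquisition; the function name and arguments are hypothetical.

```python
import numpy as np

def kriging_believer_batch(gpr, X_train, y_train, candidates, batch_size, lam=1.0):
    """Select a batch of experiments by iteratively refitting on 'believed' results.

    After each selection, the GPR mean at that point is appended as if it were
    a real measurement (Kriging believer), which suppresses the acquisition
    function nearby and pushes later picks into different regions.
    """
    X_aug, y_aug = X_train.copy(), y_train.copy()
    batch = []
    for _ in range(batch_size):
        mu, sd = gpr.predict(candidates, return_std=True)
        best = int(np.argmax(mu + lam * sd**2))    # UCB as written above
        batch.append(candidates[best])
        X_aug = np.vstack([X_aug, candidates[best]])
        y_aug = np.append(y_aug, mu[best])         # believe the GPR mean
        gpr = gpr.fit(X_aug, y_aug)                # recondition the surrogate
    return np.array(batch)
```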
A further choice is how many and which locations in X to evaluate as candidate experiments. Exhaustively evaluating every possible location is nearly always impractical, especially when dealing with continuous variables. Thus, it becomes necessary to choose a subset of points to sample.53 Grid and random sampling are unfavorable for opposite reasons: grid-based search will fail to locate a maximum point if it is not on the grid, while random sampling fails to ensure evenly spaced sampling points (Fig. 2a and b). In contrast, Latin hypercube sampling (LHS) presents a way to cover all of parameter space while introducing some randomness that prevents the same precise point from being reconsidered multiple times in subsequent steps (Fig. 2c). In terms of how many points to select, the curse of dimensionality plays a major role. Specifically, if X has d dimensions, sampling 10 conditions per dimension leads to ∼10^d points, which becomes intractable when d > 7 despite 10 conditions providing only a sparse sampling of space. Thus, in any reasonably high dimensional space, it becomes necessary to couple an initial sampling with a refinement or optimization process. For example, a second cluster of LHS points may be collected near the maximum observed during the first round (Fig. 2d). Alternatively, the maximum or maxima of the first round can be used as input points to other optimization algorithms such as gradient ascent or genetic algorithms to find local maxima of a.33
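As a sketch of this two-stage approach, the snippet below draws a coarse LHS over a normalized space and then a second, tighter LHS around the best coarse point using scipy's quasi-Monte Carlo module; the placeholder acquisition function and the window width are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import qmc

d = 4                                        # dimensionality of the parameter space
sampler = qmc.LatinHypercube(d=d, seed=1)

# First pass: LHS across the full normalized parameter space.
X_coarse = sampler.random(n=256)

# Placeholder acquisition function peaked at x = 0.3; in practice this would
# be, e.g., UCB evaluated with the GPR surrogate.
def acquisition(X):
    return -np.sum((X - 0.3) ** 2, axis=1)

x_best = X_coarse[np.argmax(acquisition(X_coarse))]

# Second pass: a tighter LHS cloud centered on the best coarse point.
width = 0.1
lo = np.clip(x_best - width / 2, 0.0, 1.0)
hi = np.clip(x_best + width / 2, 0.0, 1.0)
X_fine = qmc.scale(qmc.LatinHypercube(d=d, seed=2).random(n=256), lo, hi)

x_next = X_fine[np.argmax(acquisition(X_fine))]
print(x_next)
```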
An interesting related point is how to deal with outliers. Algorithmically rejecting outliers is a risky strategy, as in a needle-in-a-haystack type search the needle is likely to look like an outlier. For this reason, it is easier to justify including or excluding data based on its position in X rather than its measured value of y. That said, all available data and characterization should be used to assess whether a given experiment was conducted correctly and therefore whether the measured value can be trusted.
The first plot to examine is a response scatter plot, in which the observed y is plotted against each component of x with the most recently added data highlighted. Older data points can be colored to illustrate the order in which they were collected (Fig. 3). Looking at one-dimensional slices of a multidimensional space can only offer limited insight, but one thing that such plots can do quite well is illustrate whether top-performing points are on a boundary. Such an observation would be an indication that the boundaries of X should be enlarged, if possible, to capture this trend, thus providing feedback on the parameter range (A). We note that if boundaries are not independent, rather than plotting y vs. each component of x, it may be more useful to plot y vs. the distance from the nearest boundary.
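A minimal matplotlib sketch of such a response scatter plot, with synthetic data standing in for an experimental history, might look like the following.

```python
import numpy as np
import matplotlib.pyplot as plt

# X is the (n, d) history of normalized experiments and y their responses;
# the last row is taken to be the most recent experiment.
rng = np.random.default_rng(3)
X, y = rng.random((200, 2)), rng.random(200)

fig, axes = plt.subplots(1, X.shape[1], figsize=(8, 3), sharey=True)
order = np.arange(len(y))
for j, ax in enumerate(axes):
    # Older points fade to light gray; the most recent point is a blue star.
    ax.scatter(X[:-1, j], y[:-1], c=order[:-1], cmap="gray_r", s=10)
    ax.scatter(X[-1, j], y[-1], marker="*", s=150, color="tab:blue")
    ax.set_xlabel(f"x{j + 1}")
axes[0].set_ylabel("y")
plt.tight_layout()
plt.show()
```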
Fig. 3 Response scatter points. (a) Observed response y vs. x1. Points colored based on the order that they were tested (light gray to black), with the most recent test data shown as a blue star. In this panel, the peak of y is firmly within the boundaries of x1, indicating that the range is likely appropriate. (b) Observed y vs. x2. In contrast with x1, which is a design variable that can be chosen continuously, x2 corresponds to a material property and therefore only discrete values corresponding to known materials can be selected. In this panel, the peak of y is on a boundary that cannot be expanded due to material constraints, pointing to potential material development goals. Data corresponds to experiments and is adapted from ref. 26.
Fig. 4 Parity plots. (a) Predicted response ỹ vs. y built from 13 250 data points. The red line represents ỹ = y and pink lines represent the estimated measurement error σy. The blue star represents the latest experimental prediction and the subsequent result. (b) Same model as (a), but zoomed in on the upper section of the data to highlight the flat plateau that occurs at ỹ ≈ 0.7, leading to significant underprediction of performance for the latest experiment. (c) Model built from 4060 data points near max(y) shows improved performance at high values, which are the most important in a maximization problem. The predicted value of σy is also decreased. (d) Model built from 410 data points illustrating deteriorated performance and a very small estimated σy due to too little data. Data corresponds to experiments and is adapted from ref. 26.
The overall behavior of the parity plot is also crucial in validating the surrogate model (C). Two aspects of fit quality should be immediately apparent when examining the parity plot. First, the spread of the data about the trend line provides an excellent proxy for prediction accuracy, and this can be quantified using the square of the Pearson's correlation coefficient r2. Since these are in-sample predictions on the training data, conditions exist under which the model could match the data perfectly. However, it is important to remember that the model is Bayesian with the expectation that experimental data has uncertainty, and thus the spread about the parity line should approach σy. As such, one would expect ∼65% of data points to fall between lines parallel to the parity line but offset by ±σy, shown as light red lines in Fig. 4. If nearly all of the data falls between these lines, this is a signal that the model is overfitting. While generally r2 will increase as the volume of data increases, it is also highly dependent upon the model hyperparameters θ and any transformations made to the input space. Thus, a low r2 reflects the need to pay careful attention to these terms. We recommend setting these terms by minimizing cross-validation error or through maximum likelihood estimation. In addition to the spread about the parity line, if the residuals of the data are not uniformly distributed, it may reflect an incorrect bias in the model. As GPRs are nearly bias-free, they should not exhibit this phenomenon.
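The two checks described above can be computed directly from the training data, as in the short sketch below; the example values are synthetic.

```python
import numpy as np

def parity_diagnostics(y_true, y_pred, sigma_y):
    """Summaries for a parity plot: r^2 and the fraction of points within +/- sigma_y.

    A fraction near ~0.65 is consistent with the assumed measurement noise;
    a fraction near 1 suggests the surrogate is overfitting.
    """
    r = np.corrcoef(y_true, y_pred)[0, 1]
    within = np.mean(np.abs(y_pred - y_true) <= sigma_y)
    return r**2, within

y_true = np.array([0.2, 0.4, 0.5, 0.6, 0.7])
y_pred = np.array([0.25, 0.38, 0.55, 0.58, 0.66])
print(parity_diagnostics(y_true, y_pred, sigma_y=0.05))
```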
The parity plot can also clearly communicate what data should be included in the model (F). In particular, for maximization problems, the data that reflects the largest values of y are in many ways the most important. A common occurrence with GPRs is that they will effectively smooth out sharp peaks in parameter space due to their inability to model wide low-performing regions while also modeling narrow high-performing regions, or needles in a haystack. This problem is manifest in the parity plot through a flattening of ỹ at the high end (Fig. 4b). Under these circumstances, it will be very challenging to tease out accurate predictions in high-performing regions. One solution is to adjust the data used to train the model (F). In particular, one can retrain the model using only the data in the proximity of the high-performing points, thus ensuring that the model accurately captures this region (Fig. 4c). Variations exist to this approach, such as omitting data based on y (only including high performing experiments), the location in X relative to argmax(y) (only including experiments near the best performing experiment), or the location in X relative to argmax(ỹ) (only including experiments near the best predicted sampling point of an initial GPR model). However, if too few data points are included, the model performance can deteriorate (Fig. 4d).
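One simple way to implement the argmax(y)-based variant is sketched below; the radius is a hypothetical tuning choice, and the commented usage assumes a GPR like the one sketched earlier.

```python
import numpy as np

def subset_near_best(X, y, radius=0.2):
    """Return only the experiments within `radius` of the best observed point.

    Refitting the GPR on this subset can sharpen predictions near the peak
    at the cost of discarding information about the rest of the space.
    """
    x_star = X[np.argmax(y)]
    keep = np.linalg.norm(X - x_star, axis=1) <= radius
    return X[keep], y[keep]

# Example usage with a previously fitted kernel and data:
# X_sub, y_sub = subset_near_best(X, y, radius=0.2)
# gpr_local = GaussianProcessRegressor(kernel=kernel).fit(X_sub, y_sub)
```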
Another useful visualization is the proximity plot, in which the observed y of each prior experiment is plotted against its distance ρ = [Σj(xj − x*j)2]1/2, where x* is the location of the most recently sampled point (Fig. 5) and the subscript j corresponds to an iterator over each dimension in the parameter space. A virtue of plotting data in this way is that it makes very clear the degree to which the SDL is prioritizing exploration vs. exploitation for the most recent experiment. Simply put, the further the sampled point is from nearby points, the more exploration focused the choice of the policy. This makes such plots very useful for evaluating the behavior of the decision policy (D) and ensuring that it is behaving as expected. This is also very useful when using policies like EI that themselves balance exploration and exploitation. The distance of the nearest point is a clear indicator of how the policy is resolving this tradeoff. In addition to providing feedback on the acquisition function, proximity plots can be useful in helping select the data used to train the model (F). When training a model with some subsample of the available data, it is useful to indicate this data on the proximity plot to illustrate the complexity of the subspace under consideration.
Fig. 5 Proximity plots. Observed y vs. distance to the selected point ρ. Black points represent previous experiments, while the blue star represents the selected experiment and its predicted performance ỹ. The distance of the point closest to the star is an indicator of the exploration/exploitation tradeoff. In these panels, this tradeoff is illustrated by selecting a point based on the upper confidence bound with varied hyperparameter λ. (a) λ = 0, pure exploitation. (b) λ = 1, balance of exploration/exploitation. (c) λ = 5, focus on exploration. (d) λ = 25, nearly pure exploration. Data corresponds to experiments and is adapted from ref. 26.
Despite the utility of proximity plots, a question arises in how to normalize ρ to make the data most meaningful. For instance, considering a normalized d-dimensional parameter space, the furthest two points can be from one another is √d, making Euclidean distance in a normalized space a very useful way of thinking about distances. That said, the relevance of distances depends on the topography of the response surface and not just the chosen bounds. One way to take this into consideration when using a GPR is to normalize each dimension by the length scale hyperparameter θj associated with that dimension. In this way, the magnitude of ρ is closely connected to covariance, with ρ ∼ 2 indicating that only 5% of the sample point's variance is captured by the neighbor. Unfortunately, this approach can be challenging as θ can change, and in particular shrink, as additional data becomes available that elucidates high-spatial-frequency variations. This makes it hard to track the performance of this plot over time. Interestingly, when using proximity plots normalized by GPR hyperparameters, the behavior of points with ρ < 2 illustrates how well the GPR is performing in the experimental region. If y is spread widely despite being considered statistically close by the GPR, it is a signal that the model is not performing well in the region of interest and the surrogate model (C) needs to be reevaluated.
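A sketch of this distance calculation, with and without normalization by the fitted length scales, is given below; the data and length scales are placeholders.

```python
import numpy as np

def proximity(X, x_star, length_scales=None):
    """Distance rho from each prior experiment to the newly selected point x_star.

    If `length_scales` (the fitted GPR theta_j) is given, each dimension is
    divided by its length scale so that rho ~ 2 marks points that the GPR
    treats as nearly uncorrelated with x_star.
    """
    diff = X - x_star
    if length_scales is not None:
        diff = diff / np.asarray(length_scales)
    return np.linalg.norm(diff, axis=1)

rng = np.random.default_rng(4)
X = rng.random((100, 3))
x_star = np.array([0.5, 0.5, 0.5])
rho = proximity(X, x_star, length_scales=[0.3, 0.1, 0.5])
# Plotting y vs. rho gives the proximity plot; the smallest rho indicates
# how exploitative the most recent selection was.
print(rho.min())
```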
The final plot is the performance gamut, or a plot of the experimental results in the space of output metrics (Fig. 6). If y has more than two components, unsupervised approaches such as principal component analysis or autoencoders may be useful for reducing the performance space to two variables. When optimizing a scalar metric, it is useful to introduce another value that provides a second dimension for visualization. Ideally, this second metric would be related to the first, but not directly connected. For example, in a campaign to optimize toughness, material stiffness could be a useful second property for visualization as it would help distinguish tough but compliant materials from brittle but stiff materials, even if they both had the same toughness.
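As an illustration of such a reduction, the sketch below projects a synthetic four-metric performance space onto two principal components using scikit-learn; the construction of the metrics is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical performance space with four correlated output metrics.
rng = np.random.default_rng(5)
latent = rng.random((300, 2))
Y = np.column_stack([latent[:, 0], latent[:, 0] + 0.1 * latent[:, 1],
                     latent[:, 1], latent.sum(axis=1)])

# Project the metrics onto two principal components for a 2D performance gamut.
Y2 = PCA(n_components=2).fit_transform(Y)
print(Y2.shape)  # (300, 2), ready to scatter-plot as the performance gamut
```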
Fig. 6 Performance gamut. (a) Experimental data (black dots) are represented by two dimensions: y1 and y2. (b) To find the maximum of the acquisition function, LHS is used to evaluate sampling points (orange) using the zoomed-in GPR model from Fig. 4c. (c) A second pass of LHS (dark orange) is done with the same GPR model but in a smaller region to more closely locate the maximum of the acquisition function. Data corresponds to experiments and is adapted from ref. 26.
In addition to plotting the actual values of y, it is useful to plot on the same axes the ỹ that were in consideration for the most recent experiment, as this can be useful for determining the number of sampling points used when finding the maximum of the acquisition function (D). The predicted points should be dense enough that they fill a reasonable region in the performance gamut, with predicted points that ideally extend above the highest observed experiments. If the observed prediction space appears too sparse, one solution is to increase the number of sampling points. However, this is not the only consideration when deciding the number of points to sample. In particular, the process of training a model and selecting sampling points can take non-trivial amounts of time. The easiest way to modulate this time is by adjusting the number of sampling points (E). Given that the tempo of an SDL campaign is naturally set by the pace of collecting experimental data, one approach is to adjust the number of sampling points until the time required to perform experiments is commensurate with the time to choose subsequent samples. Further considerations are necessary when experiments are performed in batch or using asynchronous agents, but the availability of supercomputing clusters means that running prediction models in parallel can increase the prediction rate.