Andrew Novick‡*a, Diana Cai‡b, Quan Nguyenc, Roman Garnettc, Ryan Adamsd and Eric Toberera
aDepartment of Physics, Colorado School of Mines, Golden, Colorado, USA. E-mail: novick@mines.edu
bCenter for Computational Mathematics, Flatiron Institute, New York, New York, USA
cDepartment of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri, USA
dDepartment of Computer Science, Princeton University, New Jersey, USA
First published on 5th August 2024
Active learning is a valuable tool for efficiently exploring complex spaces, finding a variety of uses in materials science. However, the determination of convex hulls for phase diagrams does not neatly fit into traditional active learning approaches due to their global nature. Specifically, the thermodynamic stability of a material is not simply a function of its own energy, but rather requires energetic information from all other competing compositions and phases. Here we present convex hull-aware active learning (CAL), a novel Bayesian algorithm that chooses experiments to minimize the uncertainty in the convex hull. CAL prioritizes compositions that are close to or on the hull, leaving significant uncertainty in other compositions that are quickly determined to be irrelevant to the convex hull. The convex hull can thus be predicted with significantly fewer observations than approaches that focus solely on energy. Intrinsic to this Bayesian approach is uncertainty quantification in both the convex hull and all subsequent predictions (e.g., stability and chemical potential). By providing increased search efficiency and uncertainty quantification, CAL can be readily incorporated into the emerging paradigm of uncertainty-based workflows for thermodynamic prediction.
New concepts: The dominant research paradigm in computational thermodynamics involves producing increasingly high-fidelity surrogate models such as machine-learned interatomic potentials or cluster expansions. Here, we take the complementary approach by minimizing the number of thermodynamic calculations necessary to evaluate stability and solubility. This acceleration relies on efficiently resolving the convex hull. With Gaussian processes, we propagate uncertainty in the energy surface of each phase to the resulting convex hull. As such, thermodynamic calculations are chosen that minimize the information entropy of the probabilistic convex hull. By applying a Bayesian approach, we make the uncertainty explicit in our hull predictions as well as any subsequent calculations derived from the hull. Such a framework can be complemented with Bayesian surrogate models, enabling end-to-end uncertainty quantification.
Phase transitions often occur across length- and time-scales too large to be directly observed using simulations. Instead, thermodynamic potentials need to be evaluated across a vast space of competing compositions and phases. The outcome of this competition is encapsulated in the convex hull: a single mathematical object that wraps the energy surface and defines the set of stable phase–composition pairs. Convex hulls are often associated with predicting the stability of compounds without external fields at 0 K,28–36 but they have also been used to calculate phase transitions induced by temperature,22 pressure,37–39 anisotropic stresses in thin films,40,41 magnetic fields,42,43 and applied voltages in battery materials.44,45 Indeed, the convex hull formalism can be used to predict stability under any set of thermodynamic conjugate variables.46,47 Beyond phase diagrams, convex hulls have been recently leveraged in understanding chemical reaction networks and synthesis pathways.48–51
The global nature of convex hulls implies that it is not obvious which composition–phase pairs will reside on the hull. For instance, it is possible for the exact value of the energy to be certain, while still being uncertain that the composition is on the hull. A brute force approach to predicting the convex hull would require calculating the energy for all competing phases and compositions. However, when the cost of individual energy evaluations is large, or the space of possible competing compositions is high-dimensional, exhaustively evaluating the energies is prohibitively expensive. Thus, there are two complementary modes of acceleration: efficiently producing surrogate models that lower the cost of energy calculations and minimizing the number of energy evaluations necessary to define the convex hull. Both approaches can leverage active learning,52 since it is a natural method for selecting expensive data points that are expected to maximally increase the information about a function.
To optimize the information gain about a surrogate energy function, active learning has been used to iteratively select first-principles calculations that minimize uncertainty in the surrogate model. Surrogate models like cluster expansion53 and interatomic potentials54 have been trained with active learning; they were then leveraged to conduct numerous energy evaluations for predicting the underlying convex hull. Active learning has also been biased to identify phase–composition pairs that are expected to be on or near the convex hull.55–57 While these approaches have been shown to be more efficient than random and grid-based search procedures, the active learning was only biased using proxies that incorporate a local view of the hull rather than directly reasoning about the entire convex hull as a singular, global object.
In this paper, we develop convex hull-aware active learning (CAL) to accelerate stability predictions. CAL distinguishes itself from more conventional Bayesian approaches by reasoning directly about the entire convex hull. CAL uses separate Gaussian process regressions to model the energy surfaces of phases across the composition space. From the Gaussian processes, a posterior belief is produced over possible convex hulls. This induced posterior enables the algorithm to identify composition–phase pairs that are expected to minimize the uncertainty in the convex hull itself, not the constituent energy surfaces. By focusing exclusively on the convex hull, it is possible to make more effective decisions on what compositions to consider.
We start by illustrating the CAL algorithm in one dimension for clarity. We show the evolution of the convex hull distribution with increasing observations, and derive both stability predictions and chemical potentials. From there, we explore complex ternary composition spaces with three competing phases. This allows us to quantitatively demonstrate the efficiency of CAL against a baseline active learning procedure and explore analysis techniques for probabilistic hulls. Finally, we demonstrate how CAL can be implemented when there is prior knowledge about line compounds, as is often the case.
We model the energy surface with a Gaussian process (GP), which provides a prior on energy surfaces specified by a mean and covariance function.58,59 Conditioning on the observations results in a posterior distribution over energy surfaces that is itself a Gaussian process (eqn (2) and (3)). Let F be the random function associated with the posterior on energy surfaces; then H = Conv(F) is the induced random (lower) convex hull, where Conv is the convex hull operator. The random function H is the object of primary interest in this work.
As we are only considering a finite set of candidate compositions, it is possible to generate samples from this induced posterior by (1) drawing a sample from the multivariate Gaussian distribution resulting from the GP posterior, and (2) using a standard algorithm such as QuickHull60 for computing the lower convex hull of a set of points. Fig. 1a shows a posterior distribution over the energy surface, F, and Fig. 1b depicts three posterior samples and their associated convex hulls.
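This two-step sampling procedure can be sketched concretely. The snippet below draws one energy surface from a GP posterior on a 1D composition grid and extracts its lower convex hull using scipy's QuickHull-based `ConvexHull`. The helper name `sample_lower_hull` and the interpolation of the hull back onto the full grid are our own illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import ConvexHull

def sample_lower_hull(x, mu, cov, rng):
    """Draw one energy surface from the GP posterior and return its lower hull.

    x: (c,) composition grid; mu: (c,) posterior mean; cov: (c, c) posterior covariance.
    Returns h with h[i] <= f[i]: the lower convex envelope of the sampled surface.
    """
    f = rng.multivariate_normal(mu, cov)      # step (1): one posterior energy sample
    pts = np.column_stack([x, f])
    hull = ConvexHull(pts)                    # step (2): full convex hull via QuickHull
    # keep only vertices on lower facets (outward normal points down in energy)
    lower = set()
    for simplex, eq in zip(hull.simplices, hull.equations):
        if eq[1] < 0:
            lower.update(simplex.tolist())
    v = np.array(sorted(lower, key=lambda i: x[i]))
    # linear interpolation between lower-hull vertices gives H at every grid point
    return np.interp(x, x[v], f[v])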
Our epistemic uncertainty about the true convex hull is captured by the random function H; the Shannon entropy S[H|𝒟] then quantifies our (lack of) knowledge about the convex hull. By framing our problem as one of minimizing S[H|𝒟], we can more rapidly gain information about the structure in which we are most interested.
In addition to the hull itself, various properties of interest can be derived from H, so we can reason about their posterior distributions as well. For example, the (random) set of stable compositions, those whose energies lie on the hull, inherits a posterior distribution from H.
Fig. 1c shows 20 samples of stable sets after the 3 iterations in Fig. 1b. These binary classifications can be averaged to estimate the marginal probability that any given composition is on the hull, i.e., is stable (Fig. 1d). Note that these marginal probabilities reveal an important way in which this problem is different from conventional Bayesian optimization and active learning tasks: the global nature of the convex hull means there is uncertainty about stability even for compositions at which the energy has been noiselessly observed. In this example, the observed compositions are marked with dashed vertical lines in Fig. 1d and there is uncertainty about the stability in two of the three cases.
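Averaging the binary stability classifications takes only a few lines; `stability_probability` is a hypothetical helper name, and the tolerance for an energy "touching" the hull is our own assumption.

```python
import numpy as np

def stability_probability(F, H, tol=1e-6):
    """Marginal probability that each composition lies on the hull.

    F, H: (n_samples, c) arrays of sampled energy surfaces and their lower hulls.
    A composition counts as stable in a sample when its energy touches the hull.
    """
    on_hull = (F - H) < tol        # boolean stability labels, one row per sample
    return on_hull.mean(axis=0)    # average over samples -> marginal probability
```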
Like many Bayesian optimization and search algorithms, the selection of x* requires approximating the expected information gain (EIG) across the space of possible designs, which in our case is the set of compositions.61,62 The EIG is simply the difference between the Shannon entropy of the current state (reflected in the observed data, 𝒟) and the expected Shannon entropy after making an observation at an unobserved composition x. Of course, the energy value y is unknown at this point, and so the new set of observations 𝒟 ∪ {(x,y)} is considered in expectation:

EIG(x) = S[H|𝒟] − E_{p(y|x,𝒟)}[S[H|𝒟 ∪ {(x,y)}]]   (1)
Finally, the expected information gain is used within each iteration to select x*, the candidate composition to be evaluated: x* = arg max_x EIG(x).
Fig. 2 illustrates how the EIG is evaluated in practice. In Fig. 2a, we start with a GP conditioned on some data, 𝒟. Energy surfaces are sampled from the resulting posterior distribution, convex hulls are calculated, and the Shannon entropy S[H|𝒟] is calculated, giving us the first term in eqn (1).
For a given candidate composition x, we sample from the conditional Gaussian process posterior at x to obtain a set of K possible energy values, denoted yk. In other words, these yk values correspond to different energies for composition x given the current uncertainty within our energy model. For each of these K samples, the entropy is estimated in three steps. (1) The Gaussian process is conditioned on this “fantasized” pair of observations (x,yk), and energy surfaces for all considered compositions are sampled from the resulting distribution. (2) For each of these sampled energy surfaces, a convex hull is computed. (3) The convex hull samples are used to estimate the Shannon entropy (eqn (4)), as detailed in the Methods. The expectation value of the Shannon entropy is then calculated by averaging the K entropy estimates (eqn (6)), resulting in an estimate of the second term in eqn (1), thereby completing our evaluation of the EIG.
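A minimal sketch of this fantasization step follows, assuming noiseless observations and a generic `entropy_fn` standing in for the hull-entropy estimate of eqn (6). The rank-one GP conditioning update used here is standard; the helper name and interface are our own.

```python
import numpy as np

def expected_entropy_after(x_idx, mu, cov, entropy_fn, K=10, rng=None):
    """Monte Carlo estimate of the expected hull entropy after observing
    composition x_idx (second term of eqn (1)).

    mu, cov: GP posterior mean/covariance on the composition grid.
    entropy_fn(mu, cov): placeholder returning the hull entropy for that posterior.
    """
    rng = rng or np.random.default_rng()
    s = cov[x_idx, x_idx]
    total = 0.0
    for _ in range(K):
        y = rng.normal(mu[x_idx], np.sqrt(s))       # fantasized energy at x
        gain = cov[:, x_idx] / s                    # noiseless GP update direction
        mu_k = mu + gain * (y - mu[x_idx])          # condition mean on (x, y_k)
        cov_k = cov - np.outer(gain, cov[x_idx, :]) # condition covariance on (x, y_k)
        total += entropy_fn(mu_k, cov_k)
    return total / K
```

After conditioning, the posterior variance at the fantasized composition collapses to zero, which is what drives the entropy reduction.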
We continue to illustrate this algorithm in panels Fig. 2b where three hypothetical energy values for composition x lead to three different hull distributions. For contrast, a different composition is selected for Fig. 2c, resulting in visibly greater variation in the hulls and thus a higher expected Shannon entropy. In Fig. 2d, the process is repeated across composition space to determine the composition with the maximum EIG (i.e., x*). (For panel b, the optimal value x* was intentionally selected to visually emphasize the impact that sampling at x* would have.) Finally, an observation is made at x* to update and the algorithm repeats to refine the convex hull. We reiterate that this approach seeks to minimize the Shannon entropy in the convex hull, not simply observe points that are on the hull. Here, observing the composition x* is advantageous because regardless of its energy, the resulting distribution in possible convex hulls narrows significantly.
Elemental chemical potentials are critical in predicting defect concentrations, as defect creation involves exchanges with element and charge reservoirs. For example, in LiZnSb, the limited chemical potential window of Li renders the compound significantly Li-deficient even in the presence of secondary phases with excess Li (e.g. Li3Sb).63 Chemical potentials of charged species can also be leveraged to produce intercalation voltage curves in battery materials,45 as was done for LixCoO2.44 Lastly, pressure is an intensive variable that can be determined from the convex hull of an energy surface that is a function of volume.37–39 For example, the impact of volumetric confinement on the freezing point of water can be readily determined from the hull.64
As previously mentioned, CAL acquires observations that minimize the uncertainty in the convex hull distribution. The behavior of the algorithm can be characterized by two steps. In the first few iterations when there is large uncertainty, Fig. 4b shows that the algorithm tends to explore the energy surface, producing a coarse estimate for the convex hull. As the estimate of the convex hull develops, the algorithm focuses its next iterations increasingly on regions that are purportedly on the hull or close to it. These subtle refinements to the convex hull distribution are reflected in Fig. 4c, where the convex hull samples converge.
In low dimensions, producing an accurate hull can be achieved via brute force. However, the necessity for efficient hull construction emerges in spaces that involve multiple competing phases and large composition spaces. To test the efficiency of CAL in such a space, we pit it against two opponents: a baseline algorithm (BASE), and farthest point sampling (FPS). BASE still models the energy surfaces using a Gaussian process. However, BASE seeks to minimize the uncertainty in the energy surfaces and has no knowledge of convex hulls. See the Methods for further information about the BASE policy. FPS does not leverage a Gaussian process – indeed, it is not aware of any energetic outcome. Rather, FPS simply chooses the composition that is farthest away from all observed compositions.
Across all three metrics shown in Fig. 5, CAL outperforms BASE and FPS. For CAL, the mean absolute error (MAE) is nearly zero by 50 iterations. Similar convergence is found for the true positive and false positive rates. Together, these metrics indicate that by 50 iterations (i.e., 25% of the search space), CAL is able to predict the energy of the convex hull as well as classify which compositions are on and off the hull. BASE, however, takes significantly longer to come to these conclusions. Considering that there are only 198 phase–composition pairs in this space, BASE requires observing nearly 100 phase–composition pairs to understand the convex hull. Not only does BASE finish far slower, but its rate of learning is consistently lower through the search process, as shown by its smaller slopes in Fig. 5a–c. From the width of the shaded regions, we conclude that BASE is much more variable than CAL. FPS learns far slower than both CAL and BASE, and it is also more variable, underscoring the importance of using Gaussian processes to determine the hull.
Fig. 6 shows a representative example from Fig. 5 to understand the root of how CAL so efficiently and consistently reveals the hull. The true energetic landscape is shown in panel (a) with energy surfaces corresponding to the three distinct phases. A slice through these energy surfaces is shown in (e); here, we show the slice from B to the intermediate composition AC. Additionally, a slice of the true convex hull is included below in grey. In panel (i), the complete convex hull is projected onto two dimensions as a ternary phase diagram. The distribution of energies relative to the hull is included in Section S1 of the ESI.† The three energy surfaces are similar in energy, resulting in a fairly complex phase diagram. As such, this is a challenging task for hull determination.
Fig. 6 The evolution of the CAL performance is shown quantitatively in Fig. 5; further insight can be gained by visualizing the evolution of the GP and the associated hull for a single set of energy surfaces. To investigate how CAL performs with three phases spanning a ternary composition space (continuing Fig. 5), a single example is considered with increasing observations. (a) Each phase has an energy surface that spans the composition space. (e) A slice of the ternary space from B to AC shows the energies of these competing phases and a corresponding slice of the convex hull. (i) The full convex hull is represented as a ternary phase diagram. (b) After 10 iterations of CAL, the three Gaussian processes are illustrated by plotting their means and coloring the surfaces with their associated uncertainties. (c) and (d) With increasing iteration, CAL prioritizes learning about phase–composition pairs that are relevant to the convex hull, resulting in regions transitioning from high (orange) to low (purple) uncertainty. (f)–(h) A similar progression can be seen in the slice from B to AC. Ultimately, we are interested in predictions of the hull and the associated phase diagram. (j) Before making any observations (iteration 0), the uncertainty in the convex hull distribution is represented by overlaying 100 convex hull samples on a ternary phase diagram. (k) and (l) With increasing iteration, the distribution tightens and converges around the true convex hull.
We model the three energy surfaces using separate Gaussian processes and conduct a total of 50 observations within this system. In panels (b)–(d), we show the mean of each GP and color the three surfaces by their standard deviation. In (b), before any observations, all energy surfaces have significant uncertainty and are thus orange. With increasing iteration, both the mean energies evolve and the uncertainties decrease for select composition regions; it will be made clear that these regions are targeted by CAL for their relevance to the convex hull. The evolution of energetic uncertainties can be clearly seen in the B–AC slice. Composition–phase pairs near the hull show evidence of significant observation and an associated reduction in uncertainty. It is important to note that only observing the lowest energy phase would not have been an optimal solution – different phases affect the hull in different regions.
In panels (j)–(l), 100 hulls are projected and overlaid onto the ternary phase diagram. As expected, no coherent expectation for the hull is present initially. By 30 iterations, most of the single-phase regions have been identified, but there is still significant uncertainty. As such, some unstable compositions are classified as having a non-zero probability of being on the hull, resulting in a smearing out of the ternary phase diagram. Finally, after 50 iterations, much of the lingering uncertainty has dissipated and the convex hull is well understood.
We also explain how our method may play a role in a broader uncertainty-based thermodynamic workflow. First, we discuss the importance of uncertainty quantification; then we consider how CAL may interact with sources of uncertainty that precede it in a workflow. Finally, we discuss how the uncertainty in CAL predictions is propagated forward to other thermodynamic predictions.
Joint Gaussian processes are well-suited for incorporating compositional correlations into the energy model.65,66 In a joint Gaussian process, the energy surface of each phase would be modeled simultaneously; the inputs for such a model would be observations across all phases, and the outputs would be the energy surfaces for each phase. Incorporating joint GPs into CAL would leave the acquisition function unchanged.
The policy for determining the next optimal observation would need to be extended in order to account for temperature as an added dimension in the design space. The added complexity derives from the free energy convex hull only being defined over composition space at a single temperature. As such, the total expected information gain for a single phase–composition–temperature triplet would need to be assessed as a sum over the expected information gains across temperatures of interest. In practice, the temperature range would need to be discretized to make evaluating the total information gain feasible.
A special case of temperature-dependent search involves thermodynamic methods where calculating the enthalpy of formation is the computationally limiting factor and the entropy can be approximated analytically.67–69 As such, with these methods the free energy can be predicted at multiple temperatures with no additional cost. The ramifications of this set of observations would need to be incorporated into the acquisition function.
A more important consideration is the number of candidate compositions: sampling from a GP scales cubically with the size of the composition grid. In Section S3 of the ESI,† we demonstrate this cubic scaling empirically, which highlights the potential for efficiency gains through adaptive composition selection.
In truly large spaces, one may want to prioritize composition sub-regions. The acquisition function can be readily altered to exclusively focus on such regions. Here, the expected information gain would only reflect minimizing the uncertainty for the convex hull in those prioritized regions. The resulting efficiency gain will be dependent on how many different multi-phase regions enclose the specified compositions.
Other approaches center around decreasing the cost of the EIG. For instance, the EIG could be calculated with fewer convex hull samples. Another approach would employ BASE in the beginning of the search and CAL only after some number of iterations. Since CAL is more expensive, it would be reserved for later in the search when there is sufficient information about the hull such that the CAL policy results in significantly different decisions from BASE. Finally, one could approximate the joint entropy as a sum of the entropies across individual compositions. This is a strong approximation for the entropy and should be taken with caution since it assumes convex hulls have no correlations between compositions. All these shortcuts add parameters requiring tuning to negotiate between speed and quality.
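The last of these shortcuts, treating compositions as independent, can be sketched in a few lines. For a Gaussian approximation, this sum of marginal entropies upper-bounds the joint entropy (a consequence of Hadamard's inequality), which is one reason it should be used with caution. The function name is illustrative.

```python
import numpy as np

def hull_entropy_independent(hull_samples):
    """Mean-field shortcut: approximate the joint hull entropy as a sum of
    per-composition Gaussian entropies, ignoring correlations between
    compositions (upper-bounds the joint Gaussian entropy).

    hull_samples: (m, c) array of sampled lower-hull values on the grid.
    """
    var = hull_samples.var(axis=0, ddof=1)              # marginal variances
    return float(np.sum(0.5 * np.log(2 * np.pi * np.e * var)))
```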
In an uncertainty-based thermodynamic workflow, CAL could be useful in iteratively training surrogate models with energetic uncertainties like the Bayesian approach to cluster expansion.53,71–74 Here, completing the necessary first-principles calculations to train such models is the limiting factor. Such training would be focused on minimizing the uncertainty in the convex hull rather than predicting energies.
Specifically, instead of the GP used in our work, the surrogate model would be leveraged to produce uncertainty in the convex hull distribution before and after a potential observation. The simplicity of such an inexpensive surrogate makes it computationally feasible to retrain numerous times, which is necessary for choosing the optimal observation. Once an optimal composition is identified by CAL, its energy would be calculated using first-principles, and the result would be included in the training set for the surrogate model.
Due to the low computational cost of calculating compound energies, we suggest that these observations are conducted before using CAL. Subsequently, CAL would be employed to suggest calculations for highly expensive free energy evaluations, like those required for alloys. Such an example is illustrated in Fig. 7, where at iteration 0 (before CAL has been used) we start knowing the energies of the parent compounds and two line compounds with compositions x = 0.2 and x = 0.5 (red points). The compounds do not play a role in training the Gaussian processes. However, they are important for determining the convex hull distribution. Here, each of the three GPs is sampled and the convex hull is calculated for those three surfaces and the compounds. As before, CAL makes observations to minimize the uncertainty in the convex-hull distribution. Due to the low-energy compound at x = 0.2, the first three observations are heavily biased to the right (i.e., B-rich compositions) since that composition region is the only portion with remaining hull uncertainty. Once that region is pinned down, CAL makes additional observations towards the left to remove any remaining uncertainty in the hull. After 6 iterations, the convex hull distribution is fully resolved, as made clear by the almost perfect overlap of the convex hull samples. Finally, it is worth noting that since the orange phase was too high in energy to affect the convex hull, no observations were made for that phase.
In the above example, we treat the compound energetics as having no uncertainty. However, there can be uncertainty in compound energetics; these uncertainties can be propagated to the convex hull distribution as well, allowing for an active learning process. Uncertainty in compound energies is valuable when dealing with particularly expensive systems where lower-fidelity models are used and the uncertainty in those low-fidelity models can be approximated.
F(x) ∼ GP(m(x), k(x,x′)),
The convex hull operator Conv takes an energy function F and returns its lower convex envelope H = Conv(F). Thus, the GP prior on the energy function F implies a prior on its convex hull H.
Given N observations 𝒟 = {(x_n, y_n)}_{n=1}^{N}, the posterior of the energy function p(F|𝒟) is also a Gaussian process. Define the vector of energies Y = [y_1,…,y_N]^T and the matrix X whose rows consist of the elements {x_n}_{n=1}^{N}. The posterior of the energy function is p(F|𝒟) = GP(μ(x), k̄(x,x′)) with mean and covariance function of the form

μ(x) = m(x) + k(x, X)[k(X, X)]^{−1}(Y − m(X))   (2)

k̄(x, x′) = k(x, x′) − k(x, X)[k(X, X)]^{−1}k(X, x′)   (3)

In practice, we represent F (and H) using a dense grid of c candidate compositions. In this case, the posterior of the energy values on this grid becomes a multivariate Gaussian distribution with a mean and covariance matrix arising from μ (eqn (2)) and k̄ (eqn (3)) evaluated at those points.
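Eqn (2) and (3) can be sketched on a dense grid as follows, assuming a zero mean function and an illustrative squared-exponential kernel (the paper's kernel choice and hyperparameters are not restated here); the tiny jitter on the diagonal is our own numerical safeguard.

```python
import numpy as np

def rbf(a, b, ell=0.3):
    """Illustrative squared-exponential kernel on 1D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, Y, Xs, kernel=rbf, jitter=1e-10):
    """GP posterior mean/covariance on grid Xs given noiseless data (X, Y).

    Implements eqn (2) and (3) with m(x) = 0.
    """
    Kxx = kernel(X, X) + jitter * np.eye(len(X))        # jitter for stability
    Ksx = kernel(Xs, X)
    mu = Ksx @ np.linalg.solve(Kxx, Y)                  # eqn (2)
    cov = kernel(Xs, Xs) - Ksx @ np.linalg.solve(Kxx, Ksx.T)  # eqn (3)
    return mu, cov
```

At the observed compositions the posterior mean reproduces the data and the posterior variance collapses, as expected for noiseless conditioning.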
The posterior over the energy surface F induces a posterior over the convex hull function p(H|𝒟). To generate a random function from this posterior, i.e., H ∼ p(H|𝒟), we first sample F from p(F|𝒟) and then construct its convex hull, i.e., H = Conv(F).
Recall that to compute the EIG for the random hull H ∼ p(H|𝒟) (eqn (1)), we need to compute the entropy S[H|𝒟] and the expected entropy E_{p(y|x,𝒟)}[S[H|𝒟 ∪ {(x,y)}]], where y is the (unobserved) energy of a new candidate composition x. The entropy S[H|𝒟] is defined as

S[H|𝒟] = −E_{p(H|𝒟)}[log p(H|𝒟)]   (4)
A key challenge is to estimate the entropy since it is not available analytically. In particular, in large and high-dimensional composition spaces, the expectation in eqn (4) involves a high-dimensional integral and a high-dimensional log hull density, both of which are challenging to estimate accurately and efficiently using numerical methods.
To address this computational issue, we approximate the hull distribution in a way that allows us to compute eqn (4) in closed form. We assume that random values of the convex hull (evaluated on a dense grid of c elements) follow a multivariate Gaussian distribution with covariance Σ. It is common to use Gaussian approximations to approximate challenging posterior distributions (e.g., Laplace approximation,76 variational inference77–79), and they can be useful even if the posterior is non-Gaussian. Ultimately, the entropy is used in ranking potential observations; exact calculations of the entropy are neither feasible nor necessary.
For the entropy of a multivariate Gaussian, the only unknown value that needs to be computed is the covariance matrix, which can be estimated empirically from m convex hull samples H_j ∼ p(H|𝒟). Here, each H_j is a vector of length c, and we use those vectors to construct the covariance matrix:

Σ = (1/(m − 1)) ∑_{j=1}^{m} (H_j − H̄)(H_j − H̄)^T, where H̄ = (1/m) ∑_{j=1}^{m} H_j   (5)
The entropy of the multivariate Gaussian, which only depends on the covariance Σ, can be computed in closed form:
S[H|𝒟] ≈ (1/2) ln[(2πe)^c det Σ]   (6)
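Eqn (5) and (6) combine into a short estimator; the jitter term is our own safeguard against near-singular empirical covariances and is not part of the formulas above.

```python
import numpy as np

def hull_entropy(hull_samples, jitter=1e-8):
    """Entropy of the Gaussian approximation to the hull distribution (eqn (6)).

    hull_samples: (m, c) array of sampled lower-hull values on the grid.
    """
    m, c = hull_samples.shape
    cov = np.cov(hull_samples, rowvar=False)     # empirical covariance, eqn (5)
    cov = cov + jitter * np.eye(c)               # regularize near-singular cases
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (c * (1.0 + np.log(2 * np.pi)) + logdet)
```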
For the expected entropy E_{p(y|x,𝒟)}[S[H|𝒟 ∪ {(x,y)}]], we compute a Monte Carlo estimate of the expectation:

E_{p(y|x,𝒟)}[S[H|𝒟 ∪ {(x,y)}]] ≈ (1/K) ∑_{k=1}^{K} S[H|𝒟 ∪ {(x, y_k)}], where y_k ∼ p(y|x,𝒟)   (7)
All observations had no noise associated with them, although observational noise can readily be incorporated. Shaded regions in the GP plots show two standard deviations from the mean prediction. Convex hulls were generated using the qhull algorithm60 within the scipy library.82 Custom code was built to isolate the lower bound of the hull, which is the portion of interest for thermodynamics.
Here we discuss the specific sampling parameters used in the work. In the 1D search evolution shown in Fig. 4, there were 21 compositions in the space. For the ternary search in Fig. 5 and 6, there were 66 total compositions. For both the 1D and 2D search, 200 energy and convex hull samples were used for each entropy calculation, and 10 possible y-values were used to build the expected information gain (i.e., m = 200, K = 10). We find that performance varies slightly with the choice of m and K, but not significantly (see Section S2 in the ESI†). As a general rule, K can be fairly low (i.e., 10) since K is being used to approximate a one-dimensional integral. However, m needs to be larger to empirically compute the covariance matrix of a high-dimensional Gaussian (where c is the number of dimensions). Here we set m > 3c for all systems. Users are encouraged to conduct their own convergence testing.
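These rules of thumb can be wrapped in a small helper; the floor of 200 samples and the 3c + 1 threshold are our own illustrative readings of the text (which used m = 200, K = 10, and m > 3c), not prescribed values.

```python
def sampling_params(c, m_floor=200, K=10):
    """Heuristic sampling parameters for CAL on a grid of c compositions.

    K ~ 10 fantasized energies suffice for the one-dimensional expectation
    over y (eqn (7)); the number of hull samples m should exceed ~3c to
    estimate a c-dimensional covariance empirically (eqn (5)).
    """
    m = max(3 * c + 1, m_floor)
    return m, K
```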
The baseline active learning algorithm, which also uses a GP model for the energy surface, selected compositions to maximize the information gained about the energy surface. Specifically, BASE maximized the EIG with respect to the energy function (EIG-B): EIG-B(x) = S[F|𝒟] − E_{p(y|x,𝒟)}[S[F|𝒟 ∪ {(x,y)}]].
When multiple phases were present, BASE chose the composition–phase pair that maximized the EIG. In Fig. 5, the policy resulted in BASE alternating evenly between phases. BASE used the same GP hyperparameters as CAL to control for hyperparameter tuning.
The performance of each policy was assessed using the mean absolute error (MAE) for the energy of the convex hull, the true positive rate (TPR), and false positive rate (FPR). The MAE here is defined by:
MAE = (1/c) ∑_{i=1}^{c} |Ĥ(x_i) − H_true(x_i)|   (8)

where Ĥ is a sampled hull, H_true is the true hull, and the sum runs over the c grid compositions; the MAE is computed for each hull sample and averaged across samples.
The true positive rate is the percentage of the points that are on the hull that are correctly identified:
TPR = TP/(TP + FN)   (9)

where TP is the number of true positives and FN the number of false negatives.
The FPR refers to the percentage of the points that are off the hull that were incorrectly identified:
FPR = FP/(FP + TN)   (10)
FP is the number of false positives, which is the number of compositions that were incorrectly identified as being on the convex hull. TN stands for true negative, and is the number of compositions that were correctly identified as being off the hull. The FPR is also calculated for each hull sample and then averaged across all samples.
For both CAL and BASE, 200 hulls were used to evaluate the MAE, TPR, and FPR for a given iteration. A composition was defined as being on the hull if its energy was within 10^−3 of the energy of the hull.
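The three metrics might be computed from hull samples as sketched below, assuming the same on-hull tolerance; because every sample shares the same denominators, averaging the boolean matrix directly equals averaging per-sample rates. The function name and interface are illustrative.

```python
import numpy as np

def hull_metrics(H_samples, F_samples, h_true, f_true, tol=1e-3):
    """MAE, TPR, and FPR of sampled hulls against the true hull (eqns (8)-(10)).

    H_samples, F_samples: (m, c) sampled hulls and energy surfaces;
    h_true, f_true: (c,) ground-truth hull and energies.
    """
    true_on = (f_true - h_true) < tol          # ground-truth stability labels
    pred_on = (F_samples - H_samples) < tol    # per-sample predicted labels
    mae = np.mean(np.abs(H_samples - h_true))  # averaged over grid and samples
    tpr = pred_on[:, true_on].mean()           # fraction of true positives found
    fpr = pred_on[:, ~true_on].mean()          # fraction of negatives misclassified
    return mae, tpr, fpr
```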
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4mh00432a
‡ These authors contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2024 |