Bryce Meredig,*a Erin Antono,a Carena Church,a Maxwell Hutchinson,a Julia Ling,a Sean Paradiso,a Ben Blaiszik,bc Ian Foster,bc Brenna Gibbons,d Jason Hattrick-Simpers,e Apurva Mehta f and Logan Ward bc

a Citrine Informatics, USA. E-mail: bryce@citrine.io
b University of Chicago, USA
c Argonne National Laboratory, USA
d Stanford University, USA
e National Institute of Standards and Technology, USA
f SLAC National Accelerator Laboratory, USA
First published on 17th August 2018
Traditional machine learning (ML) metrics overestimate model performance for materials discovery. We introduce (1) leave-one-cluster-out cross-validation (LOCO CV) and (2) a simple nearest-neighbor benchmark to show that model performance in discovery applications strongly depends on the problem, data sampling, and extrapolation. Our results suggest that ML-guided iterative experimentation may outperform standard high-throughput screening for discovering breakthrough materials such as high-Tc superconductors.
Design, System, Application

Machine learning (ML) has become a widely adopted predictive tool for materials design and discovery. Random k-fold cross-validation (CV), the traditional gold-standard approach for evaluating the quality of ML models, is fundamentally mismatched to the nature of materials discovery, and leads to an overly optimistic measure of ML model performance for many discovery problems. To address this challenge, we describe two techniques for contextualizing ML model performance for materials discovery: leave-one-cluster-out (LOCO) CV, and a naïve first-nearest-neighbor baseline model. These tools provide a more comprehensive and realistic picture of ML model performance in materials discovery applications.
Ideally, the result of training such models would be the experimental realization of new materials with promising properties. The materials informatics (MI) community has produced several such success stories, including thermoelectric compounds,10,11 shape-memory alloys,12 superalloys,13 and 3D-printable high-strength aluminum alloys.14 However, in many cases, a model is itself the output of a study, and the question becomes: to what extent could that model be used to drive materials discovery?
Typically, the performance of ML models of materials properties is quantified via cross-validation (CV). CV can be performed either as a single division of the available data into a training set (to build the model) and a test set (to evaluate its performance), or as an ensemble process known as k-fold CV, wherein the data are partitioned into k non-overlapping subsets of nearly equal size (folds) and model performance is averaged across each combination of k-1 training folds and one test fold. Leave-one-out cross-validation (LOOCV) is the limit where k equals the total number of examples in the dataset. Table 1 summarizes model performance statistics as reported in the aforementioned studies (some studies tested multiple algorithms across multiple properties); a minimal code example of k-fold CV follows the table.
| Material class | Property | ML technique | CV type | Model performance metric | Ref. |
|---|---|---|---|---|---|
| Steel | Fatigue strength | Multivariate polynomial regression | Leave-one-out CV | R² = 0.9801 | 3 |
| Organic small molecules | Norm of dipole moment | Graph convolutions | Overall 90% train/10% test, with reported test error averaged across 10 different models built on subsets of training data | MAE = 0.101 Debye (chemical accuracy target: 0.10 Debye) | 4 |
| Polymers | Electronic dielectric constant | Kernel ridge regression | 81% train/19% test | R² = 0.96 | 16 |
| Inorganic compounds | Formation energy | Rotation forest | 32% train/68% test | R² = 0.93 | 5 |
| Inorganic compounds | Vibrational free energy | Random forest or support vector machine | 10 averaged k-fold CV runs, for k in [5, 14] | R = 0.95 | 6 |
| Inorganic compounds | Band gap | Support vector machine | 100 averaged 75% train/25% test runs | G₀W₀ RMSE = 0.18 eV (DFT RMSE ~2 eV w.r.t. experiment) | 7 |
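As a concrete reference for the k-fold procedure described above, the following is a minimal sketch using scikit-learn; the synthetic dataset, random forest estimator, and choice of k = 5 are illustrative assumptions, not the settings of any study in Table 1.

```python
# Minimal k-fold CV sketch (illustrative assumptions: synthetic regression data,
# a random forest estimator, and k = 5; these match no specific Table 1 study).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)   # k non-overlapping folds
scores = cross_val_score(RandomForestRegressor(n_estimators=100), X, y,
                         cv=folds, scoring="r2")          # train on k-1 folds, test on 1
print(f"mean R^2 across folds: {scores.mean():.3f}")
```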
In Table 1, the reported model performance is uniformly excellent across all studies. A tempting conclusion is that any of these models could be used for one-shot high-throughput screening of large numbers of materials for desired properties. However, as we discuss below, traditional CV has critical shortcomings in terms of quantifying ML model performance for materials discovery.
To illustrate the issue of extrapolation, we draw a comparison to a different ML task: Netflix's prediction of a user's taste in movies.17 Netflix would rarely encounter the challenge of a user with entirely idiosyncratic movie preferences. Indeed, such “outlier users” might even be deliberately discarded as hindrances to making accurate predictions for the bulk of more ordinary users (Netflix's objective). Most users are similar to one or more others, which is precisely why collaborative filtering works well on such recommendation problems.18 In materials informatics, by contrast, we often desire to use ML models to find entirely new classes of materials, with heretofore-unseen combinations of properties (i.e., potential outliers).
The centrality of extrapolation in materials discovery implies that the relative distributions of training and test data should strongly influence ML model performance for this task. In particular, few real-world materials datasets are uniformly or randomly sampled within their domains. On the contrary, researchers often perform extrapolative regression (rather than the pattern-matching task of classification) on datasets that contain many derivatives of a few parent materials (e.g., doped compounds). In these cases, if a single derivative compound exists in our training set, it serves as an effective "lookup table" for predicting the performance of all of its nearby relatives. A prime example is predicting Tc for cuprate superconductors. Our goal should be to evaluate the ability of an ML model to predict cuprates with no information about cuprates. However, when using traditional CV, a single cuprate in the training set gives us an excellent estimate for the Tc values of all other cuprates and, thus, artificially inflated model performance metrics. We illustrate this "lookup table" problem for superconductors specifically in Fig. 1.
Fig. 1 gives two-dimensional t-distributed stochastic neighbor embedding (t-SNE)19 visualizations of the superconductor benchmark dataset, showing the effects of traditional and LOCO CV on predicting Tc. In this benchmark, a machine learning model is trained to predict the critical temperature as a function of chemical formula. Each chemical formula is featurized using Magpie20 descriptors and other analytical features computed from the elemental composition. In the t-SNE map, the superconductors cluster into well-known families, such as the cuprates and the iron arsenides. Such clustering is common in materials datasets and motivates LOCO CV. In Fig. 1a, which illustrates a typical 5-fold CV split, each magenta test point is very near (or virtually overlapping) a training point in chemical space. The result, in Fig. 1b, is low Tc prediction errors across all families of superconductors. Fig. 1c and d show how LOCO CV partitions the YNi2B2C cluster into an isolated test set. The LOCO CV procedure, when repeated across all hold-out clusters, leads to much higher (and, we would argue, more realistic for materials discovery) errors in predicted Tc values for materials in the held-out cluster, as indicated in Fig. 1e.
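A minimal sketch of a Fig. 1-style visualization follows; the random arrays stand in for a real featurized dataset (e.g., Magpie composition descriptors and measured Tc values) and are purely illustrative.

```python
# Sketch of a Fig. 1-style map: 2-D t-SNE embedding of composition features,
# colored by Tc. The random arrays below stand in for real featurized data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.random.randn(300, 40)    # stand-in for Magpie composition features
Tc = 100 * np.random.rand(300)  # stand-in critical temperatures (K)

emb = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=Tc, s=8)
plt.colorbar(label="Tc (K)")
plt.show()
```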
To systematically explore the effects of non-uniform training data, we propose LOCO CV, a cluster-based (i.e., similarity-driven) approach to separating datasets into training and test splits. We outline LOCO CV as follows:
LOCO CV Algorithm.
• Perform standard normalization of input features.
• For n total CV runs:
○ Shuffle data to reduce sensitivity to k-means centroid initialization.
○ Run k-means clustering, with k from 2 to 10. These bounds span the minimum meaningful number of clusters (k = 2) up to the most common choice of 10 folds in traditional CV.
○ For each of k clusters:
■ Leave selected cluster out; train on remainder of data (k-1 clusters).
■ Predict hold-out cluster.
• Summarize results by computing the median and standard deviation of model performance across all values of k.
○ Alternative: use X-means23 or G-means24 clustering, or a silhouette factor threshold,25 to select a single nominal value of k.
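The following is a minimal Python sketch of one LOCO CV pass for a single value of k, assuming numpy arrays X (features) and y (targets); it is a plausible rendering of the procedure above, not the authors' exact implementation.

```python
# Minimal LOCO CV sketch for a single value of k (assumes numpy arrays X, y;
# not the authors' exact implementation). Scores each held-out cluster with
# the Pearson correlation R between predicted and actual values.
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

def loco_cv(X, y, model, k, seed=0):
    X = StandardScaler().fit_transform(X)        # standard-normalize features
    idx = np.random.RandomState(seed).permutation(len(X))
    X, y = X[idx], y[idx]                        # shuffle before clustering
    labels = KMeans(n_clusters=k, random_state=seed).fit_predict(X)
    scores = []
    for cluster in range(k):
        test = labels == cluster                 # leave this cluster out
        model.fit(X[~test], y[~test])            # train on the other k-1 clusters
        scores.append(pearsonr(y[test], model.predict(X[test]))[0])
    return scores

# Summarize across k = 2..10, as in the procedure above:
# medians = [np.median(loco_cv(X, y, RandomForestRegressor(n_estimators=100), k))
#            for k in range(2, 11)]
```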
To illustrate the sharp contrast between ML results under LOCO CV and traditional CV, we compare the prediction distributions obtained from these two procedures for yttrium barium copper oxide (YBCO) in Fig. 2. The traditional CV results seem to suggest that the underlying model is indeed capable of discovering new compounds like YBCO, with exceptional Tc values, in a single high-throughput screening step. Specifically, when YBCO is held out of the training set in traditional CV, the model still predicts high Tc values for it. One might then conclude that novel materials discovery would simply require running the model against a large database of candidate compounds and ranking them by predicted Tc. However, Fig. 2 suggests that traditional CV uses other high-Tc cuprates to trivially estimate a reasonable (i.e., very high) Tc value for YBCO, while LOCO CV has no high-Tc cuprates to train on (indicated by the difference curve in Fig. 2 in the Tc > 80 K regime).
The surprising LOCO prediction, made by a model with no cuprate training data due to cluster-based train-test splitting, is that YBCO is likely to be a below-average superconductor. Nonetheless, Ling et al. show26 that ML with uncertainty quantification (UQ) can efficiently identify the highest-Tc superconductor in a database (i.e., breakthrough materials like YBCO) when used to iteratively guide experimentation in a sequential learning framework. Sequential learning (also known as active learning, on-the-fly learning, or adaptive design) is rapidly garnering interest as a driver for rational solid-state materials discovery,12,27–29 and has been applied successfully to organic molecules as well.30 For superconductors specifically, Ling et al. demonstrate that, starting from a very small, randomly selected training set, an ML model that selects new "experiments" based on a criterion of maximum uncertainty in Tc will uncover the cuprates in consistently fewer experiments than an unguided search through the same list of superconductors.26 UQ enables ML to systematically uncover promising compounds, one experiment (or batch) at a time, even when those compounds may have, e.g., a low predicted Tc in the initial screen. Thus, the use of UQ on top of ML models is crucial to evaluating candidates in new regions of design space. The ramifications of this observation deserve special emphasis: we suggest that ML models (and, indeed, possibly other types of models in materials science) are more useful as guides for an iterative sequence of experiments than as single-shot screening tools that can reliably evaluate an entire search space once and short-list high-performing materials. Laboratory discoveries reported in Xue et al.12 and Ren et al.31 reinforce the efficacy of such an iterative, data-driven approach.
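As a hedged illustration of one sequential-learning step, the sketch below uses the spread of per-tree predictions from a random forest as a simple uncertainty estimate; this is a common heuristic, not necessarily the UQ method of Ling et al.26

```python
# One sequential-learning step (a sketch, not the UQ method of Ling et al.):
# train on the data measured so far, then select the candidate material whose
# ensemble prediction is most uncertain (largest spread across the trees).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def next_experiment(X_train, y_train, X_candidates):
    model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
    per_tree = np.stack([tree.predict(X_candidates) for tree in model.estimators_])
    return int(np.argmax(per_tree.std(axis=0)))  # maximum-uncertainty criterion
```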
The synthetic benchmark is based on the function

f(x₀, x₁, x₂, x₃, x₄, x₅) = x₀·x₁ + x₂·x₃ − x₄·x₅,

with inputs drawn from three Gaussian clusters of varying standard deviation (see Table 2 and Fig. 3a–c).
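A sketch of how such a clustered synthetic dataset could be generated follows; the three centroids and the per-cluster standard deviations of 100, 10, and 1 follow Table 2, while the centroid scale of ±100 is our assumption.

```python
# Sketch of the synthetic benchmark: evaluate f on inputs drawn from three
# Gaussian clusters. The three centroids and the stdev values (100, 10, 1)
# follow Table 2; the +/-100 centroid scale is an assumption.
import numpy as np

def f(x):
    return x[:, 0]*x[:, 1] + x[:, 2]*x[:, 3] - x[:, 4]*x[:, 5]

def make_clustered_data(n_per_cluster=200, stdev=10.0, seed=0):
    rng = np.random.RandomState(seed)
    centroids = rng.uniform(-100, 100, size=(3, 6))   # three centroids in 6-D
    X = np.vstack([c + stdev*rng.randn(n_per_cluster, 6) for c in centroids])
    return X, f(X)
```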
We present ML results on this synthetic benchmark, as well as on superconductor, steel fatigue strength, and thermoelectric benchmark datasets,26 in Fig. 3. Using implementations in the scikit-learn32 Python package, we compare three types of ML models (a sketch of these models as scikit-learn estimators follows Table 2). First, random forest33 ("RF"; 100 estimators, full-depth trees) is an ensemble method whose predictions aggregate the outputs of a large number of simple decision tree models (the trees comprising the "forest"). Individually, decision trees are weak learners that capture basic rules such as "large oxygen mole fraction → electrical insulator"; when trained on different subsets of the data and subsequently ensembled, they can model much more complex relationships. Second, linear ridge regression34 ("ridge"; generalized CV was used to select the regularization parameter α from {10⁻², 10⁻¹, 10⁰, 10¹, 10²}) extends the ordinary-least-squares (OLS) objective function of traditional linear regression with an L2 regularization term, whose strength is set by α, to penalize nonzero regression coefficients. Such regularization helps prevent overfitting, especially when collinear descriptors are present. Third, we include a naïve nearest-neighbor (1NN) "lookup table" model, which generates predictions by simply returning the training value nearest in Euclidean distance to the requested prediction point; by definition, it is incapable of any extrapolation. In Fig. 3, these three models are compared under traditional CV and LOCO CV; within LOCO CV, we use the scikit-learn32 implementation of k-means clustering. While the full CV curves contain valuable information, we also summarize Fig. 3 more compactly in Table 2.
| Benchmark problem | LOCO RF median R (stdev) | LOCO 1NN median R (stdev) | LOCO ridge median R (stdev) | Traditional CV RF median R (stdev) | Traditional CV 1NN median R (stdev) | Traditional CV ridge median R (stdev) |
|---|---|---|---|---|---|---|
| Synthetic, stdev = 100 | 0.50 (0.11) | 0.64 (0.05) | −0.52 (0.07) | 0.82 (0.04) | 0.80 (0.02) | 0.00 (0.01) |
| Synthetic, stdev = 10 | 0.57 (0.14) | 0.68 (0.10) | 0.17 (0.14) | 0.84 (0.02) | 0.82 (0.01) | 0.47 (0.00) |
| Synthetic, stdev = 1 | 0.97 (0.81) | 0.97 (0.68) | 0.91 (0.76) | 0.99 (0.00) | 0.98 (0.00) | 0.95 (0.00) |
| Superconductors log(Tc) | 0.30 (0.13) | 0.50 (0.11) | 0.30 (0.08) | 0.87 (0.02) | 0.85 (0.03) | 0.76 (0.02) |
| Thermoelectrics log(zT) | 0.23 (0.08) | 0.17 (0.10) | 0.10 (0.12) | 0.52 (0.03) | 0.45 (0.03) | 0.33 (0.02) |
| Steel fatigue strength | 0.24 (0.37) | 0.05 (0.33) | −0.41 (0.29) | 0.99 (0.00) | 0.96 (0.00) | 0.98 (0.00) |
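The sketch below expresses the three models described above as plausible scikit-learn estimators; any hyperparameters beyond those stated in the text are left at library defaults, which may differ from the authors' exact settings.

```python
# The three model types compared in Fig. 3 and Table 2, as plausible
# scikit-learn estimators (settings beyond those stated in the text are defaults).
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.neighbors import KNeighborsRegressor

models = {
    "RF": RandomForestRegressor(n_estimators=100),         # full-depth trees by default
    "ridge": RidgeCV(alphas=[1e-2, 1e-1, 1e0, 1e1, 1e2]),  # generalized CV over alpha
    "1NN": KNeighborsRegressor(n_neighbors=1),             # Euclidean "lookup table"
}
```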
We also wish to comment briefly on the motivation for our choice of the three ML approaches. 1NN is subjectively the simplest possible consistent estimator: in principle, given enough data, it can learn any function. On the other hand, a linear model is subjectively the simplest model that allows for the expression of bias (in the form of the model itself, which is linear), but linear ridge regression is not able to learn an arbitrary function. Finally, RFs are related to nearest-neighbor methods, but are much more powerful, and deliver close to state-of-the-art performance on chemistry problems.35
In Fig. 3a–c, we observe that stronger clustering in the synthetic data (i.e., decreasing cluster standard deviations) creates a stark feature in the R vs. k plot: a deep minimum in model performance when k corresponds to a “natural” number of clusters associated with the dataset (the synthetic dataset has three cluster centroids by construction). This effect leads to large standard deviations in LOCO CV performance across different values of k for clustered data (see Table 2), and suggests we should be skeptical of our ability to accurately assess model performance as clustering becomes more severe. Relatedly, we note that the 1NN model performs well in traditional CV for highly clustered data (synthetic dataset with stdev = 1, and also the steel fatigue strength benchmark). Finally, as random forest and 1NN are both neighborhood-based methods,36 we include a linear ridge regression to show that our conclusions also apply to non-neighborhood methods.
Table 2 shows that RF performs consistently best within traditional CV, which suggests that, when this algorithm has full information in the neighborhood around a test point, it can (as expected) make more accurate predictions than a nearest-neighbor model. Within LOCO CV, we see that while RF achieves the highest R values for the thermoelectric and steel fatigue benchmarks, it fails to outperform 1NN for superconductors and the synthetic data. This result, together with the remarkably strong performance of 1NN for highly clustered data, demonstrates that 1NN is an essential benchmark to contextualize performance of materials informatics models. In other words, ML can enable more efficient discovery of superconductors,26 even if a given ML model's ability to extrapolate directly to the cuprates is no better than that of a 1NN lookup table.
Our LOCO CV results reveal that one-shot extrapolation to entirely new materials classes, without formally taking degree-of-extrapolation into account (e.g., the notion of “distance control” presented by Janet, Chan and Kulik37), poses a significant challenge to ML. This observation, together with the work of Ling et al.,26 suggests that UQ-based sequential learning (i.e., the ability of ML to plan the iterative, systematic exploration of a search space) may be more important to data-driven materials discovery than making extremely accurate predictions of novel materials' properties. We thus frame the ideal application of ML in materials discovery as experiment prioritization, rather than materials property prediction, for which e.g. DFT is often used. We also note that the general difficulty for ML to extrapolate from one cluster (or physical regime) to another provides motivation for further work in transfer learning,38 and could help explain why multitask learning has exhibited some success on physical problems such as molecular property prediction.35