Scalable Gaussian processes for predicting the optical, physical, thermal, and mechanical properties of inorganic glasses with large datasets †

Among machine learning approaches, Gaussian process regression (GPR) is an extremely useful technique for predicting composition–property relationships in glasses. The main advantage of GPR over other machine learning methods is its inherent ability to provide the standard deviation of its predictions. However, the method remains restricted to small datasets due to the substantial computational cost associated with it. Herein, using scalable GPR algorithms, namely, kernel interpolation for scalable structured Gaussian processes (KISS-GP) along with massively scalable Gaussian processes (MSGP), we develop composition–property models for inorganic glasses. The models are based on a large dataset with more than 100 000 glass compositions, 37 components, and nine crucial properties: density; Young's, shear, and bulk moduli; thermal expansion coefficient; Vickers' hardness; refractive index; glass transition temperature; and liquidus temperature. We show that the models developed here are superior to state-of-the-art machine learning models. We also demonstrate that the GPR models can reasonably capture the underlying composition-dependent physics, even in regions with very few training data. Finally, to accelerate glass design, the models developed here are shared publicly as part of a package, namely, Python for Glass Genomics (PyGGi, see: http://pyggi.iitd.ac.in).


Introduction
Despite the discovery of over 200 000 glass compositions, the knowledge of composition–property relationships remains restricted to a few selected compositions. 1,2 The development of reliable composition–property maps for a large class of glass components is the bottleneck impeding the design of new glass compositions. Machine learning (ML) methods 3-8 have been used to predict properties such as Young's modulus, 9,10 liquidus temperature, 11 solubility, 12 glass transition temperature, 4,13 dissolution kinetics, 5,14-16 and other properties. 17,18 Most of these works employ the glass composition directly as descriptors, while some others employ physics-based descriptors. 14,19,20 A recent work developed deep learning models to predict composition–property maps of inorganic glasses with 37 input components and eight properties, the largest thus far. 21 However, most of these studies employ deterministic models such as neural networks (NN), random forests, or simple regression models for prediction. These models cannot provide any information about the reliability of the predictions for new test data.
Since ML methods are primarily data-driven, the reliability of a model is highly dependent on the available data. To this end, Gaussian process regression (GPR), 22 a non-parametric ML model, presents an excellent candidate. GPR employs a probabilistic approach, which makes inferences on new data by learning the underlying distribution (mean and covariance) of the available data. 22 Note that various problems in mechanics and materials science employ a probabilistic framework (including GPR and Bayesian inference) to estimate material parameters. 9,23-27 It has been shown that, for small datasets, GPR models are more suitable than NN models for providing accurate composition–property predictions, along with their confidence intervals, in oxide glasses. 9 However, for the large datasets available in materials science, training the conventional GPR, which has O(n³) time and O(n²) space complexity for a dataset of size n, is computationally prohibitive.
Herein, using scalable GPR algorithms, namely, kernel interpolation for scalable structured Gaussian processes (KISS-GP) 28 and massively scalable Gaussian processes (MSGP), 29 we address the challenge of developing reliable GPR models for predicting nine relevant properties of functional glasses: density; Young's, shear, and bulk moduli; thermal expansion coefficient (TEC); Vickers' hardness; refractive index; glass transition temperature (T_g); and liquidus temperature. The models are developed based on a large dataset with more than 100 000 glasses and 37 components. These are the largest models developed thus far for predicting composition–property relationships in inorganic glasses. We show that KISS-GP, along with MSGP, provides rigorous models for large datasets that are superior to state-of-the-art deep neural network (DNN) models. 21 Further, the models provide estimates of the uncertainty associated with the predictions, making them more reliable and robust compared with DNN models. Overall, we show that the methodology presented here can be used to develop GPR models for problems with large training datasets. Finally, the models developed here are made available as part of a software package designed for accelerating glass discovery, namely, Python for Glass Genomics (PyGGi, see: http://pyggi.iitd.ac.in).

Dataset preparation
The raw dataset, consisting of nine properties of inorganic glasses, namely, density, Young's modulus, bulk modulus, shear modulus, Vickers' hardness, glass transition temperature, liquidus temperature, thermal expansion coefficient, and refractive index, was obtained from the literature and commercial databases such as INTERGLAD Ver. 7. Here, we follow the rigorous dataset preparation employed earlier to develop deep learning models for glass properties. 21 The steps involved in the methodology are as follows. (i) Remove duplicate entries from the raw dataset: duplicate entries are first identified in terms of the input components. For the duplicate entries, the mean value of the output property is computed. If all the output values for a given composition with duplicate entries are within ±2.5% of this mean, the output is replaced with the mean value, and the points beyond ±2.5% are removed as outliers. Similarly, outliers in the extreme values are removed by considering the standard deviation of the properties. Specifically, data points lying beyond ±3σ are considered outliers and removed. (ii) Check whether the components add up to 100%: after removing the duplicates and outliers, check whether all the compositions add up to 100% with respect to the input components. The compositions that do not add up are removed. (iii) Use LAR to reduce dimensionality: the raw dataset consists of glass compositions with approximately 270 components. However, many of these 270 components are present in only a few glass compositions. Such a sparse dataset may lead to a poorly trained model, as enough representative samples may not be present in the training and test sets. To overcome this challenge, we employ least angle regression (LAR) for dimensionality reduction.
In particular, the input parameters (that is, the glass components) are chosen based on the dataset's covariance, thereby drastically reducing the number of glass components while still preserving a proper training set for accurate predictions (see ESI † for details). (iv) Select the relevant components from LAR: after performing LAR, a threshold on the number of components is chosen, up to which the model exhibits a notable increase in R². The input components beyond the threshold are ignored. Further, the glass compositions containing those input components are also removed.
(v) Check whether the components add up to 100%: finally, recheck whether the sum of the input components in each of the final compositions adds up to 100%. Only those compositions for which the sum adds up to 100 ± 0.1% are selected. Depending on the property, the final datasets consist of glasses with 32 to 37 oxide components. The frequency of glasses corresponding to each of the components is provided in the ESI. †
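As an illustration only (this is not the authors' actual pipeline), the deduplication, outlier-removal, and sum-to-100% steps above could be sketched in pandas as follows; the column names (`SiO2`, `Na2O`, `density`) and the helper `clean_dataset` are hypothetical:

```python
import pandas as pd

def clean_dataset(df, comp_cols, prop_col, rel_tol=0.025, sigma_cut=3.0):
    """Sketch of steps (i), (ii), and (v): merge duplicates, drop outliers,
    and keep only compositions summing to 100 +/- 0.1%."""
    # (i) Merge duplicate compositions: keep the mean property value when the
    # duplicates agree within +/- rel_tol of that mean; drop rows that do not.
    merged = []
    for _, group in df.groupby(comp_cols):
        mean = group[prop_col].mean()
        keep = group[(group[prop_col] - mean).abs() <= rel_tol * abs(mean)]
        if len(keep):
            row = keep.iloc[0].copy()
            row[prop_col] = keep[prop_col].mean()
            merged.append(row)
    df = pd.DataFrame(merged)

    # Drop extreme values lying beyond +/- sigma_cut standard deviations.
    mu, sd = df[prop_col].mean(), df[prop_col].std()
    df = df[(df[prop_col] - mu).abs() <= sigma_cut * sd]

    # (ii)/(v) Keep only compositions whose components sum to 100 +/- 0.1%.
    totals = df[comp_cols].sum(axis=1)
    return df[(totals - 100.0).abs() <= 0.1].reset_index(drop=True)
```

The actual preparation additionally handles per-property datasets and the LAR-based component selection of steps (iii) and (iv).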

Gaussian process regression (GPR)
Gaussian processes (GPs) are models capable of describing datasets in a probabilistic framework. The main advantages of GP models are: (i) their ability to model complex datasets; and (ii) their ability to estimate the uncertainty associated with predictions through posterior variance computations. A GP is a collection of random variables, any finite subset of which follows a joint Gaussian distribution. As a result, the GPR modeling framework ascribes a distribution over a given set of inputs (x) and outputs (y). 22 A mean function m(x) and a covariance function k(x,x'), the two degrees of freedom needed to characterize a GPR fully, are defined as

$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$$

$$k(\mathbf{x},\mathbf{x}') = \mathbb{E}\left[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))\right]$$

While the mean function m(x) computes the expected value of the output for a given input, the covariance function captures the extent of correlation between function outputs for a given pair of inputs. In the GP literature, k(x,x') is also termed the kernel function of the GP. A widely used rationale for the selection of the kernel function is that the correlation between any two points decreases as the distance between them increases. A popular kernel in the GP literature is the squared-exponential kernel,

$$k(\mathbf{x},\mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{\lVert \mathbf{x}-\mathbf{x}' \rVert^2}{2l^2}\right)$$

where l is termed the length-scale parameter and σ_f² is termed the signal variance parameter. In a GPR model, these hyperparameters can be tuned to model datasets that have an arbitrary correlation structure. Also, the function f ~ GP(m(x), k(x,x')) is often mean-centered to relax the computational complexity. Suppose we have a set of test inputs X_* for which we are interested in computing the output predictions. This warrants sampling a set f_* = [f(x_1*),...,f(x_n*)], such that f_* ~ N(0, K(X_*,X_*)). By the definition of a GP, the new and the previous outputs follow a joint Gaussian distribution,

$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim N\!\left(\mathbf{0},\; \begin{bmatrix} K(X,X) + \sigma_e^2 I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right)$$

where K(X,X) is the covariance matrix between all observed inputs, K(X_*,X_*) is the covariance matrix between the newly introduced inputs, K(X_*,X) is the covariance matrix between the new inputs and the observed inputs, K(X,X_*) is the covariance matrix between the observed points and the new inputs, and I is the identity matrix. Now, applying the rules for Gaussian conditionals, p(f_*|y) can be shown to follow a normal distribution with

$$\mathrm{Mean}(\mathbf{f}_*) = K(X_*,X)\left[K(X,X) + \sigma_e^2 I\right]^{-1}\mathbf{y} \quad (7)$$

$$\mathrm{Covariance}(\mathbf{f}_*) = K(X_*,X_*) - K(X_*,X)\left[K(X,X) + \sigma_e^2 I\right]^{-1}K(X,X_*) \quad (8)$$

Eqn (7) and (8) are employed to make new predictions using the GPR. It should be noted that the experimental values themselves may contain errors, which may not be normally distributed.
For example, the errors in the liquidus temperature may predominantly be in one direction due to the kinetics involved with crystallization at temperatures below the liquidus temperature. The present model does not take into account such variations, and the noise in the experimental values is modeled as a normal distribution.
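For concreteness, a minimal NumPy sketch of the exact GPR predictive equations (7) and (8), for a zero-mean GP with a squared-exponential kernel, might look as follows; the function names and default hyperparameter values are illustrative, not taken from this work:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel k(x,x') = s_f^2 exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X, y, X_star, noise_var=1e-2, **kern):
    """Predictive mean and covariance (eqn (7) and (8)) of a zero-mean GP."""
    K = rbf_kernel(X, X, **kern) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X_star, X, **kern)          # K(X*, X)
    K_ss = rbf_kernel(X_star, X_star, **kern)    # K(X*, X*)
    alpha = np.linalg.solve(K, y)                # (K + s_e^2 I)^-1 y
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov
```

Note that the `np.linalg.solve` call is the O(n³) bottleneck that motivates KISS-GP and MSGP in the following sections.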

Kernel interpolation for scalable structured Gaussian processes (KISS-GP)
The kernel of a GP implicitly depends on the kernel hyperparameters, such as the length-scale, signal variance, and noise variance (collectively denoted as θ), which are unknown and must be inferred from the data. The fully Bayesian posterior inference of θ is non-trivial and often intractable. Hence, to avoid this complexity, the standard practice is to obtain point estimates of θ by maximizing the marginal log-likelihood,

$$\log p(\mathbf{y}\,|\,\theta) \propto -\left[\mathbf{y}^{T}\left(K_{\theta} + \sigma^2 I\right)^{-1}\mathbf{y} + \log\left|K_{\theta} + \sigma^2 I\right|\right]$$

However, evaluating $(K_{\theta} + \sigma^2 I)^{-1}\mathbf{y}$ and $\log|K_{\theta} + \sigma^2 I|$ requires O(n³) operations, while storing the kernel matrix requires O(n²) memory.
Approaches like the subset of regressors (SoR) 30 and the fully independent training conditional (FITC) 31 are traditional approaches used to scale GP inference. Recently, Wilson et al. 28 introduced a structured kernel interpolation (SKI) framework, which generalizes inducing-point methods such as FITC and SoR for scalable GP inference. For instance, the kernel in the SoR approach, k_SoR, is computed as

$$k_{\mathrm{SoR}}(\mathbf{x},\mathbf{x}') = k_{\mathbf{x}U}\,K_{UU}^{-1}\,K_{U\mathbf{x}} \quad (10)$$

where k_xU (size 1 × m), K_UU⁻¹ (size m × m), and K_Ux (size m × 1) are covariance matrices generated from the exact kernel k(x,x') for a set of m inducing points [u_1,...,u_m]. Under the SKI framework, the exact kernel is replaced with an approximate kernel for fast computation by modifying k_SoR, considering K_XU ≈ W K_UU, where W is an n × m interpolation matrix, which is highly sparse. Therefore, eqn (10) can be rewritten as

$$K_{\mathrm{SKI}} = W\,K_{UU}\,W^{T} \quad (11)$$

This general approach to approximating GP kernel functions is the basic framework of SKI, 40 which in turn reduces the computational expense considerably, down to O(n).
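A one-dimensional toy sketch of the SKI idea, replacing the exact kernel matrix K(X,X) with W K_UU W^T built from local linear interpolation onto a regular grid of inducing points, is given below; all names and values are assumptions for illustration, not the KISS-GP implementation itself:

```python
import numpy as np

def rbf(a, b, l=0.5):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l**2)

def interp_weights(x, grid):
    """Interpolation matrix W (n x m): each row has only two nonzeros,
    the linear-interpolation weights onto the two nearest grid points."""
    n, m = len(x), len(grid)
    W = np.zeros((n, m))
    h = grid[1] - grid[0]
    for i, xi in enumerate(x):
        j = min(int((xi - grid[0]) / h), m - 2)
        t = (xi - grid[j]) / h
        W[i, j], W[i, j + 1] = 1.0 - t, t
    return W

# SKI: approximate the exact kernel K(X, X) with W K(U, U) W^T,
# where U is a regular grid of m inducing points.
x = np.sort(np.random.RandomState(0).uniform(0.0, 1.0, 50))
U = np.linspace(0.0, 1.0, 100)
W = interp_weights(x, U)
K_exact = rbf(x, x)
K_ski = W @ rbf(U, U) @ W.T
```

Because W is sparse and K(U,U) has grid structure (Toeplitz/Kronecker in the actual method), matrix-vector products with K_SKI are cheap, which is what brings the training cost down to O(n).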

Massively scalable Gaussian process (MSGP)
While KISS-GP makes learning faster, up to O(n), the computational complexity of test predictions remains the same as in the traditional GP. Wilson et al. 29 introduced MSGP, which extends KISS-GP to: (i) make test predictions significantly faster, up to O(1); (ii) scale marginal likelihood evaluations without requiring any grid structure; and (iii) project the input data to a lower-dimensional space to avoid the curse of dimensionality. In MSGP, the predictive mean is computed as

$$\mathrm{Mean}(\mathbf{f}_*) \approx W_{*}\,K(U,U)\,W^{T}\left[K(X,X) + \sigma_e^2 I\right]^{-1}\mathbf{y} \quad (12)$$

which is obtained by approximating K(X_*,X) employing SKI, as given by eqn (11). 36 Here, it should be noted that the term $K(U,U)\,W^{T}[K(X,X) + \sigma_e^2 I]^{-1}\mathbf{y}$ is pre-computed during training, reducing the cost of online computations to O(1). Along similar lines, the predictive covariance is computed as

$$\mathrm{Covariance}(\mathbf{f}_*) \approx \mathrm{diag}(K(X_*,X_*)) - \mathrm{diag}\!\left(K(X_*,X)\left[K(X,X) + \sigma_e^2 I\right]^{-1}K(X,X_*)\right) \quad (13)$$

The diagonal operator in eqn (13) is a consequence of the fact that the kernel matrices are highly sparse in the off-diagonal directions. The covariance computation in eqn (13) can be further simplified utilizing SKI as

$$\mathrm{Covariance}(\mathbf{f}_*) \approx \mathrm{diag}(K(X_*,X_*)) - W_{*}\,\mathrm{diag}\!\left(K(U,X)\left[K(X,X) + \sigma_e^2 I\right]^{-1}K(X,U)\right) \quad (14)$$

Here, the term $\mathrm{diag}(K(U,X)[K(X,X) + \sigma_e^2 I]^{-1}K(X,U))$ can also be pre-computed, 36 leading to an overall computational cost for evaluating the predictions of O(1).
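The O(1) prediction trick can be illustrated with a one-dimensional NumPy toy example: the term K(U,U) W^T (K(X,X) + σ_e²I)⁻¹y is computed once at training time, after which each test-point mean requires only a couple of interpolation weights. This is a sketch under assumed names and values, not the MSGP implementation itself:

```python
import numpy as np

def rbf(a, b, l=0.5):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l**2)

def interp_row(x_new, grid):
    """Linear-interpolation weights of a single point onto a regular grid."""
    h = grid[1] - grid[0]
    j = min(int((x_new - grid[0]) / h), len(grid) - 2)
    t = (x_new - grid[j]) / h
    w = np.zeros(len(grid))
    w[j], w[j + 1] = 1.0 - t, t
    return w

# --- Training-time pre-computation (done once) ---
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2.0 * np.pi * x)
U = np.linspace(0.0, 1.0, 100)                    # grid of inducing points
W = np.stack([interp_row(xi, U) for xi in x])     # sparse in practice
K = W @ rbf(U, U) @ W.T + 1e-2 * np.eye(len(x))   # SKI training covariance
alpha = rbf(U, U) @ W.T @ np.linalg.solve(K, y)   # K(U,U) W^T (K + s^2 I)^-1 y

# --- Test-time: each mean prediction touches only two grid weights ---
def predict_mean(x_new):
    return float(interp_row(x_new, U) @ alpha)
```

All of the expensive linear algebra happens once when `alpha` is formed; `predict_mean` afterwards does a constant amount of work per query point, mirroring the O(1) claim above.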

Materials Advances Paper
Results and discussion

Fig. 1 shows the distribution of the nine properties in the processed dataset used for training the GPR models. We observe that all the properties in the dataset are distributed over a wide range, most of them spanning over an order of magnitude. Note that an exhaustive dataset cleaning and preparation were performed on the raw dataset (see Methods) for all nine properties, from density to liquidus temperature. These represent the most extensive composition–property models developed in the glass science literature, covering most of the human-made glass compositions. 21 Further details on the dataset, including the distribution of the glass compositions with respect to the number of components and for each of the input components, are provided in the ESI † (see Fig. S1 and S2). We train this dataset employing KISS-GP (see Methods) with hyperparameter tuning to develop optimized models. Note that we use KISS-GP 28 with Lanczos variance estimates (LOVE), 33 which significantly reduces the computational time and storage complexity (see Methods). Further, the prediction for high-dimensional data is carried out using MSGP. Due to the O(1) nature of MSGP, 29 the computational resources associated with prediction are independent of the size of the data, thus enabling faster predictions (see Methods for details). Fig. 2 shows the predicted values of density; Young's, shear, and bulk moduli; TEC; Vickers' hardness; refractive index; T_g; and liquidus temperature, in comparison to the measured experimental values for the trained GPR model with KISS-GP and MSGP. Since there are many overlapping points, a heat map is used in Fig. 2, wherein the coloring scheme represents the number of points per unit area for each property. Note that only the test set is plotted in the figure, although the R² values associated with the training and validation sets are also provided.
The inset for each property shows the probability density of the prediction error with a 90% confidence interval. We observe that the R² values for all the properties are equal to or greater than 0.8, suggesting well-trained models. Further, the R² values of the training, validation, and test sets are comparable, thereby confirming the goodness-of-fit of the models. Now, we demonstrate a key attractive feature of the proposed GPR-based approach: its ability to provide the uncertainty associated with each prediction. First, we analyze the standard deviation predicted by KISS-GP for the test dataset, that is, the dataset unseen by the model. Fig. 3 shows the histogram of the absolute values of the standard deviations |σ| corresponding to the predictions for compositions in the test dataset. Note that the |σ| mentioned here is the standard deviation predicted by KISS-GP and not the error in the predictions. This procedure is repeated for all nine properties considered. We observe that the distribution is unimodal, with a peak close to zero in most cases, which suggests that the models are reasonable, with the predictions for most of the data points in the test set exhibiting high confidence. We also observe that the distribution for some properties is notably broader than for others (for example, shear modulus, Young's modulus, liquidus temperature, and hardness). Specifically, hardness exhibits the maximum standard deviation among all the properties, suggesting lower reliability in the predictions. Such decreased reliability for hardness could be attributed to the spread in the dataset and also to the nature of the property itself. Unlike properties such as Young's modulus, hardness is not an intrinsic material property and depends highly on the measurement techniques and conditions used. 34-36 Thus, the final hardness model, developed on a dataset with higher noise, exhibits a larger standard deviation.
Second, we analyze the contribution of each of the input components in the glass composition toward the standard deviation in the prediction. To this end, we select the glass compositions having a non-zero value of a given component in the test dataset. For the selected compositions having the given component, the standard deviations associated with the KISS-GP predictions are computed. Then, the mean value of the standard deviations obtained for all the predictions is computed for the given component. For example, to compute the mean standard deviation associated with SiO2 for Young's modulus, all the glass compositions having non-zero SiO2 values are selected from the Young's modulus dataset. Then, the standard deviation associated with the KISS-GP prediction for each of these compositions is computed. The mean value of the standard deviations thus obtained provides the mean standard deviation associated with SiO2 for Young's modulus. The procedure is repeated for all 37 components and nine properties. Fig. 4 shows the mean value of the standard deviation of the Young's modulus predictions corresponding to each of the input components considered. We observe that there is a direct correlation between the mean standard deviation and the frequency of components in the dataset. Specifically, components exhibiting high values of standard deviation are the ones that have very few data points (see Fig. S1 in the ESI †). On the contrary, the components that are present in many glasses, such as SiO2, Na2O, Al2O3, and CaO, to name a few, exhibit low standard deviation and spread in the prediction (see Fig. S1 in the ESI †). Similar behavior is observed for the other properties as well (see Fig. S4 in the ESI †). This observation is in agreement with the fact that the training for a particular input component improves if there is enough data associated with that component.
Overall, the results suggest that the predictions for components that are present in a larger number of glasses are more reliable, and vice versa. As such, the performance of the model could be improved by increasing the number of glasses corresponding to those components for which the data is at present sparse. Note that the model does not exhibit any trend in the accuracy of predictions with respect to the number of components in the glass (see ESI †). That is, whether a glass consists of two components or ten components, the predictive accuracy of the model is comparable.
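The per-component averaging procedure described above could be sketched as follows, for a hypothetical composition matrix `X` (one row per glass, one column per component, in mol%) and a vector `sigma` of KISS-GP-predicted standard deviations; the function name is an assumption for illustration:

```python
import numpy as np

def mean_sigma_per_component(X, sigma, component_names):
    """Average the predicted standard deviation over all glasses that
    contain each component (i.e., have a non-zero mole fraction of it)."""
    result = {}
    for j, name in enumerate(component_names):
        mask = X[:, j] > 0            # glasses containing this component
        result[name] = float(sigma[mask].mean()) if mask.any() else float("nan")
    return result
```

Running this per property and plotting the resulting dictionary against the component frequency reproduces the kind of analysis shown in Fig. 4.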
Finally, to check whether the model can capture the underlying physics, we focus on a binary sodium borate (NB) glass series, that is, (Na2O)x·(B2O3)1−x. This glass series exhibits the well-known boron anomaly, 34,37-39 wherein the properties show a highly non-monotonic variation owing to the variable coordination of boron (three to four) with increasing sodium content. Specifically, most of the glass properties exhibit an inflection for sodium content varying from 30% to 50%. Fig. 5 shows the nine properties predicted for the binary NB glass with the 2σ and 3σ confidence intervals, along with the experimental values. First, we observe that the predictions closely match the experimental values for all the properties. Second, we observe that the model exhibits a significantly lower standard deviation in the domain where experimental data exist and a high standard deviation in the regions where data are not available. This suggests that the model exhibits increased reliability for interpolation, while its confidence decreases for extrapolation. Third, and most interestingly, we observe that the model can capture the boron anomaly for all the properties. Precisely, even for properties such as shear modulus, Young's modulus, refractive index, and hardness, for which very few data points exist in this region, the model can capture the non-monotonic behavior associated with the boron anomaly, in agreement with theoretical models.
To test the generality, we choose a ternary glass composition of sodium borosilicate, x(Na2O)·y(B2O3)·(1−x−y)(SiO2). Fig. 6 shows the standard deviation of the predicted values over the entire range of the ternary using the trained KISS-GP model. The mean values of the properties, representing the best estimate of the model for this ternary, are provided in the ESI. † The compositions corresponding to the measured values in the original data (which may belong to the training, validation, or test set) are marked using black squares. We observe that the standard deviation in the predictions for compositions close to the original dataset is significantly low. As the compositions move farther away from the original dataset, the standard deviation of the predicted values increases. This behavior is consistently observed for all the properties (see Fig. 6). This is because, in KISS-GP, the training is carried out by identifying the distribution that reduces the variance for a known data point (that is, training data) to zero, or at least very close to it. As such, when the model is extrapolated to domains without any training data, the inference becomes poorer, as represented by larger standard deviation values. Nevertheless, we observe that the standard deviation for most of the properties is relatively low, confirming high confidence in the values predicted by the model. Overall, we observe that KISS-GP allows the development of reliable composition–property models, quantifying the uncertainty in predictions when extrapolated over the entire compositional space. Now, we compare the performance of the GPR models with some state-of-the-art ML models. 21 Table 1 shows the R² values of KISS-GP in comparison to linear regression, XGBoost, and DNN (see ref. 21 for details) on the test dataset. Note that the dataset used for training all the models is the same.
Further, only the test-dataset R² values are shown, to ensure a fair comparison of the models on unseen data. In terms of the R² values on the overall dataset, GP performs better than all the other methods. We observe that for seven out of the nine properties, KISS-GP outperforms all other methods, including DNN, in terms of the R² values on the test data. The increase in R² for some of these properties, for example, liquidus temperature, Young's modulus, and shear modulus, ranges from 3 to 5%, which is a notable increase in accuracy. These results confirm that the predictions obtained from KISS-GP and MSGP are reliable and superior to those of other state-of-the-art methods, as presented in Table 1. Besides, the uncertainty quantification in the KISS-GP- and MSGP-based property predictions is quite useful for interpreting the model's validity in experimentally unexplored regimes of the compositional space. This feature is severely lacking in the other, deterministic, models. At this point, it should be noted that although KISS-GP is able to capture the mean values of the function correctly, the prediction of the standard deviation is highly sensitive to the training process. This could be attributed to the deep kernel learning (DKL) layer present in the KISS-GP framework, which reduces the dimensionality of the input features while training the model. This is a disadvantage associated with
the KISS-GP in comparison to the classical GP, which can be addressed to a reasonable extent by hyperparameter optimization of the DKL and early stopping.
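For reference, the R² metric used for the comparisons in Table 1 is the standard coefficient of determination, which can be computed as below (a generic sketch, not the authors' evaluation code):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot
```

R² equals 1 for a perfect fit, 0 for a model no better than predicting the mean, and can be negative for a model worse than the mean.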

Conclusions
Altogether, employing KISS-GP and MSGP, we show that reliable composition–property models can be developed for large datasets. These models, which predict the density; Young's, shear, and bulk moduli; TEC; Vickers' hardness; refractive index; T_g; and liquidus temperature of inorganic glasses with up to 37 input components (the largest so far), allow the exploration of a broad compositional space that was hitherto unexplored. In addition, the KISS-GP models are able to capture the uncertainty associated with the predictions when extrapolated beyond the training data. Further, the KISS-GP models yield superior results when compared against state-of-the-art methods such as XGBoost or DNN models. We show that the KISS-GP models are able to capture the underlying physics without any explicit training, even for glass compositions and properties with sparse data. Thus, the overall contributions of this work are as follows: (i) the development of reliable composition–property models for nine glass properties; (ii) the quantification of uncertainty in predictions by employing scalable Gaussian processes on a large glass dataset; and (iii) making the models available to the glass community for accelerating glass design. The generic approach presented here can be applied to develop composition–property relationships for a wide range of materials involving extensive data, with detailed uncertainty quantification. The models developed here have been made available as part of a package, Python for Glass Genomics (PyGGi, see: http://pyggi.iitd.ac.in). 40

Fig. 6 The standard deviation of values predicted using KISS-GP. Standard deviation predicted by the trained GPR models of (a) bulk modulus, (b) shear modulus, (c) Young's modulus, (d) density, (e) liquidus temperature, (f) refractive index, (g) thermal expansion coefficient (TEC), (h) glass transition temperature (T_g), and (i) hardness for sodium borosilicate glasses. Experimental compositions are marked using black squares.

Data availability
The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Code availability
All the codes used in the present work are available on the GitHub page: https://github.com/m3rg-repo/machine_learning_glass/tree/master/Scalable_Gaussian_Process.

Conflicts of interest
There are no conflicts to declare.