Mengxian Yu,a Yin-Ning Zhou,b Qiang Wanga and Fangyou Yan*a
aSchool of Chemical Engineering and Material Science, Tianjin University of Science and Technology, Tianjin 300457, P. R. China. E-mail: yanfangyou@tust.edu.cn
bDepartment of Chemical Engineering, School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai 200240, P. R. China
First published on 19th April 2024
Machine learning (ML) can provide decision-making advice for major challenges in science and engineering, and its rapid development has led to advances in fields like chemistry & medicine, earth & life sciences, and communications & transportation. Grasping the trustworthiness of the decision-making advice given by ML models remains challenging, especially when applying them to samples outside the domain-of-application. Here, an untrustworthy application situation (i.e., complete extrapolation-failure) that would occur in models developed by ML methods involving tree algorithms is confirmed, and the root cause of its difficulty in discovering novel materials & chemicals is revealed. Furthermore, a universal extrapolation risk evaluation scheme, termed the extrapolation validation (EV) method, is proposed, which is not restricted to specific ML methods and model architecture in its applicability. The EV method quantitatively evaluates the extrapolation ability of 11 popularly applied ML methods and digitalizes the extrapolation risk arising from variations of the independent variables in each method. Meanwhile, the EV method provides insights and solutions for evaluating the reliability of out-of-distribution sample prediction and selecting trustworthy ML methods.
Grasping the trustworthiness of the decision-making advice given by ML models remains challenging.30–33 The trustworthiness of model decision-making is shaped by the whole modeling process, i.e., not only data preparation but also algorithm selection, hyperparameter tuning, etc.34 The accurate prediction of previously unseen cases and the generation of reasonable decisions by ML models derive from the data information available during development. As such, model uncertainty arising from the range and distribution of the data may lead models to make unconvincing (high-risk) decisions. For example, Li et al.35 discovered that ML models trained on Materials Project 2018 may suffer severely degraded performance when predicting new compounds in Materials Project 2021, which was attributed to changes in the distribution of the dataset.
Undoubtedly, if the prediction samples are located inside or on the boundary of the convex hull of the training dataset, the model prediction ability approximates its interpolation ability; if the prediction samples are located outside the convex hull, the model prediction ability depends on its extrapolation ability.36 Thus, on any high-dimensional dataset, interpolation will almost certainly not occur, and model predictability will be more dependent on the extrapolation ability.36,37 For the field of chemistry, the feature space is usually defined by the range of descriptors or the distribution of groups for the molecule.38 The number of descriptors or groups corresponds to the dimension of the feature space. Indeed, an intersection of the prediction samples with the training dataset in one or more dimensions is possible,39 and it is hard to assess the extent of the intersection, which is a source of model uncertainty.
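The interpolation/extrapolation distinction above can be made concrete with a convex-hull membership test. The following is a minimal sketch (not from the paper) using `scipy.spatial.Delaunay`; the 2-D uniform training cloud is an illustrative assumption, and in realistic high-dimensional descriptor spaces such a test becomes impractical, which is exactly why most predictions there are extrapolations:

```python
import numpy as np
from scipy.spatial import Delaunay

def in_convex_hull(train_points, query):
    """True where a query point lies inside (or on) the training convex hull."""
    hull = Delaunay(train_points)
    return hull.find_simplex(np.atleast_2d(query)) >= 0

rng = np.random.default_rng(0)
train = rng.uniform(0.0, 1.0, size=(200, 2))   # illustrative 2-D descriptor space

inside = in_convex_hull(train, [0.5, 0.5])[0]    # interpolation regime
outside = in_convex_hull(train, [2.0, 2.0])[0]   # extrapolation regime
```

Because hull construction scales poorly with dimension, this check is only feasible for low-dimensional feature spaces; for typical descriptor counts the per-variable range checks used later by the EV method are the practical alternative.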
Model uncertainty can be estimated using cross-validation and external validation tools.33,34,40–42 External validation is performed on data not involved in modeling. Cross-validation divides the training set according to various data partitioning schemes (e.g., random, leave-one-out, cluster, or time-split43) to evaluate the performance of the model in future applications. For example, Meredig et al.44 proposed the leave-one-cluster-out cross-validation (LOCO CV) method, which classifies the samples into multiple clusters via the K-means clustering algorithm and then divides the training and test sets according to the clusters. The k-fold-m-step forward cross-validation (kmFCV) proposed by Xiong et al.45 divides the training and test sets according to the sequence of target values. It is worth considering that the property distribution of molecules in the training set may not be identical to the distribution of molecules encountered in the future, i.e., future molecules may fall outside the domain-of-applicability of the model. In time-split cross-validation, the model is trained on data generated before a certain date and tested on a retained dataset generated after that date. The time-split method is thus deemed closer to the evaluation of new discoveries, i.e., prospective validation.34,46 Due to data availability constraints, however, it may be difficult to obtain data that conform to this approach.
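The LOCO CV scheme described above can be sketched in a few lines with scikit-learn: cluster the samples in descriptor space, then hold out one cluster at a time. The synthetic linear data, cluster count, and choice of `LinearRegression` are illustrative assumptions, not details from ref. 44:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=150)

# Cluster samples in descriptor space, then hold out one cluster at a time.
groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

rmses = []
for tr, te in LeaveOneGroupOut().split(X, y, groups):
    model = LinearRegression().fit(X[tr], y[tr])
    rmses.append(mean_squared_error(y[te], model.predict(X[te])) ** 0.5)
loco_rmse = float(np.mean(rmses))
```

Because each held-out cluster occupies a distinct region of descriptor space, the per-cluster RMSEs probe performance on samples unlike the training data, which is the spirit of prospective validation.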
While cross-validation and external validation provide important tools for testing the potential utility of ML workflows, they are unable to distinguish between predictions for in-domain and out-of-domain samples, which makes it hard to quantitatively evaluate the extrapolation ability of an ML model. The consequences would be inconceivable if extrapolation performance degradation, or even outright extrapolation failure, occurred in artificial intelligence (AI)-driven applications, especially in high-risk scenarios such as self-driving cars, automated financial transactions, and smart healthcare. Hence, a method for quantitatively evaluating the extrapolation ability of a model is desired to reasonably circumvent the extrapolation risk.
Here, 11 ML methods are tested for out-of-domain sample prediction results on datasets with linear univariate, linear multivariate, and nonlinear multivariate functional relationships. Based on the extrapolation results, the involvement of the tree algorithm is suspected as the prime culprit in the extrapolation failure of the ML model. Subsequently, the potential reasons are explored by using the random forest (RF) method as an example. To quantitatively evaluate the extrapolation ability, an extrapolation validation (EV) method is proposed. The EV method is applied to ML models built on data with deterministic functional relationships, and to quantitative structure–property relationship (QSPR) models for the glass transition temperature (Tg) of polyimide (PI) in the macromolecular field as a real-world application example.
(1)
(2)
x3 = log(x)   (3)
(4)
(5)
y = x   (6)
y = x1 + x2 + x3 + x4 + x5   (7)
(8)
To observe the extrapolation ability of ML models, we developed models for data with deterministic functional relationships (i.e., linear univariate, linear multivariate, and nonlinear multivariate) by 11 ML methods, including multiple linear regression (MLR), least absolute shrinkage and selection operator (LASSO), ridge regression (Ridge), support vector machine (SVM), Gaussian process regression (GPR), multilayer perceptron (MLP), adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), RF, K-nearest neighbor (KNN), and gradient boosting decision tree (GBDT) algorithms. A test (B) set and a test (F) set are set up to evaluate the extrapolation performance of ML models, where the test (B) set is dominated by dependent variables below the minimum value of the dependent variable in the training set, and the test (F) set is dominated by dependent variables above the maximum value of the dependent variable in the training set. Moreover, a test (I) set, in which the dependent variable falls within the range of the dependent variable of the training set, is used to validate the interpolation ability of the models.
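One plausible way to build such test (B)/(F)/(I) sets is to sort the data by the dependent variable and carve off the extremes. The exact fractions and the interior sampling below are our assumptions for illustration; the paper's precise recipe is not reproduced here. The sketch uses the linear univariate relationship y = x of eqn (6):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3.0, 3.0, size=500)
y = x.copy()                     # linear univariate relationship y = x (eqn (6))

order = np.argsort(y)            # sort samples by the dependent variable
x, y = x[order], y[order]

n = len(y)
b = slice(0, n // 10)            # test (B): smallest 10% of y (assumed fraction)
f = slice(n - n // 10, n)        # test (F): largest 10% of y (assumed fraction)
test_B_x, test_B_y = x[b], y[b]
test_F_x, test_F_y = x[f], y[f]

mid = np.arange(n // 10, n - n // 10)
train_idx = mid[mid % 5 != 0]    # 80% of the middle as the training set
test_I_idx = mid[mid % 5 == 0]   # interleaved 20% of the middle as test (I)
train_y = y[train_idx]
```

By construction, every test (B) target lies below the training minimum and every test (F) target lies above the training maximum, so performance on those sets isolates backward and forward extrapolation, respectively.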
Fig. 1 Schematic of extrapolation degree (ED). h is defined in eqn (9); “forward sequence” means that the samples in the data set are sorted according to the independent variable xi from the smallest to the largest; “backward sequence” means that the samples in the data set are sorted according to the independent variable xi from the largest to the smallest; see eqn (10) for the definition of ED.
The leverage value (h) is part of the applicability domain (AD) of the QSPR model.47 The AD is defined as the space that contains the chemical space of the molecules in the training set. The AD is both important for model evaluation and recognised by the Organization for Economic Co-operation and Development (OECD).48 Within chemistry and drugs among other related fields, h is often used to check compounds affected by structure (i.e., independent variables) in QSPR modeling.49 h is defined by all independent variables in the model, as described in eqn (9).
h = x_i (X^T X)^{−1} x_i^T   (9)
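Eqn (9) is the standard leverage (the diagonal of the hat matrix), computable in a couple of NumPy lines. The random design matrix is only a stand-in for real descriptor data; a useful sanity check is that the leverages sum to the number of descriptors:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))     # rows: training samples; columns: descriptors

XtX_inv = np.linalg.inv(X.T @ X)
# Leverage of each row x_i: h_i = x_i (X^T X)^{-1} x_i^T  (eqn (9))
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
```

Since h is the diagonal of the projection matrix X(X^T X)^{−1}X^T, its sum equals the rank of X (here 4), and each h_i is strictly positive for a full-rank design.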
Considering the contribution from all the independent variables, the serialized h is applied for dividing the training and test sets. Both forward serialization (from small to large values) and backward serialization (from large to small values) are adopted, i.e., forward extrapolation validation and backward extrapolation validation. Following this approach, all independent variables for the developed model are evaluated.
To describe the EV method clearly, a list of its steps is as follows:
(1) Calculate the training set sample leverage value (h) according to eqn (9).
(2) Sort the training set samples by one independent variable (xi) from the smallest to the largest (i.e., forward sequence).
(3) Use the first 80% of the samples in the sorted training set as the training (EV) set and the last 20% as the test (EV) set.
(4) Re-fit the developed model using the training (EV) set.
(5) Use the re-fitted model from step 4 to predict the test (EV) set samples and evaluate its performance on the test (EV) set.
(6) Repeat steps 2 to 5 for h and all independent variables (x).
(7) Change the sorting in step 2 to the order from the largest to the smallest of the training set xi (i.e., backward sequence), and then repeat steps 2 to 6 for h and all x.
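The steps above can be sketched compactly. The sketch below follows the 80/20 split and the forward/backward serialization over every independent variable and h; the synthetic linear data and the choice of `LinearRegression` as the "developed model" are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def ev_rmses(X, y, make_model):
    """Forward/backward serialized extrapolation validation (steps 1-7).

    Sorts the training set by each independent variable (and by the
    leverage h of eqn (9)), re-fits on the first 80% and predicts the
    last 20%. Returns {(variable, direction): RMSE on the test (EV) set}.
    """
    n, p = X.shape
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)   # step 1
    keys = {f"x{i + 1}": X[:, i] for i in range(p)}
    keys["h"] = h
    cut = int(0.8 * n)                                           # step 3 split
    out = {}
    for name, col in keys.items():
        for direction in ("forward", "backward"):                # steps 2 and 7
            order = np.argsort(col)
            if direction == "backward":
                order = order[::-1]
            tr, te = order[:cut], order[cut:]
            model = make_model().fit(X[tr], y[tr])               # step 4
            rmse = mean_squared_error(y[te], model.predict(X[te])) ** 0.5
            out[(name, direction)] = rmse                        # step 5
    return out

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X.sum(axis=1)                # linear multivariate relationship, cf. eqn (7)
res = ev_rmses(X, y, LinearRegression)
avg_rmse = float(np.mean(list(res.values())))
```

For a truly linear relationship, a linear model extrapolates exactly and every RMSEtest(EV) is essentially zero; a tree-based `make_model` substituted into the same loop would expose the extrapolation failure discussed below.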
When serializing extrapolation for one independent variable, it is difficult to ensure that all independent variables in the test (EV) set are outside the corresponding range of the training (EV) set, so the performance on the test (EV) set inevitably includes a contribution from interpolation. Hence, the extrapolation degree (ED; eqn (10) and Fig. 1) is defined as a metric to assist in evaluating the extrapolation ability of the model. Here, ei,j in the definition of ED is the distance by which independent variable i of sample j in the test set falls outside the range of independent variable i in the training set. The ED quantifies the extent to which the independent variables of the test (EV) set are outside the corresponding domain of definition of the training set, thereby digitizing the fraction of the test (EV) set performance that is truly extrapolated. When the performance of the test (EV) set is good and the ED is high, the predicted values of the model are reliable even for samples far from the training domain; when the performance is good but the ED is low, the model may only be able to extrapolate to samples in a narrow range just outside the training domain. Therefore, ED can be used as an auxiliary metric to evaluate the extrapolation ability of a model. Furthermore, the standard deviation of the samples within the 95% confidence level interval (σ95, eqn (S2)†) is presented as a threshold for the evaluation of the extrapolation ability.
If the RMSEtest(EV) of an independent variable is greater than σ95, then it is possible that the prediction error of the model is greater than the difference between the actual value and the mean of the samples within the 95% confidence level interval. The average of all RMSEs of the re-fitted models after serialisation of the independent variables and h, i.e. the average RMSE, is taken as a statistical parameter to evaluate the overall extrapolation ability of the model.
(10)
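Eqn (10) itself is not reproduced in this extraction, so the sketch below covers only the per-sample, per-variable distances ei,j that ED aggregates: the distance by which a test value falls outside the training range of that variable, zero when it lies inside. How eqn (10) combines these distances into a single ED is left to the original equation:

```python
import numpy as np

def out_of_range_distances(X_train, X_test):
    """e_{i,j}: how far descriptor i of test sample j falls outside the
    training range [min_i, max_i]; zero when the value is inside."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    below = np.clip(lo - X_test, 0.0, None)   # distance below the training min
    above = np.clip(X_test - hi, 0.0, None)   # distance above the training max
    return below + above                      # shape (n_test, n_descriptors)

X_train = np.array([[0.0, 0.0],
                    [1.0, 2.0]])              # toy training ranges: [0,1] and [0,2]
X_test = np.array([[0.5, 3.0],                # inside on x1, one unit beyond on x2
                   [-1.0, 1.0]])              # one unit below on x1, inside on x2
e = out_of_range_distances(X_train, X_test)
```

A test (EV) set whose e values are mostly zero is still being judged largely on interpolation, which is exactly the ambiguity ED is designed to expose.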
During hyperparameter tuning of the regressors involving tree-based algorithms (ESI Fig. S1–S3†), all predictions in the test (B) and test (F) sets are close to the minimum and maximum values in the training set, respectively. Particularly noteworthy is that the AdaBoost, RF, and XGBoost models exhibit piecewise functional data relationships as the hyperparameters are varied; we therefore conjecture that this behavior, i.e., producing constant output values over certain ranges of input values, is the reason for the extrapolation failure of models developed by ML methods involving tree algorithms.
Furthermore, for samples far from the domain of definition of the training set, the predicted values of the Ridge and SVM models developed with linear or nonlinear multivariate functional relationship data differ significantly from the observed values (Fig. 2e and f). This emphasizes that the reliability of a model's predicted values for out-of-distribution samples is related to the distance between the sample's independent variables and the domain of definition of the training set. If the ED in the EV is small (i.e., extrapolation contributes little to the test (EV) performance) and the extrapolation performance is poor, while the ED of the prediction sample is large, the reliability of the model's predicted values should be considered low.
Fig. 4 Schematic diagram of the model architecture containing 10 DTs (each with a depth of 4) for data with linear univariate relationships via the RF method.
With multiple independent variables, the dependent variable is a combined transformation of values within these closed intervals. The value domain constituted by the combined transformations of the potential maximum to minimum predicted values of all the independent variables is therefore also a closed interval, which may be the reason for the extrapolation failure of models developed by ML methods involving tree algorithms. It should be noted that because regression models involving tree algorithms have low extrapolation ability, they may have difficulty discovering novel materials or chemicals.
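The closed-interval argument can be demonstrated directly: a tree leaf predicts an average of training targets, so an ensemble of trees can never predict outside [min(y_train), max(y_train)], however far the query lies from the training data. A minimal sketch on the y = x relationship of eqn (6), with illustrative sample sizes and hyperparameters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
x_train = rng.uniform(0.0, 10.0, size=(300, 1))
y_train = x_train.ravel()                      # y = x (eqn (6))

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_train, y_train)

# Query far outside the training interval [0, 10]:
x_query = np.array([[-100.0], [5.0], [100.0]])
pred = rf.predict(x_query)

# Every leaf prediction is an average of training targets, so all predictions
# stay inside [y_train.min(), y_train.max()] no matter how extreme the input.
bounded = (pred >= y_train.min() - 1e-9).all() and (pred <= y_train.max() + 1e-9).all()
```

The predictions at x = −100 and x = 100 clamp near the training minimum and maximum rather than following y = x, which is the complete extrapolation failure described above.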
Fig. 5 Results of extrapolation validation (EV) for the ML model developed with data from (a) linear univariate, (b) linear multivariate, and (c) nonlinear multivariate relationships. |
Fig. 6 11 PI-Tg model extrapolation validation (EV): (a) overall result and (b) average RMSE statistical value. |
In the EV instance for the PI-Tg models, the RMSEtest(EV) of the forward and backward serialized extrapolation validations for models built with tree-involving algorithms is always larger than that for models built with non-tree-involving algorithms, for every I and h (Fig. 6a). To evaluate the overall extrapolation ability of a model, the average RMSE over the extrapolation validations of all independent variables and h, i.e., the average RMSEtest(EV) and the average RMSEtest(model), is used as the statistical parameter. The average RMSEtest(EV) of MLP, MLR, Ridge, GPR, and LASSO is around 20 °C (Fig. 6b), which is close to the experimental measurement error and acceptable.51 By contrast, the average RMSEtest(EV) of the models developed with the RF, KNN, GBDT, XGBoost, and AdaBoost methods is larger, at around 40 °C (Fig. 6b), which suggests that these tree-involving models have relatively poor extrapolation ability.
Furthermore, σ95 is presented as a threshold for the evaluation of the extrapolation ability. If the RMSEtest(EV) of an independent variable is greater than σ95, the prediction error of the model may be greater than the difference between the actual value and the mean of the samples within the 95% confidence level interval. The RMSEtest(EV)s of the ML methods built with tree-involving algorithms are generally high, with several even approaching σ95 (60.35 °C; Fig. 7f–j), for instance, the AdaBoost model for forward extrapolation validation with I10, I18, I3, and I5 and backward extrapolation validation with I9 and I8 (Fig. 7f), and the XGBoost model for I10 and I3 forward serialization extrapolation and I9 backward serialization extrapolation (Fig. 7g). This indicates that the prediction error of these models can even exceed AE when the above-mentioned independent variables in a sample lie far from the corresponding domain of definition of the training set.
For the MLP, MLR, LASSO, GPR, and Ridge models, the RMSEtest(EV) of every I is close to the corresponding RMSEtest(model) (Fig. 7a–e), which indicates their good predictive ability. Furthermore, I10, I13, and I9 of the SVM model have small backward EDs along with large RMSEtest(EV); therefore, in applying this model, if the I10, I13, or I9 values of a prediction sample are smaller than the corresponding minimum in the training set, the predicted values may be unreliable, i.e., extrapolation of such an independent variable is not recommended. In contrast, I5 of the SVM model has a high forward ED but a small RMSEtest(EV). This means that this independent variable has little effect on the prediction reliability of the model when it exceeds the domain of definition of the corresponding training set; therefore, the predicted value of a sample can be considered reliable when such independent variables are extrapolated.
The EV method is independent of the architecture of the developed model. Essentially, the EV method is a pioneering dataset division scheme based on the range of each independent variable/descriptor/dimension in the training set. It evaluates the extrapolation of each variable by serializing each independent variable (i.e., sorting from the smallest to the largest and from the largest to the smallest) and then dividing the training and test sets. Predictive models developed with advanced ML architectures such as the generative adversarial network (GAN), convolutional neural network (CNN), and recurrent neural network (RNN) can also have their extrapolation ability evaluated via the EV method. Meanwhile, it provides the data science community with insights and solutions for evaluating the reliability of out-of-distribution sample prediction in ML models (e.g., molecular and material properties, reaction yields, etc.).
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00256j |
This journal is © The Royal Society of Chemistry 2024 |