Pallavi Dandekar‡a, Aditya Singh Ambesh‡a, Tuhin Suvra Khanb and Shelaka Gupta*a
aMMEC Lab, Department of Chemical Engineering, Indian Institute of Technology Hyderabad, Kandi, Sangareddy-502285, Telangana, India. E-mail: shelaka@che.iith.ac.in
bClimate Change and Data Science Division, CSIR – Indian Institute of Petroleum, Dehradun-248005, India
First published on 20th March 2025
Data-driven machine learning (ML) based methods have the potential to significantly reduce the computational as well as experimental cost of rapid, high-throughput screening of catalyst materials using binding energy as a descriptor. In this study, a set of eight widely used ML models, classified as linear, kernel and tree-based ensemble models, were evaluated to predict the binding energy of catalytic descriptors (CO* and OH*) on (111)-terminated Cu3M alloy surfaces using readily available metal properties from the periodic table as features. Among all the models tested, the extreme gradient boosting regressor (xGBR) model showed the best performance, with root mean square errors (RMSEs) of 0.091 eV and 0.196 eV for CO and OH binding energy predictions on (111)-terminated A3B alloy surfaces. Moreover, the xGBR model gave the highest R2 scores of 0.970 and 0.890 for CO and OH binding energies. The time taken by the ML predictions for 25 000 fits for each model varied between 5 and 60 min on a 6-core, 8 GB RAM laptop, which is negligible compared to DFT calculations. Our ML model showed remarkable performance in accurately predicting the CO and OH binding energies on (111)-terminated Cu3M alloys, with a mean absolute error (MAE) of 0.02 to 0.03 eV compared to DFT calculated values. The ML predicted binding energies can be further used with an ab initio microkinetic model (MKM) to efficiently screen A3B-type bimetallic alloys for the formic acid decomposition reaction.
An increase in the speed of determining the descriptor variable would definitely accelerate the discovery of catalysts. In this regard, quantum ML models provide instantaneous access to the descriptor.16,17 Based on the physical and chemical characteristics of materials, ML models have been used for the screening of large material databases to find materials that may perform well.18 Gradient Boosting Regression (GBR)19 was used for the prediction of CO binding energy on Pt nanostructures.20 Liu et al. used ML models for the prediction of the binding energies of C and O atoms on bimetallic surfaces and used these binding energies for the steam methane reforming reaction.21 Saxena et al. also used an ML based approach for predicting C and O adsorption energies on Cu-based bimetallic surfaces and used these predictions in an MKM for the ethanol decomposition reaction.22 Wang et al. used a similar approach to screen sulphur resistant catalysts for the steam methane reforming reaction by using an ensemble model to predict the binding energies of C, H, S and O on bimetallic surfaces and combined these predictions with the MKM model.23
Formic acid has been suggested as a suitable material for H2 storage.24,25 On transition metal catalysts, formic acid decomposes either via dehydrogenation to form H2 and CO2 or via dehydration to produce CO and H2O.26–28 However, CO produced via the latter pathway deactivates the catalyst surface. To develop formic acid as a H2 storage material, efficient catalysts that can easily decompose formic acid to H2 and CO2 are needed. Towards this, pure Cu has been observed to selectively catalyse the dehydrogenation reaction,29 but at reduced rates.30 On the other hand, Cu-based bimetallic catalysts such as Cu3Pt have shown good activity towards formic acid dissociation into H2 and also inhibit CO poisoning.31–33 Furthermore, the binding energies of all the reaction species formed during formic acid decomposition can be scaled with the CO and OH adsorption energies, as these two species are already involved in the reaction mechanism and their adsorption energies are known to correlate well with carbon and oxygen adsorption energies, respectively.34,35 Therefore, in this study we have used ML with simple and easily accessible features to train prediction models that can accurately predict the binding energy of the descriptors (CO and OH) on (111)-terminated Cu3M (where M is a guest metal) alloy surfaces. CO and OH binding energies are also key descriptors for other reactions, such as CO2 reduction,36,37 the reverse water gas shift reaction,38,39 methanol electro-oxidation,40,41 etc. Furthermore, the intermediate CO binding energy on Cu-based catalysts makes them selective for the above reactions.42,43 The ML predicted CO and OH binding energies in this study can be used with an ab initio microkinetic model (MKM) to calculate the catalytic rates for all the above reactions over bimetallic alloys.32 Li et al. have also used an ML-based approach to predict OH and CO binding energies on (111)-terminated metal surfaces.44 However, the features used as input to their ML models were derived from DFT local density of states calculations, structural optimization and modelling iterations, which increase the time and limit the transferability of the method.21 On the other hand, utilizing easily accessible metal properties as features can effectively address these issues. Therefore, it is essential to identify the intrinsic properties of catalysts, i.e., features that are not only closely related to the adsorption properties but also have physical meaning.45 The features should encapsulate the geometric and electronic properties of the local environment of surface-active sites while also capturing key characteristics of the adsorbates. Additionally, they should be readily accessible from databases to enhance the efficiency of machine learning frameworks. Basic elemental properties, such as the atomic number, atomic radius, period number, group number, electronegativity, etc., which can be easily obtained from periodic tables and databases, have been widely used in ML for predicting alloy performance.46–52 In this study, different ML algorithms were evaluated to predict the binding energy of catalytic descriptors (CO* and OH*) on (111)-terminated Cu3M alloy surfaces using readily available metal properties from the periodic table as features, and to put forward the advantage of the extreme gradient boosting regressor (xGBR) over other ML models. Compared to DFT calculations, the computational time required for the ML model predictions was negligible.
A3B-type bimetallic alloys (Fig. 1) contain A and B metals in a 75 : 25 composition, respectively.53–55 In this study, only alloys with a formation energy below 0.2 eV per unit cell were deemed potentially stable and were considered.56,57 Furthermore, the surface metals' individual physical properties, such as period, group, atomic number, atomic radius, atomic mass, boiling point, melting point, electronegativity, heat of fusion, ionization energy, density, surface energy, etc., play an important role in adsorbate/metal interactions. These elemental properties can be easily obtained from the periodic table and other databases.58,59 The surface energy dataset was obtained from Tran et al.60 The above-mentioned features were used to uniquely represent each bimetallic alloy and have been shown to produce sufficiently accurate results.48,61 A total of 18 distinct features for the main metal and 18 for the guest metal in the alloy were used, as shown in Table S1 (ESI†).
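To make this representation concrete, the sketch below assembles one alloy's 36-feature vector by concatenating the main-metal and guest-metal properties. The property names and (rounded) numerical values are illustrative placeholders, not the exact Table S1 entries:

```python
# Illustrative 36-feature representation of a Cu3Pt alloy: 18 properties of
# the main metal (Cu) followed by 18 of the guest metal (Pt). Only five of
# the 18 properties are shown; values are rounded, illustrative entries.
cu = {"atomic_number": 29, "group": 11, "period": 4,
      "electronegativity": 1.90, "melting_point_K": 1358}
pt = {"atomic_number": 78, "group": 10, "period": 6,
      "electronegativity": 2.28, "melting_point_K": 2041}

keys = sorted(cu)  # fixed feature ordering shared by every alloy
cu3pt_features = [cu[k] for k in keys] + [pt[k] for k in keys]
```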
Fig. 1 A (111)-terminated A3B bimetallic surface for the ML model development, where A and B represent metal elements across the periodic table.
The implementation of the ML algorithms was carried out using the popular open-source library Scikit-Learn.62 All the features were used in building the above-mentioned ML models. To assess the predictive efficacy of the ML algorithms, the dataset was initially split into two subsets: training data and testing data. Various pairs of training and testing data with different split ratios were tested for each model to obtain the best regression models. The accuracy of the models was evaluated based on the root mean-squared error (RMSE) and coefficient of determination (R2), which are defined as follows:
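For $n$ samples, with $y_i$ the DFT calculated binding energy, $\hat{y}_i$ the ML predicted value and $\bar{y}$ the mean of the DFT values, these take their standard forms:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$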
The accuracy of the models is also affected by the values of the hyperparameters. Therefore, for each model, a diverse set of hyperparameters, as listed in Table 1, columns (i) and (iii) for CO and OH binding energies, respectively, were tested and optimized via randomized search cross validation (RCV)63 in Scikit-Learn. K-fold cross-validation (k = 10) with 50 iterations was used and repeated 50 times, corresponding to 25 000 fits over different splitting ratios of training and testing data. Each split had a unique set of best hyperparameters for the corresponding RMSE, and even the lowest RMSEs across splits corresponded to different hyperparameter combinations. This suggests that while some parameters consistently impacted performance, their optimal values were sensitive to data partitioning. This variability highlights the importance of rigorous tuning and multiple resampling strategies to ensure model robustness and generalizability and to avoid data bias and overfitting. The tuning of each hyperparameter was carefully balanced to avoid over- and under-parametrized models, aiming at models with optimal predictive capabilities. The hyperparameter tuning process was guided by an iterative and adaptive approach. Initially, for each model, the most influential hyperparameters, i.e., those that play a crucial role in controlling underfitting and overfitting and in optimizing the bias-variance trade-off, were identified. These parameters were selected based on prior research and their effectiveness in similar machine learning models.22,48,51,64 Each model was initialized with commonly used parameter ranges. After analysing the RMSE values from the initial tuning phase, the hyperparameter ranges were redefined, tailoring them to each model based on the previously observed best RMSE. This process was repeated across all chosen hyperparameters until the best possible RMSE values were achieved or no further improvement in RMSE and R2 scores was observed. This iterative tuning process, combined with RCV, allowed us to efficiently explore the hyperparameter space while adapting the search to each model's performance characteristics. Furthermore, the Sobol sequence65 helped achieve a more systematic and even distribution of initial points in the hyperparameter space. This helped the search algorithm explore promising areas more efficiently, making the tuning process more effective and time saving. The most effective ML model for predicting CO and OH binding energies on (111)-terminated A3B bimetallic alloys was identified by evaluating the RMSE and R2 scores obtained at the optimal hyperparameter settings for each model. The Seaborn library66 was used for the construction of the correlation matrix. Moreover, Principal Component Analysis (PCA) was used to extract the essential patterns or information from the original features and to check the effect of dimensionality reduction on the ML models' performance.67
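As an illustration of this workflow, the sketch below is a minimal, hypothetical version of the randomized search for the xGBR grid of Table 1; the feature matrix `X` (36 elemental features per alloy) and target `y` (DFT binding energies) are assumed to be prepared elsewhere. One such call performs 50 × 10 = 500 fits, and repeating it 50 times over different splits gives the 25 000 fits quoted above:

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from xgboost import XGBRegressor

# Hyperparameter grid for xGBR, taken from Table 1 (CO column (i))
param_distributions = {
    "max_depth": np.arange(3, 15, 1),
    "learning_rate": np.arange(0.01, 0.1, 0.01),
    "n_estimators": np.arange(200, 1000, 25),
    "min_child_weight": np.arange(3, 15, 1),
    "subsample": np.arange(0.7, 1.0, 0.01),
}

# One 30/70 test/train split; in the paper this is repeated 50 times
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions=param_distributions,
    n_iter=50,                    # 50 candidate settings per split
    cv=KFold(n_splits=10),        # k = 10 folds -> 500 fits per split
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_, "RMSE:", -search.best_score_)
```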
Table 1 Tested and RCV-tuned hyperparameters of the ML models for CO and OH binding energy predictions

| S. no. | Model | Tested hyperparameters for CO binding energy | Tuned hyperparameters of the best estimator using RCV for CO | Tested hyperparameters for OH binding energy | Tuned hyperparameters of the best estimator using RCV for OH |
|---|---|---|---|---|---|
| | | (i) | (ii) | (iii) | (iv) |
| 1 | LR | No parameters | No parameters | No parameters | No parameters |
| 2 | KNN | Algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute'] | Algorithm = ['brute'] | Algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute'] | Algorithm = ['ball_tree'] |
| | | Leaf_size = np.arange(5, 50, 5) | Leaf_size = [10] | Leaf_size = np.arange(5, 50, 5) | Leaf_size = [45] |
| | | n_neighbors = np.arange(1, 20, 1) | n_neighbors = [4] | n_neighbors = np.arange(1, 20, 1) | n_neighbors = [6] |
| | | P = [1, 2] | P = [1] | P = [1, 2] | P = [1] |
| | | Weights = ['uniform', 'distance'] | Weights = ['distance'] | Weights = ['uniform', 'distance'] | Weights = ['distance'] |
| 3 | SVR | C = np.arange(500, 1500, 10) | C = [890] | C = np.arange(500, 1500, 10) | C = [500] |
| | | Kernel = ['rbf'] | Kernel = ['rbf'] | Kernel = ['rbf'] | Kernel = ['rbf'] |
| | | Degree = [2, 3, 4] | Degree = [4] | Degree = [2, 3, 4] | Degree = [4] |
| | | Gamma = ['scale'] | Gamma = ['scale'] | Gamma = ['scale'] | Gamma = ['scale'] |
| 4 | KRR | Alpha = uniform(0.01, 100) | Alpha = 7.6 | Alpha = uniform(0.01, 100) | Alpha = 79.71 |
| | | Degree = [2, 3, 4] | Degree = 5 | Degree = [2, 3, 4] | Degree = 4 |
| | | Kernel = ['linear', 'rbf', 'poly', 'sigmoid'] | | Kernel = ['linear', 'rbf', 'poly', 'sigmoid'] | |
| 5 | RFR | Max_depth = np.arange(3, 15, 1) | Max_depth = [14] | Max_depth = np.arange(3, 15, 1) | Max_depth = [8] |
| | | n_estimators = np.arange(100, 800, 50) | n_estimators = [400] | n_estimators = np.arange(100, 800, 50) | n_estimators = [200] |
| 6 | ETR | Max_depth = np.arange(10, 20, 1) | Max_depth = [10] | Max_depth = np.arange(3, 20, 1) | Max_depth = [4] |
| | | N_estimators = np.arange(100, 400, 10) | N_estimators = [120] | N_estimators = np.arange(100, 800, 10) | N_estimators = [170] |
| | | Random_state = [42] | Random_state = [42] | Random_state = [42] | Random_state = [42] |
| 7 | GBR | Alpha = uniform(0.05, 1.0) | Alpha = [0.2268] | Alpha = uniform(0.05, 1) | Alpha = [0.1720] |
| | | Learning_rate = np.arange(0.01, 0.1, 0.01) | Learning_rate = [0.069] | Learning_rate = np.arange(0.1, 0.5, 0.01) | Learning_rate = [0.1599] |
| | | n_estimators = np.arange(100, 500, 25) | n_estimators = [250] | n_estimators = np.arange(100, 800, 25) | n_estimators = [125] |
| | | Max_depth = np.arange(5, 15, 1) | | Max_depth = np.arange(3, 15, 1) | |
| | | | Random_state = [20] | | Random_state = [20] |
| 8 | xGBR | Max_depth = np.arange(3, 15, 1) | Max_depth = [11] | Max_depth = np.arange(3, 15, 1) | Max_depth = [11] |
| | | Learning_rate = np.arange(0.01, 0.1, 0.01) | Learning_rate = [0.02] | Learning_rate = np.arange(0.01, 0.5, 0.01) | Learning_rate = [0.09] |
| | | N_estimators = np.arange(200, 1000, 25) | N_estimators = [625] | N_estimators = np.arange(400, 1000, 25) | N_estimators = [725] |
| | | Min_child_weight = np.arange(3, 15, 1) | Min_child_weight = [3] | Min_child_weight = np.arange(5, 15, 1) | Min_child_weight = [13] |
| | | Subsample = np.arange(0.7, 1, 0.01) | | | |
For selecting an ML model, features play an important role.71,72 It is important to choose readily accessible but characteristic values as features that link to the target value, i.e., binding energy. Regarding the choice of features for the bimetallics used in our work, we chose 36 physical properties, such as the melting point, boiling point, surface energy, electronegativity, group, atomic number, etc., as mentioned in Table S1 (ESI†): 18 for the main metal (A) and 18 for the guest metal (B). These values are easily available from the periodic table and standard reference sources.58,59 Readily available physical characteristics of metals have been used previously as features for the prediction of binding energies of CH4-related species on Cu-based alloys using tree-based ensemble algorithms.48 Similarly, in another study conducted by Saxena et al., physico-chemical properties easily available in the periodic table were used as features in an ML model to predict the binding energies of oxygen and carbon for the screening of bimetallic and single atom alloys using the GBR model.22
The train and test errors (RMSE) obtained for each ML model over the 25 000 trials are given in Table 2. During each of these 25 000 trials, the data were randomly split into train and test data. The process of splitting data into training and test sets in various ratios is fundamental in machine learning.73,74 Different ratios allow us to assess how well a model generalizes to new data, strike a balance between bias and variance, and understand overfitting or underfitting. This practice aids in tuning and optimizing models, especially when dealing with limited data.75,76 The model was constructed using the training data, and the training error was calculated on the same data. Conversely, the testing error was calculated using the testing data. An examination of the train and test errors (RMSE) facilitated the comparison and selection of models based on their predictive accuracy. All models consistently demonstrated optimal performance at a test/train ratio of 30/70. The R2 scores for all the ML models are also given in Table 2.
Table 2 Train and test RMSEs (eV) and R2 scores of the ML models for CO and OH binding energy predictions

| S. no. | Model | CO: train error mean (min, max) | CO: test error mean (min, max) | CO: R2 score | OH: train error mean (min, max) | OH: test error mean (min, max) | OH: R2 score |
|---|---|---|---|---|---|---|---|
| | | (i) | (ii) | (iii) | (iv) | (v) | (vi) |
| 1 | LR | 0.149 (0.149, 0.149) | 0.139 (0.139, 0.139) | 0.902 | 0.076 (0.076, 0.076) | 0.557 (0.557, 0.557) | 0.860 |
| 2 | KNN | 0.025 (0.00, 0.180) | 0.232 (0.215, 0.307) | 0.781 | 0.00 (0.00, 0.00) | 0.476 (0.451, 0.558) | 0.421 |
| 3 | SVR | 0.106 (0.106, 0.106) | 0.123 (0.122, 0.124) | 0.902 | 0.128 (0.117, 0.133) | 0.309 (0.303, 0.313) | 0.841 |
| 4 | KRR | 0.152 (0.150, 0.154) | 0.146 (0.142, 0.150) | 0.933 | 0.129 (0.091, 0.130) | 0.313 (0.310, 0.395) | 0.659 |
| 5 | RFR | 0.058 (0.049, 0.079) | 0.089 (0.086, 0.095) | 0.964 | 0.093 (0.086, 0.105) | 0.305 (0.282, 0.319) | 0.809 |
| 6 | ETR | 0.030 (0.004, 0.038) | 0.091 (0.087, 0.093) | 0.965 | 0.069 (0.009, 0.120) | 0.235 (0.232, 0.249) | 0.836 |
| 7 | GBR | 0.005 (0.00, 0.046) | 0.099 (0.095, 0.110) | 0.956 | 0.00 (0.00, 0.002) | 0.304 (0.276, 0.353) | 0.872 |
| 8 | xGBR | 0.004 (0.001, 0.014) | 0.091 (0.086, 0.097) | 0.970 | 0.010 (0.001, 0.028) | 0.196 (0.180, 0.216) | 0.890 |
LR is a basic statistical technique that models the association between a dependent variable and one or more independent variables by minimizing the sum of squared residuals.77 Its straightforward nature and ease of interpretation make it a popular choice across diverse scientific fields.78 Using the LR model, test RMSE values of 0.139 eV and 0.557 eV were obtained for predicting CO and OH binding energies on (111)-terminated A3B type bimetallic alloys, as shown in Table 2, entries 1(ii) and (v). For CO binding energies, the RMSE of the LR model is lower than those of the KNN (0.232 eV) and KRR (0.146 eV) models (Table 2, entries 2(ii) and 4(ii), respectively). However, the RMSE for OH binding (0.557 eV, Table 2, entry 1(v)) is high compared to its counterparts, which can be attributed to the inherent limitations of linear regression. LR excels in capturing linear relationships between features and the target variable.79,80 However, the prediction of binding energy entails complex nonlinear associations between the features and the response variable, which explains the observed variation in RMSE values.
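As a minimal sketch (assuming the `X_train`/`X_test` split introduced above), the LR baseline and its two metrics reduce to a few lines in Scikit-Learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit the linear baseline and score it on the held-out 30% test set
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("test RMSE (eV):", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 score:", r2_score(y_test, y_pred))
```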
KNN is a non-parametric, distance-based model that looks at the properties of the nearest neighbours in the training set to predict a target value.81,82 It is particularly useful for capturing complex, nonlinear relationships in data without requiring prior assumptions about the underlying distribution, though its performance can be sensitive to the choice of distance metric and the number of neighbours.83 Moreover, for high-dimensional data, calculating the distance between the target and the nearest-neighbour data points becomes computationally challenging and makes KNN inefficient. In this work, too, KNN yielded high RMSEs for predicting CO (0.232 eV) and OH (0.476 eV) binding energies compared to the other models, as shown in Table 2, entries 2(ii) and (v).
In contrast, kernel-based methods such as the SVR model use a subset of the training data, called support vectors, for making predictions.84,85 SVR maps input features into a high-dimensional space and finds an optimal hyperplane that minimizes prediction errors within a defined margin.86 Due to its ability to handle high-dimensional data and nonlinear relationships using kernel functions, SVR is widely applied to scientific and engineering problems.87 It shows reduced sensitivity to input dimensionality and often achieves a lower generalization error.88 In our case, SVR outperformed the linear model, with test RMSEs of 0.123 and 0.309 eV for predicting CO and OH binding energies, as shown in Table 2, entries 3(ii) and (v) respectively. However, it still falls short of the tree-based models.
On the other hand, KRR, another kernel-based method, uses a different loss function from SVR and is known to provide closed-form estimates.89,90 KRR is a nonlinear regression method that combines ridge regression with kernel functions, enabling it to model complex relationships by mapping data into higher-dimensional spaces.91,92 By incorporating an ℓ2-norm regularization term, KRR helps prevent overfitting while maintaining flexibility in capturing intricate patterns, making it particularly effective for small to medium-sized datasets.93–95 For CO binding energy, however, KRR gave a training error (RMSE: 0.152 eV) slightly higher than the testing error (RMSE: 0.146 eV), as shown in Table 2, entries 4(i) and (ii). Despite KRR's capability to handle nonlinear relationships, its performance in predicting OH binding energy (RMSE: train/test = 0.129 eV/0.313 eV, Table 2, entries 4(iv) and (v)) was less than optimal when compared to other models.
However, tree-based ensemble methods such as RFR, ETR, GBR and xGBR were found to outperform their counterparts, such as the linear model (LR), KNN, and kernel models (SVR and KRR). This was due to their robustness to noise, ability to fit non-linear relationships, scalability in high-dimensional spaces, interpretability through feature importance insights and ease of parameter tuning.96–98 RFR is a powerful ensemble machine learning technique that builds multiple decision trees, each trained on random data subsets and features, to improve predictive performance and reduce overfitting through bagging and feature randomness.99,100 By averaging tree predictions, it effectively captures complex, nonlinear relationships. It handles both numerical and categorical data and provides a feature importance mechanism.101 However, it can be computationally expensive and may struggle with high-dimensional, sparse data such as text.102
ETR is an ensemble learning method used for regression tasks that builds multiple decision trees, like RFR, but with additional randomness to enhance robustness and reduce overfitting. Instead of selecting the best split at each node, it randomly selects a split for each feature, injecting randomness both at the sample and feature levels. This allows it to capture complex, nonlinear relationships efficiently. While it offers faster training than RFR due to the randomness in splitting, it may lead to slight performance trade-offs.103 It is effective with numerical and categorical data, handles missing data, and provides feature importance estimates but may underperform on high-dimensional sparse data, such as text.104
GBR is a powerful ensemble learning method used for both regression and classification tasks. It sequentially adds weak learners, typically decision trees, to correct the errors of the previous ones, with each new model focusing on the residuals of the previous predictions. The key idea behind GBR is that combining multiple weak learners creates a strong predictive model, leveraging the principle of boosting.105 The algorithm minimizes the loss function by approximating the gradient of the residuals with respect to the model parameters, effectively improving prediction accuracy. While GBR is highly effective at capturing complex, nonlinear relationships and can handle both categorical and numerical data, it can be prone to overfitting, especially with noisy data, and requires careful hyperparameter tuning, such as the number of estimators, tree depth, and learning rate.106
On the other hand, xGBR is an efficient and scalable version of GBR that builds an ensemble of decision trees, with each tree correcting the errors of the previous ones. It improves traditional gradient boosting by adding regularization (L1 and L2) to reduce overfitting, as well as incorporating parallelization, handling missing values, and optimizing split finding for faster computation.106,107
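To make the comparison tangible, the sketch below fits the four tree-based ensembles on a single split and reports the train/test RMSEs. The hyperparameter values are taken from the tuned CO column of Table 1 (alpha and a few minor settings omitted for brevity), and `X_train`/`X_test` are assumed from the earlier split:

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor)
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

models = {
    "RFR": RandomForestRegressor(max_depth=14, n_estimators=400),
    "ETR": ExtraTreesRegressor(max_depth=10, n_estimators=120, random_state=42),
    "GBR": GradientBoostingRegressor(learning_rate=0.069, n_estimators=250,
                                     random_state=20),
    "xGBR": XGBRegressor(max_depth=11, learning_rate=0.02, n_estimators=625,
                         min_child_weight=3),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    for label, X_, y_ in (("train", X_train, y_train), ("test", X_test, y_test)):
        rmse = np.sqrt(mean_squared_error(y_, model.predict(X_)))
        print(f"{name} {label} RMSE: {rmse:.3f} eV")
```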
The above models produce more accurate and robust predictions by combining the predictions of weak learners, such as decision trees, to construct a strong estimator. Each model differs in how the decision trees are constructed to build the ensemble. Remarkably, low RMSE values were obtained for all four models, of which RFR exhibited the lowest RMSE (0.089 eV) for predicting CO binding energy, as shown in Table 2, entry 5(ii), suggesting its proficiency in capturing the underlying patterns in the data. Similarly, the ETR and xGBR models also showed a low RMSE of 0.091 eV (Table 2, entries 6(ii) and 8(ii) respectively). On the other hand, the GBR model gave an RMSE of 0.099 eV (Table 2, entry 7(ii)) for CO binding energy. Although RFR and ETR demonstrated superior performance in terms of RMSE values, our study uncovered a significant pattern wherein these models showed an inclination towards overfitting, particularly noticeable in the prediction of CO binding energies. Given the constraints of a small dataset, a large number of input features might have led to overfitting. This can be clearly observed from the inflated ML predictions (blue lines) compared to the DFT values (orange line) shown in Fig. 3(e)–(g). In contrast, the xGBR predicted values were closer to the DFT calculated values, as shown in Fig. 3(h). The test RMSE obtained for xGBR for CO binding (0.091 eV, Table 2, entry 8(ii)) was the second lowest among all the models, after RFR. Thus, xGBR was finally selected due to its ability to produce predictions that closely align with the DFT calculated values. This choice was further supported by the model's capacity to mitigate inherent prediction biases in the dataset. Prior studies suggest that boosting algorithms, such as xGBR, are generally more effective at managing systematic prediction bias than bagging-based models like RFR and ETR.108 The test RMSE of 0.091 eV obtained using the xGBR model for predicting CO binding energy on (111)-terminated A3B alloys in this study is much lower than those reported in previous studies. For example, Li et al. obtained an RMSE of 0.12 eV for CO binding energy prediction on (100)-terminated multi-metallic Cu catalysts using an artificial neural network.71 In another study, a neural-network model trained with all available datasets of bimetallic catalysts predicted CO binding energy on Cu-based core–shell alloys (Cu3B-A@CuML) with an RMSE of 0.13 eV.109 Zhong et al., using the random forest regression algorithm, also predicted CO binding energy on Cu-based alloys with an RMSE of 0.1 eV.110
Similarly, for OH binding energy predictions, all models were found to overfit (Fig. 4(a)–(e)) except ETR, GBR and xGBR, as shown in Fig. 4(f)–(h). From Fig. 4(h), it is clear that the xGBR model's predictions (blue lines) and the DFT values (orange lines) are very close compared to the other models. Moreover, xGBR gave the lowest RMSE value of 0.196 eV, as shown in Table 2, entry 8(v). This highlights the xGBR model's ability to deliver more consistent and dependable predictions, particularly when dealing with limited datasets. A similar RMSE value of 0.188 eV was obtained for predicting OH binding energy on (111)-terminated intermetallic (A3B) and near-surface alloys using deep learning algorithms integrated with the well-established d-band theory.64 Furthermore, the RMSE values reported in this study using the xGBR model for predicting CO (0.091 eV) and OH (0.196 eV) binding energies on (111)-terminated A3B bimetallic alloys are lower than those reported by Zheng et al. (0.22 eV for CO and 0.24 eV for OH).44 The accuracy of the ML models was further evaluated using R2 scores, as presented in Table 2. Higher R2 scores, closer to 1, indicate better model performance. For CO binding energies, the xGBR model demonstrated a higher R2 score (0.970) than the other models tested (LR = 0.902, KNN = 0.781, SVR = 0.902, KRR = 0.933, RFR = 0.964, ETR = 0.965 and GBR = 0.956), as shown in Table 2, entries 1–7(iii). Similarly, the xGBR model had the highest R2 score of 0.890 for OH binding energies, as shown in Table 2, entry 8(vi). The R2 score for CO binding energy predicted by the xGBR model (0.970) in this study is comparable to that reported by Salomone et al. with the GBR model (0.970).111
Artificial Neural Networks (ANNs), inspired by biological neural networks in the brain, consist of multiple interconnected nodes that are loosely modelled on neurons.112 Since they can fit nonlinear and complex data, are robust to noise and learn adaptively, they have proven predictive in solving various complex real-world problems. In this study, ANNs were tested with extensive parameter tuning, including the layer configuration, the number of neurons per layer and a range of learning rates. However, test RMSE values of only 0.387 and 0.406 eV were achieved for CO and OH binding, higher than those of the tree-based ensemble models and even the simple statistical models. This may be attributed to the very small dataset, which can lead to poor ANN performance. Previous studies have also shown tree-based models to be the best at predicting C and O binding energies on A3B alloys.22
An ML-based study conducted by Zong et al. also found the xGBR model to be the best at predicting the hydrogenolysis barrier of large hydrocarbons without the need for additional DFT features.113 In another study, conducted by Praveen et al., the xGBR model in combination with a tree booster displayed the best performance for predicting the chemisorption energy of several gas-phase adsorbates on different metal facets.114 xGBR offers several advantages, including high precision and speed, parallel processing abilities, effective handling of missing values, and high customizability.115–117 A schematic illustration of the process flow for the xGBR model is shown in Fig. 5. All these advantages, along with a lower RMSE value and a high R2 score, make xGBR the best choice in this study for predicting the descriptor binding energies. Thus, only xGBR was considered for further analysis.
For a small dataset, like that used in the present study, choosing an optimum split ratio becomes crucial. We therefore tried different test/train ratios of 15/85, 20/80, 25/75, 30/70, and 50/50 and performed hyperparameter tuning to improve the RMSEs at each split ratio. As shown in Table 3 (columns (i) and (ii)), the test error decreases as the test/train ratio increases to 30/70. However, a further increase in the ratio resulted in an increase in the error. This is because when the training data are small, the model may fail to capture essential patterns and will not generalize well, whereas when the testing data are small, the reliability of the predictions is compromised. In the present study, with a 70% training set, an optimum RMSE of 0.091 eV (Table 3, entry 4(ii)) was obtained for CO binding energy. Similar trends were observed for the OH binding energy predictions, wherein an RMSE of 0.196 eV was obtained for the 30/70 test/train ratio (Table 3, entry 4(iv)). Fig. 6 and 7 present the deviation of the ML predicted CO and OH binding energy values from the DFT calculated values, respectively, for different test/train ratios.
Table 3 Train and test RMSEs (eV) for CO and OH binding energy predictions using the xGBR model at different test/train split ratios

| S. no. | Test/train split | CO: train error | CO: test error | OH: train error | OH: test error |
|---|---|---|---|---|---|
| | | (i) | (ii) | (iii) | (iv) |
| 1 | 15%/85% | 0.006 | 0.108 | 0.009 | 0.279 |
| 2 | 20%/80% | 0.007 | 0.104 | 0.006 | 0.259 |
| 3 | 25%/75% | 0.004 | 0.099 | 0.018 | 0.225 |
| 4 | 30%/70% | 0.004 | 0.091 | 0.010 | 0.196 |
| 5 | 50%/50% | 0.008 | 0.112 | 0.039 | 0.235 |
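The scan behind Table 3 can be sketched as a simple loop over test fractions; `tune_xgbr` is a hypothetical helper wrapping the randomized search shown earlier:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

for test_frac in (0.15, 0.20, 0.25, 0.30, 0.50):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_frac)
    model = tune_xgbr(X_tr, y_tr)  # hypothetical helper: RCV-tuned xGBR
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"test/train {test_frac:.0%}/{1 - test_frac:.0%}: "
          f"train {train_rmse:.3f} eV, test {test_rmse:.3f} eV")
```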
Fig. 8 Correlation plots for CO binding energy for: (a) "A" metal and (b) "B" metal, and OH binding energy for: (c) "A" metal and (d) "B" metal in (111)-terminated A3B bimetallic alloys.
However, since the database in the present study is relatively small (fewer than 200 data points), a large number of input features may lead to overfitting. Thus, a separate analysis was performed in which the least important features were removed from the model and the test error was recomputed. The test error obtained for CO binding energy prediction with the xGBR model using all 36 features was 0.091 eV (Table S2, entry 1, ESI†). Upon removing six highly correlated features, the test error remained the same (Table S2, entry 2, ESI†). Upon reintroducing these six features as a single component using PCA, the error increased to 0.098 eV (Table S2, entry 3, ESI†). For the xGBR model built with only the top 10 and top 20 features, the test error remained higher (0.096 and 0.093 eV, Table S2, entries 4 and 5, ESI†), suggesting that the full set of 36 features predicts the binding energy better. A similar observation was made by Shivam et al., wherein a set of 27 features was found to better predict the binding energy of O, and the test error increased upon removing features.22 This suggests that the less important features still carry meaningful information, which enhances the model's robustness and accuracy. Furthermore, using PCA on the reduced feature sets (top 10 + 8 PCA components and top 20 + 5 PCA components), the model exhibited a further increase in RMSE (0.134 and 0.107 eV; Table S2, entries 6 and 7, ESI†). Similarly, the test error obtained for OH binding energy prediction with the xGBR model using all 36 features was the lowest, at 0.196 eV (Table S3, entry 1, ESI†), compared to the others listed in Table S3 (ESI†). Therefore, the model captures information better from individual features than from combined features (i.e., components). A similar analysis performed for RFR and ETR confirmed that these models also leveraged information from the less important features: reducing the number of features led to a slight increase in RMSE, as shown in Table S4 (ESI†), justifying the retention of all available features while managing the risk of overfitting. This analysis indicated that while some degree of overfitting was inevitable due to data limitations, it remained controlled across all models.
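The feature-reduction experiments described above can be sketched as follows; `df` (the 36-column feature table) and `correlated_cols` (the six highly correlated columns) are assumed inputs, and the exact column choices follow Table S2 rather than this illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Collapse the six highly correlated columns into a single principal
# component and append it to the remaining 30 features
correlated = df[correlated_cols]             # assumed: 6 column names
rest = df.drop(columns=correlated_cols)      # remaining 30 features

component = PCA(n_components=1).fit_transform(
    StandardScaler().fit_transform(correlated))
X_reduced = np.hstack([rest.to_numpy(), component])   # shape: (n_alloys, 31)
```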
Fig. 9 shows the feature importance for predicting CO and OH binding energies with the xGBR model, averaged over the 25 000 trials. Notably, for CO binding energy on bimetallic alloys, the surface energy of the main metal has the highest importance (Fig. 9(a)), followed by the main metal's melting point, which is consistent with the correlation matrix shown in Fig. 8(a). Similarly, for OH binding energy, the surface energy of the main metal again has the highest importance, as shown in Fig. 9(b). This is because surface energies are intrinsically linked to the coordinative unsaturation of the surface metal atoms; typically, a system with higher surface energy shows increased reactivity. The melting point, on the other hand, reflects the energy required to break bonds when an element goes from the solid to the liquid state. Stronger bonds result in a higher melting point, indicating that elements with higher melting points may form alloys with higher binding energies. Salomone et al. also found surface energy to be the most important feature for predicting CO binding energy on Cu-based alloys.111 Similarly, Takigawa et al. found surface energy to be the most important feature in the ML prediction of C and CH binding energies over Cu-based alloys.48
Fig. 9 Feature importance plots for predicting: (a) CO and (b) OH binding energies using the xGBR model.
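A sketch of how such averaged importances can be obtained; `make_split_and_tune` is a hypothetical helper returning a fitted xGBR model for one random split, and `feature_names` lists the 36 Table S1 columns:

```python
import numpy as np

importances = []
for _ in range(n_trials):                 # 25 000 trials in the paper
    model = make_split_and_tune(X, y)     # hypothetical: split + RCV-tuned xGBR
    importances.append(model.feature_importances_)

importances = np.asarray(importances)
mean_imp = importances.mean(axis=0)       # the bars plotted in Fig. 9
std_imp = importances.std(axis=0)         # the spread shown in Fig. S1
ranked = sorted(zip(feature_names, mean_imp, std_imp), key=lambda t: -t[1])
for name, m, s in ranked[:5]:
    print(f"{name}: {m:.3f} +/- {s:.3f}")
```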
In addition to the main metal's surface energy, the main metal's electronegativity and the guest metal's group were found to be important features for predicting OH binding energy on bimetallic alloy surfaces, as shown in Fig. 9(b). Features like electronegativity play a significant role in the ease of electron transfer between the surface metal atoms and the adsorbate, thereby influencing chemical bonding. The metal–OH bond strength is derived from the electronegativity of the elements: the binding energy decreases from left to right across the periodic table due to the higher electronegativity of the elements towards the right.83 The metal–oxygen binding energies also vary strongly with the group number83 due to the differential occupation of bonding δ-orbitals and antibonding π-orbitals. In general, the group of an element in the periodic table can influence its chemical properties, including its tendency to form bonds with other elements. Elements from the same group have similar electronic configurations, leading to similar chemical properties; therefore, the group number of an element can provide insight into the potential binding energy in an alloy. Ultimately, all the above-mentioned properties can directly or indirectly influence the binding energy of the alloy.
While ML models are often employed as data-driven black-box models, the feature importance offers an added advantage: it captures the underlying physics of the system, providing a deeper understanding of the model's predictions and allowing a more comprehensive interpretation of the model beyond its predictive capabilities. Furthermore, plots of the standard deviation of the feature importance for CO and OH binding energies are shown in Fig. S1(a) and (b) (ESI†), respectively. These plots provide an estimate of the uncertainty, or variability, of the feature importance scores.120 The standard deviations for the highly important features, such as surface energy and electronegativity, were found to be low (approximately 0.04), indicating the reliability of our model's predictions (Fig. S1(a) and (b), ESI†).
| S. no. | Alloy | CO BE (DFT) (eV) | CO BE (ML predicted) (eV) | Deviation (eV) | OH BE (DFT) (eV) | OH BE (ML predicted) (eV) | Deviation (eV) |
|---|---|---|---|---|---|---|---|
| 1 | Cu3Au | −0.57 | −0.56 | −0.01 | 0.54 | 0.57 | −0.03 |
| 2 | Cu3Ga | −0.52 | −0.58 | 0.06 | 0.19 | 0.20 | −0.01 |
| 3 | Cu3Zn | −0.49 | −0.49 | 0.00 | 0.33 | 0.32 | 0.01 |
| 4 | Cu3Pd | −0.48 | −0.43 | −0.05 | 0.46 | 0.43 | 0.03 |
To further assess the robustness of the model, additional validation tests were conducted using independent datasets obtained from Shivam et al.22 to predict C and O binding energy on AA-terminated A3B type 211 surfaces. The xGBR model yielded reliable predictions on testing it with underrepresented alloy compositions (AA-terminated A3B type 211 surface), which were not well represented in the training data. The train and test RMSE values of 0.0732 and 0.411 eV were obtained for C binding energy prediction as shown in Table S6 (ESI†), which were higher compared to 0.0003 and 0.340 eV obtained by Shivam et al.22 using the GBR model. In contrast, the xGBR model outperformed the results obtained by the previous GBR model for O binding energy predictions as shown in Table S6 (ESI†), wherein the test RMSE decreased from 0.310 to 0.289 eV although the train RMSE increased from 0.0003 to 0.0035 eV. The increase in train RMSE indicates that the xGBR model experienced less overfitting compared to the previous GBR model, which had a near-zero train RMSE (0.0003) and likely memorized the training data. Further improvements in the RMSE values of the xGBR model for C and O binding energy prediction can be expected from intensive hyperparameter tuning and feature selection through PCA. This demonstrates that the xGBR model provides reliable and trustworthy predictions, especially for alloys underrepresented in the dataset.
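A minimal sketch of such an external-validation check, assuming the independent C/O dataset of ref. 22 has been loaded into `X_ext`/`y_ext` (in practice the hyperparameters would be re-tuned as described above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# 30/70 split of the independent (211)-surface C/O dataset
X_tr, X_te, y_tr, y_te = train_test_split(X_ext, y_ext, test_size=0.30)
model = XGBRegressor().fit(X_tr, y_tr)   # hyperparameters re-tuned in practice
print("train RMSE (eV):", np.sqrt(mean_squared_error(y_tr, model.predict(X_tr))))
print("test RMSE (eV):", np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
```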
Furthermore, the model performance was observed to be better for CO (RMSE: 0.091 eV) with 156 data points, compared to OH (RMSE: 0.196 eV) with only 69 data points, highlighting the impact of the dataset size. However, the dataset size is not the only factor influencing model performance. When a switch was made from (111)-terminated A3B type bimetallic alloys to AA-terminated A3B alloys on the (211) surface and the target variable was also changed from CO to C and OH to O for additional validation tests using independent/unseen datasets, the xGBR model still delivered reasonable predictions (0.411 and 0.289 eV, Table S6, ESI†), particularly for O even with these substantial changes in the dataset. This demonstrates the flexibility of the xGBR model, which when appropriately tuned and explored further can perform well and yield reliable results even with limited data.
Overall, from this study xGBR turns out to be the best predictive model for CO and OH binding energies on Cu-based bimetallic alloys with the small available dataset. The high level of accuracy and robustness shown by the xGBR model will enable the high throughput screening of bimetallic alloys to accelerate the catalyst discovery for various catalytic reactions such as the reverse water gas shift reaction, CO or CO2 reduction, methanol electro-oxidation, formic acid decomposition, etc.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4cp04887c
‡ These authors contributed equally to this work.
This journal is © the Owner Societies 2025