Shivam Saxena‡a, Tuhin Suvra Khan‡*a, Fatima Jalidac, Manojkumar Ramtekeb and M. Ali Haider*a
aRenewable Energy and Chemicals Lab, Department of Chemical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India. E-mail: tuhinsk@iitd.ac.in; haider@iitd.ac.in
bDepartment of Chemical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
cDepartment of Chemical Engineering, National Institute of Technology Srinagar, Srinagar, Jammu and Kashmir, India
First published on 13th September 2019
The advent of machine learning (ML) techniques for solving problems in materials science and chemical engineering is raising expectations of faster predictions of material properties. For heterogeneous catalysis applications, relying on the age-old Sabatier principle, an ab initio in silico high throughput screening of catalyst materials is envisaged, wherein ML based methods show potential to significantly reduce the experimental as well as computational cost. The availability of ML algorithms (in open source libraries like Scikit-Learn) and materials databases (like CatApp and Materials Project) further supports this goal. Using these resources, ML models are developed to predict the binding energies of oxygen and carbon on bimetallic alloys and Cu-based single atom alloys (SAAs) using features of the metals that are readily available in the periodic table. Several ML models for predicting the oxygen binding energy on AA terminated A3B alloys are analysed, and gradient boosting regression (GBR) is observed to give superior performance, with a root mean square error of 0.31 eV on the test data. In addition, GBR based ML models are demonstrated to predict the oxygen and carbon binding energies of AB terminated A3B alloys with test errors of 0.38 eV and 0.35 eV respectively. The binding energies of oxygen and carbon on Cu-based SAAs are predicted with test errors of 0.36 eV and 0.37 eV respectively. Moreover, the computational time for predicting a binding energy using ML is 0.0006 s on a dual-core laptop, which is significantly less than the time required for DFT calculations. DFT and ML calculated carbon and oxygen binding energies for the bimetallic A3B alloys are further used in an ab initio microkinetic model to calculate the turnover frequency (TOF) for ethanol decomposition and non-oxidative dehydrogenation reactions. The TOFs over bimetallic alloys obtained using the ML calculated binding energies follow the same trend as those obtained using the DFT energies, with the TOF values being the same or within an order of magnitude. This shows that catalyst screening using binding energy as a descriptor can be performed using ML models, bypassing time- and resource-consuming DFT calculations. This is likely to speed up the process of novel catalyst discovery.
The proposed approach to catalyst design is, essentially, a recipe for in silico high throughput screening of catalyst materials to identify bimetallic alloys that are expected to produce the desired reactivity in experiments, reducing the number of experiments required in a typical trial-and-error approach to synthesizing active catalysts.4,6,11,19–25 However, in this recipe, the time required for DFT simulations is, in general, a limiting factor, since the quantum mechanical approach of scanning the binding sites of an adsorbate on the catalyst surface and calculating the corresponding binding energies is computationally expensive.26–29
To reduce the number of computationally expensive DFT calculations, linear scaling relationships for the binding energies of adsorbed species10,30–33 as well as Brønsted–Evans–Polanyi (BEP) scaling relationships10,17,34,35 between activation energy and reaction energy are widely used in computational heterogeneous catalysis. To speed up the DFT binding energy calculation, the neural network based potentials in the open source Atomistic Machine-learning Package (AMP) developed by Peterson et al.36 and the perturbation theory based alchemy approach of Keith et al.37 have been shown to be thousands of times faster than conventional DFT and can be important tools for high-throughput screening of heterogeneous catalysts. Herein, an alternative machine learning (ML) approach is presented which can provide the same information on a significantly reduced time scale.
Recent progress in integrating machine learning with data obtained from DFT calculations has opened up the possibility of a whole new way of high throughput catalyst screening. Efforts to integrate ML and heterogeneous catalysis commonly apply Artificial Neural Networks (ANNs).38 Ulissi et al.39–42 and Xin et al.26,27,43 used ANN based models for the prediction of the activity of transition metal based heterogeneous catalysts, whereas Kulik and co-workers44–47 applied the ANN method to transition metal based organometallics for catalysis as well as to screen spin crossover complexes for potential application as data storage devices, light sensing switches, etc. However, the challenge with training conventional ANNs is the high computation time, which increases further with a larger number of hidden layers and neurons.48 Another disadvantage of ANNs is their low interpretability. With algorithmic developments, improved ML algorithms are available that are sometimes more accurate and much faster than ANNs. One such algorithm is Gradient Boosting Decision Trees (GBDTs),49 which retains the important advantages of decision trees while using “Boosting” to overcome their biggest drawback: poor predictive performance. Decision trees are adaptable and easy to interpret, can handle different types of variables, need very little pre-processing of data and can fit nonlinear relationships accurately.50 Boosting is a technique that combines many weak learners into a single strong learner.51 These advantages have made GBDTs one of the most widely used ML algorithms.49
In the selection of an ML model, features are an important constituent.27 Once the dataset of various catalysts is available, it has to be described by features that uniquely represent the catalyst and relate it to the target variable. There have been important contributions in this regard to predict the binding energies as a target variable in order to screen bimetallic catalysts. ANNs have been used to predict binding energies using the electronic properties of alloys like d-band centers as features for the model.26,27 Tree based ensemble algorithms have shown significantly accurate prediction of binding energies of CH4 related species on Cu-based alloys using only readily available physical properties of metals as features.52
Here, in this study, an ML based model is developed to predict the binding energy of ‘descriptor’ oxygen and carbon atoms on bimetallic alloys of the form A3B (211 surface). Various ML algorithms are evaluated to put forward the advantage of GBR over others for the supervised regression problem. Additionally, the GBR model is developed to predict the binding energies of oxygen and carbon over copper based single atom alloys (SAAs). The ML model developed using readily available properties of metals as features is observed to predict the binding energy with accuracy equivalent to that of DFT calculations. Also, the computation time required for this ML model prediction is negligible as compared to DFT calculations. The ML predicted binding energies were further used with an ab initio microkinetic model (MKM) to calculate the catalytic rates for two important catalytic reactions: ethanol decomposition13 and non-oxidative dehydrogenation (NODH) reactions15 over the A3B bimetallic alloys. Both reactions were earlier studied by us in detail by constructing MKMs for understanding the trend in the catalytic activity of transition metals for C–O bond scission in ethanol13 and NODH of ethanol to produce acetaldehyde15 on undercoordinated step (211) sites. Here in this study, the ML calculated binding energies showed similar predictions for the reactivity of bimetallic alloys to those earlier shown by ab initio MKMs. These findings can ultimately be extended to other metal alloys and catalytic reactions to provide a faster way of catalyst screening.
Each bimetallic alloy is represented by a set of features that uniquely describe it. For A3B bimetallic alloys, a set of 27 features is chosen to include the physical properties of both the metals in the alloy. These properties are readily available from the periodic table and other databases.57–59 Overall, each alloy is described by a feature vector comprising 27 values. For Cu-based SAAs, each alloy is represented by a set of 12 features that include the physical properties of the single atom in the alloy. Features related to Cu are not included as they would be constant for all the SAAs used in the study.
Except for ANNs, all ML algorithms are implemented using the widely used open-source library Scikit-Learn.60 ANNs are implemented using Keras61 with a TensorFlow62 backend. For evaluating the predictive power of the ML algorithms, the dataset is first split into two parts, train data and test data. All the ML models are built with all the features as input. The models are tested using 5-fold cross validation and by repeating the random split into train and test data (80%/20%) 100 times so as to avoid any bias from a particular split. The accuracy of the predictions is calculated by averaging the root mean square errors (RMSEs) of those 100 trials. Since the values of the hyperparameters are expected to affect the accuracy of the model, a range of hyperparameters is tested for each model; the values tested are listed in Table 1. The optimum set of hyperparameters for each ML model is obtained via grid search in Scikit-Learn using 10-fold cross validation. The RMSE for the optimum hyperparameter values of each model for predicting the oxygen binding energy on AA terminated alloys is then compared to determine the best ML model. Models like Linear Regression, K-Nearest Regression, Support Vector Regression and Neural Networks need feature scaling.63 For these algorithms, the features are standardized by removing the mean and scaling to unit variance.
ML algorithm | Hyperparameters tested |
---|---|
Linear regression | None (no hyperparameters to tune) |
Ridge regression | Alpha = [0.1, 0.5, 0.8, 1, 10, 100] |
K-nearest regression | n_neighbors = [5, 10, 20], weights = [‘uniform’, ‘distance’] |
Support vector regression | Kernel = [‘rbf’], C = [1, 10, 100, 1000, 10000, 100000], gamma = [0.1, 0.01, 0.001, 0.0001, 0.00001] |
Random forest regressor | n_estimators = [50, 100, 200, 300, 400, 500, 600, 700, 800], max_depth = [2, 3, 4, 5, 6, 7, 8] |
Extra tree regression | n_estimators = [50, 100, 200, 300, 400, 500, 600, 700, 800], max_depth = [2, 3, 4, 5, 6, 7, 8] |
Gradient boosting regression | n_estimators = [50, 100, 200, 300, 400, 500, 600, 700, 800], max_depth = [2, 3, 4, 5, 6, 7, 8], learning_rate = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8] |
Artificial neural network | Number of layers = 3, neurons in each hidden layer = 50, activation = [tanh, ReLu], loss = [mean_squared_error, mean_absolute_error], optimizer = [sgd, RMSprop] |
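The hyperparameter search described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the data are synthetic stand-ins for the alloy dataset, the grid is a reduced subset of the GBR row of Table 1 so that it runs quickly, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the alloy dataset (NOT the paper's data):
# 120 "alloys", 27 features each, with a non-linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 27))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=120)

# 80%/20% train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Reduced subset of the Table 1 grid for GBR, to keep the example quick.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

# 10-fold cross validation on the training data only.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=10,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# RMSE of the refitted best model on the held-out test data.
test_rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))
print(best_params, round(test_rmse, 3))
```

Tuning on the training portion only keeps the test data untouched until the final error estimate.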
The accuracy of the ML model in predicting the TOFs of the catalytic reactions considered is finally evaluated using the MKM implemented with the descriptor based analysis tool CatMAP.6 In the CatMAP software package, a multi-dimensional Newton's root finding method from the Python mpmath library is used to obtain steady-state solutions of the governing differential equations, and the production rate is reported as the catalytic TOF. The steady state kinetics in the ab initio MKM method are determined using the mean field approach by solving all the rate equations without making any assumption about the surface coverage or the rate-determining step. The MKM is constructed using the reaction energetics obtained from DFT calculations and also with ML predicted binding energies. The binding energies of the adsorbed species and transition states and the energies of the gas phase species used in the model are obtained from our previous MKM studies of ethanol over the transition metals.13,15 The gas phase energies of hydrogen, water and methane are taken as the reference for expressing the energies of all species. The methodology and reaction conditions are similar to those of the previous DFT based MKM studies of ethanol decomposition13 and NODH.15 Carbon and oxygen binding energies are taken as the descriptors for both the reaction models. For the ethanol decomposition reaction, a comparison with the ML based model is made under the following reaction conditions: T = 523 K and P = 2 bar with a 1:1 ratio of hydrogen and ethanol in the inlet stream.13 For NODH, the reaction conditions considered are: T = 473 K with 10% conversion of ethanol and P = 1 bar.15
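To make the idea of a mean-field steady state concrete, the sketch below solves a deliberately tiny, hypothetical mechanism with one surface intermediate (adsorption A(g) + * ⇌ A*, followed by A* → P(g) + *). It is not the CatMAP ethanol model: the barriers, the adsorption rate constant and all function names are assumptions chosen only for illustration, and a one-dimensional Newton iteration stands in for the multi-dimensional solver.

```python
import math

KB_EV = 8.617333262e-5   # Boltzmann constant, eV/K
H_EV = 4.135667696e-15   # Planck constant, eV*s


def eyring(ea_ev, temp_k):
    """Transition-state-theory rate constant (1/s) for a barrier ea_ev in eV."""
    return (KB_EV * temp_k / H_EV) * math.exp(-ea_ev / (KB_EV * temp_k))


def steady_state_coverage(k_ads, k_des, k_rxn, pressure,
                          theta0=0.5, tol=1e-12, max_iter=50):
    """Newton iteration on d(theta)/dt = 0 for the one-intermediate mechanism."""
    theta = theta0
    for _ in range(max_iter):
        # Net rate of change of the A* coverage (mean-field rate equation).
        f = k_ads * pressure * (1.0 - theta) - (k_des + k_rxn) * theta
        dfdtheta = -k_ads * pressure - (k_des + k_rxn)
        step = f / dfdtheta
        theta -= step
        if abs(step) < tol:
            break
    return theta


T = 523.0                  # K, illustrative temperature
k_ads = 1.0e3              # hypothetical adsorption rate constant, 1/(s*bar)
k_des = eyring(1.0, T)     # hypothetical 1.0 eV desorption barrier
k_rxn = eyring(1.2, T)     # hypothetical 1.2 eV reaction barrier

theta = steady_state_coverage(k_ads, k_des, k_rxn, pressure=1.0)
tof = k_rxn * theta        # product formation rate per site, 1/s
print(theta, tof)
```

Because this toy rate equation is linear in the coverage, Newton's method converges in a single step; a realistic network with many coupled intermediates is non-linear and needs the iterative multi-dimensional treatment.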
Thus, the prediction of binding energy is a complex non-linear problem. ML algorithms with their ability to learn complex non-linear interactions can therefore be used for predicting the binding energies of these bimetallic alloys.
Supervised learning is a type of ML task where the algorithm learns an inferred function from already labelled data. This inferred function is then used to predict the target value from new data. The “No Free Lunch” theorem in ML states that there is no one model that works best for all problems.66 Hence, it is always advisable to try multiple ML algorithms to identify which model works best for a particular problem. A number of widely used ML algorithms were evaluated which can be classified as linear models, distance based models, support vector machines, tree based ensemble algorithms and neural networks.
For the prediction of oxygen binding energies on AA terminated A3B bimetallic alloys, all models were initially tested with all 27 features as input. The optimum hyperparameters, obtained for each algorithm using grid search with 10-fold cross validation, are given in Table 2. Mean training and testing errors for each algorithm (for the tuned hyperparameter values) are also shown in Table 2, along with the minimum and maximum errors over 100 trials. In each of these 100 trials, the data are split randomly into train and test data. The model is built on the training data; the training error is evaluated on the same training data, while the test error is evaluated on the testing data. The RMSEs in eV for all the algorithms tested are shown in Fig. 4 and listed in Table 2.
ML algorithm | Tuned hyperparameter value | Train error mean (min, max) | Test error mean (min, max) |
---|---|---|---|
Linear regression | None (no hyperparameters to tune) | 0.43 (0.37, 0.46) | 0.55 (0.43, 0.75) |
Ridge regression | Alpha = [0.5] | 0.44 (0.39, 0.48) | 0.53 (0.36, 0.69) |
K-nearest regression | n_neighbors = 5, weights = ‘distance’ | 0 (0, 0) | 0.54 (0.35, 0.77) |
Support vector regression | Kernel = [‘rbf’], C = [1000], gamma = [0.001] | 0.23 (0.17, 0.26) | 0.34 (0.24, 0.53) |
Random forest regression | n_estimators = [400], max_depth = [6] | 0.16 (0.13, 0.18) | 0.35 (0.22, 0.52) |
Extra tree regression | n_estimators = [100], max_depth = [6] | 0.14 (0.10, 0.17) | 0.32 (0.18, 0.47) |
Gradient boosting regression | n_estimators = [400], max_depth = [3], learning_rate = [0.3] | 0.003 (0.003, 0.003) | 0.31 (0.2, 0.44) |
Artificial neural network | Activation = [Relu], loss = [mean_squared_error], optimizer = [sgd] | 0.19 (0.14, 0.23) | 0.39 (0.25, 0.54) |
Fig. 4 The train error and test error for the evaluated ML algorithms for predicting oxygen binding energies on A3B bimetallic alloys.
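The mean (min, max) errors over repeated random splits reported in Table 2 can be obtained with a loop of the following kind. This is a sketch under stated assumptions: the data are synthetic, only three of the eight model families are shown, the number of trials is reduced from 100 to 10 and the GBR uses 100 instead of 400 trees to keep the run short, and `repeated_split_rmse` is a helper name introduced here. The models that need feature scaling are wrapped in a pipeline so that the scaler is fitted on the training portion only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def rmse(y_true, y_pred):
    """Root mean square error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def repeated_split_rmse(model, X, y, n_trials=100, test_size=0.2):
    """Return (mean, min, max) test RMSE over n_trials random splits."""
    errors = []
    for seed in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model.fit(X_tr, y_tr)  # refits from scratch on every split
        errors.append(rmse(y_te, model.predict(X_te)))
    return float(np.mean(errors)), float(np.min(errors)), float(np.max(errors))


# Synthetic stand-in data (NOT the paper's alloy dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 27))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=150)

models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1000, gamma=0.001)),
    "gbr": GradientBoostingRegressor(
        n_estimators=100, max_depth=3, learning_rate=0.3, random_state=0
    ),
}

results = {name: repeated_split_rmse(m, X, y, n_trials=10) for name, m in models.items()}
for name, (mean_err, min_err, max_err) in results.items():
    print(f"{name}: {mean_err:.2f} ({min_err:.2f}, {max_err:.2f})")
```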
Linear models tested include ordinary linear regression (OLR) and ridge regression. The OLR involves predicting the target variable as a linear function of the input features. It can be mathematically represented as
$$y(x, w) = w_0 + w_1 x_1 + \dots + w_n x_n$$
Linear models are easy to build and form the basis of many sophisticated ML algorithms.67 Since the model is a linear function of the input features, it captures only linear relationships between the features and the target value. As discussed before, the prediction of binding energy for a bimetallic alloy is a non-linear problem and hence large test errors of 0.55 eV and 0.53 eV are obtained for OLR and ridge regression respectively, as given in Table 2.
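For reference, ridge regression keeps the same linear form but adds an L2 penalty on the weights to the least-squares objective. Assuming the standard Scikit-Learn formulation, in which the Alpha hyperparameter of Table 1 is the penalty weight α and the intercept w0 is not penalised, the weights are obtained for m training alloys as:

```latex
\min_{w}\; \sum_{i=1}^{m} \bigl( y_i - w_0 - w_1 x_{i1} - \dots - w_n x_{in} \bigr)^2 \;+\; \alpha \sum_{j=1}^{n} w_j^2
```

Setting α = 0 recovers OLR, so the penalty can shrink the weights but cannot introduce the non-linear terms that the binding energy problem requires.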
The distance based model, k-nearest regression, is one of the simplest machine learning models. It is a non-parametric model in which a target is predicted from the properties of its nearest neighbours in the training set.68 Despite being simple and easy to interpret, distance based models have proven successful in various applications.69 However, since the model computes distances every time a prediction needs to be made, it suffers from poor run-time performance. It is also very sensitive to erroneous data and irrelevant features.69 We obtained a large test error of 0.54 eV (Fig. 4 and Table 2) for the prediction of oxygen binding energies, indicating that distance based models do not work well for this problem.
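The zero training error listed for k-nearest regression in Table 2 follows from the ‘distance’ weighting rather than from a good fit: each training point is its own nearest neighbour at zero distance, so its own label dominates the weighted average. The sketch below illustrates this on synthetic data (an assumption; not the alloy dataset).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data with noise in the target.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=80)


def train_rmse_for(weights):
    """Fit k-nearest regression and evaluate it on its own training data."""
    knn = KNeighborsRegressor(n_neighbors=5, weights=weights).fit(X, y)
    return float(np.sqrt(np.mean((knn.predict(X) - y) ** 2)))


train_rmse_distance = train_rmse_for("distance")  # each point reproduces its own label
train_rmse_uniform = train_rmse_for("uniform")    # plain average over 5 neighbours
print(train_rmse_distance, train_rmse_uniform)
```

A training error of zero therefore says nothing about generalisation here, which is why the test error is the relevant figure for this model.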
Support Vector Regression (SVR) is an ML algorithm that uses a high dimensional feature space to predict functions using a set of support vectors. Instead of minimising the training error during learning, it minimises the generalisation error.69 It has been applied successfully to various problems like optical character recognition (OCR) and time series prediction.69 A similar ML algorithm, Kernel Ridge Regression (KRR), was used by Meyer et al.70 for high-throughput screening of transition metal based organometallic catalysts for Suzuki cross-coupling reactions between alkyl substituted ethylene and vinyl bromide. The authors studied >2500 organometallic complexes (6 different transition metals and 91 different ligands) and found 557 catalysts to satisfy the set criterion on the descriptor (reaction energy for the oxidative addition of vinyl bromide to the organometallic catalyst), namely a value between −1.39 and −1.0 eV. The drawbacks of SVR include a high algorithmic complexity and extensive memory requirements,69 whereas KRR is known for non-sparse solutions, leading to scalability problems for large data sets.71,72 The RMSE observed for SVR is 0.34 eV (Table 2), which is much better than that of the linear and distance based models.
The ensemble based algorithms used include Extra Tree Regression (ETR), Random Forest Regression (RFR) and Gradient Boosting Regression (GBR). The underlying goal of all the ensemble algorithms is to combine predictions from several weak estimators to construct a strong estimator. These algorithms differ in how they construct a number of decision trees to eventually build an ensemble. RFR builds ensembles of decision trees where each tree is built on a random selection of examples from the training data. Additionally, RFR adds randomness while constructing these trees. Instead of choosing the most important feature while splitting a node, it chooses the most important feature from a random sample of features.73 ETR adds an extra layer of randomness to RFR.74 The final prediction for a new input is made by averaging the predictions by each of the trees in the ensemble for both the algorithms. These decision tree based ensemble methods can capture linear as well as non-linear complex relationships73 and thus RMSE error values observed for ETR and RFR are 0.32 eV and 0.35 eV respectively (Table 2). These tree based ML algorithms have been shown to be the best in predicting the binding energy of CH4 related species (CH3, CH2, CH, C and H) over Cu based alloys.52
Artificial Neural Networks (ANNs) are inspired by the biological neural networks in the brain.75 They consist of multiple interconnected nodes that are loosely modeled on neurons. Due to their ability to fit non-linear and complex data, their robustness to noise and their adaptive learning, they have proven to be predictive in solving various complex real world problems.76 However, they have been criticised for behaving as a “black box” that is hard to interpret and for requiring high computational resources.77 The observed RMSE for ANNs is 0.39 eV (Table 2), which is higher than that of SVR and the tree based ensemble algorithms.
It can be seen from the error values that GBR performed better than any other model used in the study, with a test error of 0.31 eV (Table 2) for predicting the oxygen binding energy. A simple representation of GBR is shown in Fig. 5. GBR is an ensemble algorithm in which the decision trees are learned sequentially. Initially a weak learner (decision tree) is built, and the model is then improved by adding another learner that is built on the error (also called the residual) of the previous one. In general, each new learner tries to correct the errors of its predecessor.49 The ensemble algorithms improve upon the biggest drawback of decision trees, which is overfitting.49,78 They produce models that are adaptable, easy to interpret and provide better predictions than many ML algorithms.50 Also, the ability of GBR to fit non-linear relationships50 results in its better predictive power for binding energy than many other ML algorithms. These advantages of GBR, coupled with the lowest RMSE obtained in our case, make it the best choice for predicting binding energy. Thus, further analysis has been done using GBR only.
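The sequential residual fitting described above can be written out in a few lines. This is a simplified sketch of gradient boosting with squared loss, not the Scikit-Learn implementation used in the study: the function names, the synthetic data and the default settings are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.3, max_depth=3):
    """Squared-loss boosting: each tree is fitted to the current residuals."""
    base = float(np.mean(y))                 # initial constant prediction
    prediction = np.full(len(y), base)
    trees, train_rmse = [], []
    for _ in range(n_estimators):
        residual = y - prediction            # errors left by the ensemble so far
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
        train_rmse.append(float(np.sqrt(np.mean((y - prediction) ** 2))))
    return base, trees, train_rmse


def predict_boosted(base, trees, X, learning_rate=0.3):
    """Sum the shrunken tree outputs on top of the constant base prediction."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


# Synthetic non-linear stand-in data.
rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(100, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2

base, trees, train_rmse = fit_boosted_trees(X, y)
print(train_rmse[0], train_rmse[-1])
```

With squared loss and a learning rate between 0 and 1, each added tree cannot increase the training error, which is why a long boosted sequence can drive the train error close to zero while the test error levels off.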
The RMSE calculated over 100 trials for GBR is comparable to the error in binding energy calculations via DFT (∼0.3 eV).55,79,80 The training and testing errors are also evaluated with the number of trials increased from 100 to 200 and 300. In each trial, a random split of the data into training and testing data is performed. The RMSEs averaged over 200 and 300 trials were observed to be consistent with those for 100 trials. This indicates that the accuracy of the model remains stable when the number of trials is increased.
Binding energies are generally measured experimentally using temperature programmed desorption (TPD) and adsorption isotherms. More advanced and accurate measurements are now performed on single crystal surfaces, as established in the work of Somorjai,88 Ertl,89,90 Masel,91 King92 and Campbell.93 However, all of these are time consuming, and DFT calculations certainly help in speeding up the process. In this regard, a significant advantage of ML over DFT calculations and tedious experiments is the reduction in both the computational time and resources needed to estimate the binding energies of species on a catalyst surface. The average computational time for the calculation of a 4 × 4 (111) surface having 64 metal atoms is 2100 s with 8 CPUs (2.5 GHz/12-Core with 62 GB RAM), whereas the time taken for systems with adsorbates is on average 4000 s. The time taken for the calculation of gas-phase CH4, H2O and H2 is ∼100 s. In summary, the total computational time for a single adsorption energy calculation is on average about 6000 s, or 100 min, on 8 high performance CPUs. Meanwhile, the prediction of one adsorption energy value using GBR (after the model is built) takes 0.0006 s on a dual-core laptop. Even when the complete computational time of the process is counted, including hyperparameter optimisation and the calculation of the test error for 100 random splits of test/train data, the GBR model for predicting oxygen binding on the dataset of A3B bimetallic alloys takes about 480 s, or 8 min, on a dual-core laptop. Thus, even if a new ML model has to be built for a completely new dataset, ML models would save a great amount of time and resources.
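The timings quoted above can be turned into rough ratios. The short calculation below uses only the numbers stated in the text; the variable names are introduced here, and the comparison is indicative only, since the DFT and ML timings were measured on different hardware.

```python
# Timings quoted in the text (seconds).
dft_seconds = 6000.0          # one DFT adsorption energy, 8 high performance CPUs
ml_predict_seconds = 0.0006   # one GBR prediction, dual-core laptop
ml_total_seconds = 480.0      # tuning plus 100 random splits, dual-core laptop

# Wall-clock ratio for a single prediction once the model is built.
speedup_per_prediction = dft_seconds / ml_predict_seconds

# Full ML workflow expressed as a fraction of one DFT calculation.
workflow_fraction_of_one_dft = ml_total_seconds / dft_seconds

print(f"{speedup_per_prediction:.1e}", f"{workflow_fraction_of_one_dft:.2f}")
```

On these figures a single prediction is about seven orders of magnitude faster in wall-clock time, and the entire tuning and validation workflow costs less than a tenth of one DFT adsorption energy calculation. The DFT effort needed to generate the training data is not included in this comparison.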
Machine learning has also been used to predict the binding energy of CH4 related species over Cu-based bimetallic alloys.52 The features used in the study were the physical properties of the other metal in the Cu-based alloy. The use of physical properties that are readily available in the literature makes this model much more interpretable and universal. Moreover, it facilitates the rapid discovery of new alloys as the features of every alloy are readily available. We build upon these features (which include the group, period, atomic number, atomic mass, atomic radius, electronegativity, melting point, boiling point, density, heat of fusion, ionization energy and surface energy of the catalyst elements) and extend it to all bimetallic alloys of the type A3B, thus building a universal model which can be used to predict the binding energy of oxygen and carbon over any bimetallic alloy of the form A3B. Additionally, features like the d-band center, Pauling electronegativity and work function have been used to describe the A element as used by Xin et al. in their work.26,27 The relevance of such features is further consolidated by using them to predict the binding energy of oxygen and carbon over single atom alloys with the example of Cu-based SAAs. For industrial applications, bimetallic alloys are more appealing than SAAs as SAAs require sophisticated synthesis methods and current industrial bulk synthesis methods are not adequate to produce stable SAAs.98,99
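One way to read the feature set is 12 physical properties for each of the two metals plus the three additional properties of the A element, which gives 12 + 12 + 3 = 27. This breakdown is our interpretation of the description above, not an itemised list from the authors. A sketch of assembling such a vector follows; the key names, the ordering and the helper function are assumptions, and the numbers are placeholders rather than real element data.

```python
import numpy as np

# The 12 physical properties named in the text, used for both metals.
ELEMENT_PROPERTIES = [
    "group", "period", "atomic_number", "atomic_mass", "atomic_radius",
    "electronegativity", "melting_point", "boiling_point", "density",
    "heat_of_fusion", "ionization_energy", "surface_energy",
]
# The additional properties used to describe the A element.
EXTRA_A_PROPERTIES = ["d_band_center", "pauling_electronegativity", "work_function"]


def build_feature_vector(props_a, props_b, extra_a):
    """Concatenate A-metal, extra A-metal and B-metal properties (assumed order)."""
    values = [props_a[name] for name in ELEMENT_PROPERTIES]
    values += [extra_a[name] for name in EXTRA_A_PROPERTIES]
    values += [props_b[name] for name in ELEMENT_PROPERTIES]
    return np.array(values, dtype=float)


# Placeholder numbers only, to show the shape of the input (NOT real data).
dummy_a = {name: float(i) for i, name in enumerate(ELEMENT_PROPERTIES)}
dummy_b = {name: float(i) + 0.5 for i, name in enumerate(ELEMENT_PROPERTIES)}
dummy_extra = {name: 0.0 for name in EXTRA_A_PROPERTIES}

vector = build_feature_vector(dummy_a, dummy_b, dummy_extra)
print(vector.shape)
```

For the Cu-based SAAs only the 12 properties of the single atom would be kept, since the Cu features are constant across the dataset.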
In addition, a separate analysis is performed to remove the least important features from the model so as to find the test error. On removing the features, the error is observed to increase, suggesting that the set of 27 features used collectively predicts the adsorption energy better. For oxygen binding energy prediction for AA terminated alloys, the test error obtained with the model using 27 features is 0.31 eV. For a model built with only the top 25 features, the test error is 0.32 eV, and for a model built with the top 20 and top 15 features, the test error is increased to 0.33 eV. This is tabulated in ESI Table SI-1.†
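The feature-removal analysis can be sketched as follows: rank the features by the importance reported by the fitted GBR model, then refit on the top-k features and compare test errors. The data are synthetic and the GBR uses 100 trees for speed (both assumptions), and a single train/test split is used here instead of the 100 repeated splits of the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: feature 0 is made the dominant predictor on purpose.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 27))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)


def fit_and_score(columns):
    """Fit GBR on the selected feature columns and return (model, test RMSE)."""
    model = GradientBoostingRegressor(
        n_estimators=100, max_depth=3, learning_rate=0.3, random_state=0
    )
    model.fit(X_tr[:, columns], y_tr)
    error = float(np.sqrt(np.mean((model.predict(X_te[:, columns]) - y_te) ** 2)))
    return model, error


full_model, full_rmse = fit_and_score(np.arange(27))

# Column indices sorted from most to least important.
ranking = np.argsort(full_model.feature_importances_)[::-1]

# Refit using only the top-k features, as in the 25/20/15 comparison.
top_k_rmse = {k: fit_and_score(ranking[:k])[1] for k in (25, 20, 15)}
print(full_rmse, top_k_rmse)
```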
Fig. 6 Correlation plot for oxygen and carbon binding energy with the features of the (a) “A” metal and (b) “B” metal in the AA terminated A3B bimetallic alloy (211) surface.
Fig. 7 Relative feature importance for the GBR model for predicting (a) oxygen and (b) carbon binding energies for the AA terminated A3B bimetallic alloy (211) surface.
In order to optimize the ratio of the test/train data split, an additional analysis is performed to measure the RMSE of the model for test/train split ratios of 15/85, 20/80, 25/75, 30/70, and 50/50. The errors obtained in these cases are tabulated in Table 3. The test error increases as the ratio of test/train data is increased, because the amount of data available for training the model decreases, which reduces the accuracy of the model. Thus, the ML model improves with the availability of more training data. This also suggests that including more training data in our model should further decrease the RMSE obtained. Such an increase in data could be achieved either by adding more alloys or by including more relevant features for each alloy. The deviation of the predicted values from the DFT calculated values for GBR for different ratios of test/train data is presented in Fig. 8.
Test/train split | Train error | Test error |
---|---|---|
15%/85% | 0.0003 | 0.31 |
20%/80% | 0.0003 | 0.31 |
25%/75% | 0.0003 | 0.33 |
30%/70% | 0.0003 | 0.35 |
50%/50% | 0.0003 | 0.4 |
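The split-ratio study summarised in Table 3 amounts to repeating the random-split evaluation for several test fractions. A minimal sketch on synthetic data is given below; the data, the reduced number of trials (10 instead of 100), the 100-tree GBR and the helper name `mean_test_rmse` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the paper's alloy dataset).
rng = np.random.default_rng(4)
X = rng.normal(size=(150, 27))
y = np.sin(2 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=150)


def mean_test_rmse(test_fraction, n_trials=10):
    """Average test RMSE of GBR over n_trials random splits at one test fraction."""
    errors = []
    for seed in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_fraction, random_state=seed
        )
        model = GradientBoostingRegressor(
            n_estimators=100, max_depth=3, learning_rate=0.3, random_state=0
        ).fit(X_tr, y_tr)
        errors.append(np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2)))
    return float(np.mean(errors))


# Test fractions matching the rows of Table 3.
split_errors = {frac: mean_test_rmse(frac) for frac in (0.15, 0.20, 0.25, 0.30, 0.50)}
for frac, err in split_errors.items():
    print(f"test fraction {frac:.2f}: mean test RMSE {err:.3f}")
```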
Another ML model was built to predict carbon binding energies for the AA terminated A3B bimetallic alloys. Since we have established the relevance of GBR in predicting the binding energy of oxygen, only a GBR model was fitted for this prediction. In the data, the carbon binding energy of the AA terminated A3B alloys replaced the oxygen binding energy as the target variable, and the input features remained the same as those used for the oxygen binding energy. Again, the optimum hyperparameters were obtained using grid search with 10-fold cross validation in Scikit-Learn. The test error for the optimized model averaged over 100 trials was found to be 0.34 eV. In each of these trials, the test and train data were chosen randomly. This indicates that the GBR ML model can be effectively used to predict carbon binding energies for these bimetallic alloys with an accuracy comparable to that of DFT calculations. The deviation of the predicted values from the DFT calculated values for GBR for different ratios of test/train data can be seen in Fig. SI-3† and the errors obtained for different test/train splits are available in Table SI-2.† The correlation matrix of the features of the “A” metal and “B” metal for A3B bimetallic alloys with the carbon binding energy can be seen in Fig. 6(a) and (b). The feature importance of this model was again calculated and is presented in Fig. 7(b). It can be seen that the top features for predicting the carbon binding energy remain nearly the same as those for predicting the oxygen binding energy. The surface energy of the dopant is still the most important feature, followed by ionization energy and density. The fact that the most important features remain nearly the same shows that these physical features are highly correlated with the binding energy. Moreover, the ML model is able to identify this correlation and predict the binding energies for both carbon and oxygen over AA terminated A3B bimetallic alloys.
Fig. 9 Relative feature importance for the GBR model for predicting (a) oxygen and (b) carbon binding energies for the AB terminated A3B bimetallic alloy (211) surface.
The three most common features with high importance for binding energy over transition metal surfaces are surface energy, ionization energy and electronegativity. Surface energies are related to the degree of coordinative unsaturation of the surface metal atoms. In general, a higher surface energy is indicative of a surface with higher reactivity. The ionization potential and electron affinity are related to the ease of electron transfer between the surface metal atoms and the adsorbate and are known to be important parameters for chemical bonding. Though ML models are mostly used as data-driven black boxes, an added advantage of the GBR feature importance is that it captures the underlying physics.
All the models are tested with 12 features as input which include the physical properties (group, period, atomic number, atomic mass, atomic radius, electronegativity, melting point, boiling point, density, heat of fusion, ionization energy and surface energy) of single atoms in the alloy. The correlation matrix of features of “B” metal for Cu-based SAAs for the carbon and oxygen binding energy can be seen in Fig. 11. The features describing Cu are not included as they would remain constant for all the alloys used in the model. Again, a similar procedure is followed to get the optimized GBR models as mentioned for the case of A3B bimetallic alloys. A set of hyperparameters are tested using 10-fold cross validation in order to obtain the best hyperparameters for the GBR model. The test error for the optimized models averaged over 100 trials for oxygen binding energies and the carbon binding energies are 0.36 eV and 0.37 eV respectively. This result again shows the effectiveness of the GBR model for the prediction of binding energies. The deviation of the DFT calculated values from the predicted values for different test/train ratios for predicting oxygen and carbon binding energies can be seen in Fig. SI-6 and SI-7† respectively. Errors obtained at the above different test/train ratios are tabulated in Tables SI-5 and SI-6† for oxygen and carbon binding energies over SAAs respectively. This again illustrates that increasing the train data improves the prediction of the GBR model.
Fig. 11 Correlation plot for oxygen and carbon binding energy with the features of the single atom metal in the Cu-based SAA.
The feature importance for the optimized models for the prediction of oxygen and carbon binding energies is shown in Fig. 12(a) and (b) respectively. The most important feature for the prediction of carbon binding energy is still the surface energy of the element, while the rest of the features have nearly equal relative importance. This again shows the high correlation between surface energy and binding energy, and the ability of the ML model to identify this correlation for prediction. However, for the prediction of oxygen binding energies, the group and the surface energy of the single atom have similar importance. The importance of the group is also observed in the study by Takigawa et al.,52 where it is the most important feature for predicting the binding energies of H and CH2 over Cu-based alloys.
Fig. 12 Relative feature importance for the GBR model for predicting (a) carbon and (b) oxygen binding energies for the Cu-based SAA.
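As a rough illustration of how such a ranking is obtained, the sketch below fits a Scikit-Learn `GradientBoostingRegressor` and reads its `feature_importances_` attribute. The feature names follow the 12 properties listed in the text, but the data are synthetic and constructed (as an assumption, purely for demonstration) so that surface energy dominates the target; the resulting numbers say nothing about the actual SAA models.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# The 12 input features named in the text.
FEATURES = [
    "group", "period", "atomic_number", "atomic_mass", "atomic_radius",
    "electronegativity", "melting_point", "boiling_point", "density",
    "heat_of_fusion", "ionization_energy", "surface_energy",
]

# Synthetic stand-in data (assumption): the target is built to depend mainly
# on surface energy and weakly on group, plus a small amount of noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, len(FEATURES)))
y = (
    2.0 * X[:, FEATURES.index("surface_energy")]
    + 0.5 * X[:, FEATURES.index("group")]
    + rng.normal(scale=0.1, size=200)
)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = model.feature_importances_
ranked = sorted(zip(FEATURES, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

Because these importances are relative and sum to one, they indicate which inputs the fitted trees rely on most, not the sign or the physical magnitude of each effect.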
Fig. 13 depicts the TOFs of the products of ethanol decomposition calculated from DFT as well as ML based alloy binding energies. The reaction conditions are T = 573 K and P = 2 bar with an inlet stream ratio of 1:1 for ethanol and hydrogen. Ethane is formed upon C–O scission of ethanol, whereas methane is the C–C scission product, as shown in Fig. 13(a) and (b) respectively. It can be inferred from Fig. 13 that the activity trend over the alloys is broadly the same for DFT and ML based alloy energies. The activity trend for DFT based alloy energetics is Co3Ru ∼ Ni3Fe ∼ Co3Ni ∼ Co3Fe > Ni3Cu > Ni3Pt > Ni3Rh,13 whereas that based on ML alloy energetics is Co3Ru > Ni3Fe ∼ Co3Ni ∼ Co3Fe > Ni3Cu ∼ Ni3Pt ∼ Ni3Rh. The TOFs of the C–O scission product (ethane) over the alloys are given in Table 4. For Co3Ru, Ni3Rh, and Ni3Pt, the calculated TOF is the same for both the DFT and ML energetics. However, for the rest of the alloys, the TOFs obtained from ML energies are underpredicted by an order of magnitude. An error bar of 0.3 eV is considered for C and O binding energies on the metals and metal alloys, as this is comparable to the error found for the ML models tested in the study.
Fig. 13 Comparison of TOFs of ethanol decomposition to produce (a) ethane and (b) methane on Co and Ni based bimetallic alloys as obtained from DFT (circle) and ML (square). Error bar = 0.3 eV.
Alloy | TOF [DFT] | TOF [ML] |
---|---|---|
Ni3Fe | 10⁻³ s⁻¹ | 10⁻⁴ s⁻¹ |
Ni3Rh | 10⁻⁵ s⁻¹ | 10⁻⁵ s⁻¹ |
Ni3Cu | 10⁻⁴ s⁻¹ | 10⁻⁵ s⁻¹ |
Ni3Pt | 10⁻⁵ s⁻¹ | 10⁻⁵ s⁻¹ |
Co3Fe | 10⁻³ s⁻¹ | 10⁻⁴ s⁻¹ |
Co3Ni | 10⁻³ s⁻¹ | 10⁻⁴ s⁻¹ |
Co3Ru | 10⁻³ s⁻¹ | 10⁻³ s⁻¹ |
Furthermore, a similar comparison is conducted for the NODH of ethanol at 473 K and an initial conversion of 10%. Fig. 14(a) and (b) show the volcano plots for the reaction products acetaldehyde and ethylene, respectively, over the alloys as well as transition metals, for descriptor energies obtained from DFT15 and ML. The TOFs of acetaldehyde over the alloys are listed in Table 5. For Ni3Sn, Cu3Rh and Cu3Pd, the TOF is the same for both DFT and ML; however, for Cu3Ni and Cu3Pt, a difference of an order of magnitude is observed between the two methods. Overall, it can be concluded that the obtained trend in TOFs on alloys is similar and that the difference in individual TOF values between ML and DFT predictions is not more than an order of magnitude. Increasing the DFT calculated alloy data set used to train the ML models can therefore be expected to enhance their accuracy. ML models in combination with MKM give direct access to an in silico catalyst screening process, where GBR based ML models can be applied in tandem with existing databases and MKM models to screen potential bimetallic alloy or single atom alloy catalyst candidates faster than conventional DFT based methods, without compromising the accuracy.
Fig. 14 Comparison of TOFs for the NODH of ethanol to produce (a) acetaldehyde and (b) ethylene on bimetallic alloys as obtained from DFT (circle) and ML (square). Error bar = 0.3 eV.
Alloy | TOF [DFT] | TOF [ML] |
---|---|---|
Ni3Sn | 10⁴ s⁻¹ | 10⁴ s⁻¹ |
Cu3Rh | 10⁴ s⁻¹ | 10⁴ s⁻¹ |
Cu3Ni | 10⁴ s⁻¹ | 10³ s⁻¹ |
Cu3Pt | 10⁴ s⁻¹ | 10³ s⁻¹ |
Cu3Pd | 10⁴ s⁻¹ | 10⁴ s⁻¹ |
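The order-of-magnitude agreement between the DFT and ML based TOFs can be checked directly from the tabulated values. The short sketch below does this for the ethane TOFs (ethanol decomposition) and the acetaldehyde TOFs (NODH) reported in the two tables; the dictionary and function names are introduced here for illustration only.

```python
import math

# TOFs in s^-1 as (DFT, ML), transcribed from the two tables in the text.
ETHANE_TOF = {
    "Ni3Fe": (1e-3, 1e-4),
    "Ni3Rh": (1e-5, 1e-5),
    "Ni3Cu": (1e-4, 1e-5),
    "Ni3Pt": (1e-5, 1e-5),
    "Co3Fe": (1e-3, 1e-4),
    "Co3Ni": (1e-3, 1e-4),
    "Co3Ru": (1e-3, 1e-3),
}
ACETALDEHYDE_TOF = {
    "Ni3Sn": (1e4, 1e4),
    "Cu3Rh": (1e4, 1e4),
    "Cu3Ni": (1e4, 1e3),
    "Cu3Pt": (1e4, 1e3),
    "Cu3Pd": (1e4, 1e4),
}


def order_gap(tof_dft, tof_ml):
    """Orders of magnitude by which the DFT based TOF exceeds the ML based TOF."""
    return round(math.log10(tof_dft) - math.log10(tof_ml))


def summarize(table):
    """Map each alloy to its DFT-minus-ML gap in orders of magnitude."""
    return {alloy: order_gap(dft, ml) for alloy, (dft, ml) in table.items()}


ethane_gaps = summarize(ETHANE_TOF)
acetaldehyde_gaps = summarize(ACETALDEHYDE_TOF)
max_gap = max(abs(g) for g in list(ethane_gaps.values()) + list(acetaldehyde_gaps.values()))

print(ethane_gaps)
print(acetaldehyde_gaps)
print("largest gap (orders of magnitude):", max_gap)
```

A positive gap means the ML based energetics underpredict the TOF relative to DFT, which is the direction of every nonzero deviation in both tables.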
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9ta07651d
‡ Equal contribution.
This journal is © The Royal Society of Chemistry 2020