Active learning with non-ab initio input features toward efficient CO2 reduction catalysts

In this work, we propose the use of the d-band width of the muffin-tin orbital theory (to account for the local coordination environment) plus electronegativity (to account for adsorbate renormalization) as a simple set of alternative descriptors for chemisorption which do not require ab initio calculations for large-scale first-hand screening.


Introduction
Understanding and predicting the energetics associated with bond-forming and bond-breaking reactions occurring on the surface of solid materials is the central theme of heterogeneous catalysis research. Among many other catalysis theories, in particular, the Sabatier principle 1 is an important simple concept that states that the chemisorption strength of key reaction intermediates on catalyst surfaces should be just right to maximize catalytic activity; either too weak or too strong binding leads to insufficient activation of the reactant or great difficulty in product desorption aer catalysis completion, respectively, and therefore a typical volcano activity relationship can be plotted as a function of the binding energies. 2 In a series of pioneering studies, Nørskov and co-workers suggested a way to understand the chemisorption of reaction adsorbates in terms of the surface electronic structure of the materials in a socalled d-band theory. 3 Here, the essence is that the binding energy of an adsorbate to a metal surface is largely dependent on the electronic structure of the surface itself, namely, the d-band centre of the surface rather than the entire detailed density of states (DOS).
With the recent progress in electronic structure methods (mainly density functional theory calculations for solids) that can now give reliable electronic structures and binding energetics, the d-band center theory, along with the scaling relations that exist between the binding energies of related adsorbates, has been successfully applied to understanding and predicting new materials for many different applications. [4][5][6][7][8] However, exceptions were also found in which the usual d-band center trend could not explain the activity measured. [9][10][11] The main cause for the aforementioned exceptions was the lack of consideration of the spread in energy states and, for those cases, the correlation between the d-band center and the activity was improved by the introduction of the d-band width (W d ) 11 and the upper edge of the d-band (E u ) when using the Hilbert transform of the projected DOS. 10 It has also been suggested that the standard d-band model is not a reliable measure for systems such as the Pt-Au-M ternary nanoparticle system because of the notable change in the electronic structure caused by the strain and ligand effects compared to that for pure Pt nanoparticles. 12 Recently, instead of energetic descriptors, an alternative metric to describe the activity of the catalyst based on the local geometric features of the active sites has been proposed, namely, the generalized coordination number. 13,14 This approach yielded simple coordination-activity plots that predicted the optimal geometric structure of platinum nanoparticles, which were then experimentally veried. 15 Nonetheless, these generalized coordination numbers are not easy to extend to alloy systems, at least in their current form, since they cannot distinguish the different electronic structures associated with the different metal atoms in the alloys. In this sense, it would be helpful to have a descriptor that can describe the local coordination environment as well as the electronic structure of the constituent metal atoms when describing metal alloys. Indeed, an approach to satisfying the latter two aspects has been proposed in an orbital-wise coordination number 16 although it still requires ab initio calculated geometries to obtain high accuracy.
We note that an open database such as the Materials Project 17 provides an excellent general platform to perform large scale material screening for many applications using various DFT-derived quantities such as the density of states (DOS). In this work, we propose to use the d-band width within the (tight binding) linear muffin-tin orbital (LMTO or TB-LMTO) theory 18 as a simple non-ab initio quantity that can be efficiently used in rst-stage catalyst screening applications. Unlike the usual d-band width obtained from DFT calculations that is a bulk property of the slab, the LMTO d-band width in practice can capture the local electronic structure due to the truncation of the interatomic couplings up to the second nearest neighbor atoms.
Using this LMTO-based d-band width, we construct, as a toy problem, a chemisorption model to compute the binding energy of *CO on various metal alloys. The idea is to establish a functional relation between the simple yet analytical LMTO d-band width and *CO binding energy, and to perform largescale screening using these non-ab initio descriptors to nd a material with optimal *CO binding for an efficient CO 2 reduction reaction (CRR). Although there are many linear models between the descriptors such as the d-band center (sometimes augmented by the upper edge of the d-band) and the binding energy of the adsorbate, 10,11,19 to increase the prediction accuracy we adopt machine learning techniques to incorporate the potential nonlinear correlation between the descriptors and binding energies.
We note that there are two machine learning models to predict *CO binding energy in literature that inspired the present work and used simple descriptors, with 13 electronic structure based descriptors in one case 20 and 2 local geometric features in the other case. 21 The authors, in both reports, obtained a similar mean absolute deviation error of 0.13 eV when predicting the *CO binding energy for various alloy systems despite the very different input features (13 electronic vs. 2 geometric), demonstrating the importance of proper feature selection for improved learning efficiency. It can also be noted that the input features in both machines still required ab initio calculations to relax the geometries and/or to obtain accurate electronic structures of the materials. In a recent and very interesting reaction network study, 22 non-ab initio extended connectivity ngerprints (ECFPs) based on a Gaussian process (GP) model were used to predict the formation energies of $90 surface intermediate species, with a nal goal of identifying the most likely reaction pathways of syngas formation on rhodium (111), although a potential transferability limitation of ECFPs for different adsorption sites was noted in which ab initio calculations would still be needed. Finding non-ab initio input features that represent local environments is thus of signicant current interest in the eld of heterogeneous catalysis.
In this work, we propose the use of two non-ab initio input features, i.e. LMTO d-band width and electronegativity, as an easy-to-compute model to predict the *CO binding energy of various alloy systems. By combining the aforementioned descriptors and utilizing the latest active learning algorithms, we obtained a root mean square error (RMSE) of 0.05 eV. As an example of the application of the machine to screen transition metal based alloy catalysts for CRR, we identied three promising catalysts, Cu-Fe@Cu, Cu 3 Sc@Cu* and Cu 3 Y@Cu*, with higher activity than the most active Au based catalysts.

Non-ab initio descriptors for chemisorption
Selecting proper descriptors is one of the most important tasks in machine learning since it determines the learning efficiency as well as the prediction power. While most current descriptors for chemisorption models in machine learning require ab initio calculations, such as the d-band center and its higher-order moments, 21 our main focus is to utilize non-ab initio based descriptors. As proposed by Nørskov and co-workers, 3 the surface-adsorbate binding process can be decomposed into two effects; the coupling of the adsorbates with the sp-bands and d-bands of the catalyst surface. For the former, based on an empirical correlation between the sp-coupling and surfaceadsorbate bonding distance, 19,23-25 the sp-coupling term is usually estimated as the geometric mean of the electronegativity for the rst coordination shell c ¼ A and we followed the same practice. For the latter, instead of the conventional ab initio d-band center, we propose the use of a non-ab initio analytical expression of the d-band width within the LMTO theory, denoted W LMTO d (see eqn (S4) in the ESI †). While W LMTO d does not contain information about the center position of the d-band, there are two advantages of using it as a descriptor for the purpose of large-scale screening. Firstly, it effectively captures the local chemical environment of the d-state for chemical events due to the use of interatomic coupling terms with the second nearest neighbor atoms from the active site, a similar concept as used for the generalized coordination number. 13,14,26 Secondly, it does not require DFT calculations to generate input features since the analytical expression for W LMTO d can be evaluated based on tabulated values for a given composition (see ESI †). Since the learning will be supervised by labeled reference data, one would still need the DFT calculations of the *CO binding energy to establish a training set. However, we emphasize that, unlike existing ab initio input features such as the d-band center, the current model does not require additional DFT calculation to predict the *CO binding energy once the machine is trained.

Supervised (active) learning methods
Supervised machine learning is a method used to predict a target value (e.g. total energy) from given inputs (e.g. electron density). Among many algorithms, we used two machine learning methods, articial neural network (ANN or simply neural network, NN) and kernel-ridge-regression (KRR) methods. NN involves a stack of layers consisting of input, (multiple) hidden and output layers, and each layer contains many neurons. The network is trained by measuring and minimizing the errors between the predicted output and reference values (called labels) using backpropagation algorithms. 27 In KRR, the model is trained by solving the ridge regression with kernel-function-based non-linear transformation to input data. A kernel function is used to transform the original input data into an easy-to-train form by describing the relation (or distances) between the input data. 28 Excellent reviews introducing machine learning algorithms for additional details can be found elsewhere. 28,29 As briey introduced, in supervised machine learning, most of the computational cost of building the model usually occurs when generating the reference data in the training set and running the cost-function minimizations. Therefore, it is of signicant practical interest to reduce the training set size to as small as possible without compromising the representability of the system. Active learning is an algorithm in which the machine can point out samples with maximal information about the target function, 30 and it is widely used currently in classication/ltering, speech recognition, information extraction, computational biology, etc., for example. 31 In this work, we utilize active learning in the design of the catalyst to choose the minimal list of samples for training that can represent the given class of alloy material.
We used two types of machine learning methods, neural network (NN) and kernel-ridge-regression (KRR) methods, as described in detail in the Computational details section. For active neural network learning, 30 we used an ensemble NN model. In this method, one constructs multiple NN models in an ensemble (5 in our case) based on the same training set but optimized with different initializations, identies examples in the test set characterized by the largest variance (or ambiguity) within the ensemble and then includes these new examples in the next training set for further learning. Since this algorithm does not require a label (*CO binding energy), we denote this method an ensemble NN without labels. If one already has labels for the test set, improved accuracy might be expected by computing the actual residual, or error, (the difference between the ensemble-averaged model-predicted values and the true labels) and identifying an example with the largest residual. We denote this an ensemble NN with labels.
For KRR, since there are analytical unique solutions for given training samples and hyper-parameters, methods similar to ensemble NN cannot be constructed. Instead, there are other types of active learning algorithm for KRR in literature, [32][33][34][35] and in this work, we used the residual regression model. In this algorithm, one rst obtains a *CO binding energy predictor with the training set (as in conventional KRR) with a certain training set error. Next, one constructs an error predictor with the same training set using the previous training set error as output data. This error predictor is then used to identify the samples that are the most different from the existing training set. In other words, one estimates the errors of all the test samples using this error predictor and can nd samples with the lowest absolute value of the generated outputs for further learning. Since this algorithm does not need the labels of the test set, we denote this method active KRR without labels. 34 For a similar reason to that considered for ensemble NN with labels, since in the present case all of the labels of the test set are available, we also constructed another active KRR model utilizing the labels of the test set, denoted active KRR with labels. Here, one includes samples with the highest absolute value of the residual of the error predictor on the test set in the next training set. The residual of the error predictor is dened as the difference between the output of the error predictor and the error calculated from the *CO binding energy predictor.

Descriptor evaluations
As a toy problem of predicting the *CO binding energy on the fcc(100) slabs, we considered surface overlayers in the form of X@M, M-X@M and M 3 X@M, where M ¼ Ag, Au, Cu, Ni, Pd or Pt and X is the 3d, 4d or 5d transition metal (263 samples in total), as shown in Fig. 1 Table S3 of the ESI. † Counting of the rst and second nearest neighboring atoms was dened layer by layer using the two topmost metal layers. In other words, for the rst layer, around the binding site, there are 4 atoms of the nearest and 4 atoms of the second nearest neighbors on the basis of distance from the binding site. Similarly, this can be applied to the second layer; the number of the rst and second nearest neighboring atoms around the binding site in the second layer is 4 and 8. Using this denition as the coordination number, c was calculated on both the Mulliken (c M ) and Pauling (c P ) scales. The estimation of W LMTO d is described in detail in the ESI, † but we emphasize that  To implement the ensemble NN, 5 independent DHL ANN models were constructed with 4 samples randomly chosen as an initial training set. During the active learning process, 2 additional samples were identied in each iteration from the untrained samples and added to the training set until the nal training set reached 60% of the total data.
For all of the KRR models, the conventional KRR method (nonactive KRR) 29 with a kernel function, k(x u ,x v ) ¼ exp(Àkx u À x v k 2 /s), was used.
To reduce the computational cost of the hyper-parameter optimizations, we explicitly xed the kernel width as s ¼ 0.5 and the regularization factor to be 0.005. For the active KRR models, the same procedure as that of ensemble NN was applied. For all machine learning models, 4800 independent trials were applied to remove randomness from the initial training sets.

DFT calculations
We performed DFT calculations for selected candidates using the Vienna Ab initio Simulation Package (VASP) 41 with a projector-augmented wave (PAW) 42 and the revised Perdew-Burke-Ernzerhof (RPBE) exchange-correlation functional. 43,44 The energy cut-off for the plane-wave basis set was 500 eV, and k-points were sampled using a (8 Â 4 Â 1) Monkhorst-Pack mesh. 45 We modelled the fcc(100) slabs with a (4 Â 2) atom containing surface unit cell and 4 layers. A 15 A vacuum was used to minimize the interaction between periodic images in the z-direction. The topmost 2 layers were relaxed until the residual force on each atom became less than 0.05 eV A À1 . The free energy of the reaction intermediates on the surface was obtained by using a harmonic oscillator approximation at 298.15 K implemented using an Atomic Simulation Environment (ASE) program, 46 and the free energy of gas molecules was obtained using an ideal gas approximation at 298.15 K implemented using ASE. To correct the systematic DFT errors for describing C]O double bonds and H-H bonds, we added a +0.15 eV correction for each C]O bond, i.e. +0.30 eV for a CO 2 molecule and +0.15 eV for an adsorbed *COOH, and +0.10 eV for the H-H bond in a H 2 molecule. 47 We also applied approximate solvation corrections for *CO (À0.10 eV) and *COOH (À0.25 eV) to account for the effect of solvation. 48

Results and discussion
We rst compared the LMTO-based d-band widths, W LMTO d , to the DFT d-band widths, W cal d . Two main points can be drawn from the results shown in Fig. 2a Fig. 2b and c. Similar to the correlation between the DFT and LMTO-based d-band widths shown in Fig. 2a, there are also several studies showing that the electronic property from LMTO theory is strongly correlated to that from DFT calculation. As reported by A. Vojvodic et al., 11 the d-band center approximated by LMTO theory showed the same trend as that from DFT calculation. In addition, as reported by J. R. Kitchin et al., 49 it was theoretically shown that the matrix element, dened as the summation of the interatomic coupling terms up to all the nearest neighbors, from LMTO theory is strongly correlated to the d-band width from DFT calculations. All of these results suggest that the proposed d-band width using LMTO theory can reasonably represent the electronic properties of local environments.
Using these two selected descriptors, the performances of various machine learning models are summarized in Fig. 3. Three points are noteworthy: (1) Interestingly, even the LMTO-based d-band width (W LMTO d ) alone performs quite well (RMSE ¼ 0.07 eV). In addition, the LMTO-based d-band width consistently yields a lower RMSE than the ab initio-based d-band width by 0.05-0.15 eV, when combined with c. This suggests that the local concept involved in the evaluation of W LMTO d (interactions up to the 2 nd nearest neighboring atoms in the surface and subsurface layers) helps to correlate better with the binding affinity as compared to the bulk surface quantity (W cal d ). The localized nature of W LMTO d can also be conrmed by the data shown in Fig. 4, in which the core@M alloys with the same M species are all clustered around a similar region, whereas the distribution of W cal d is much broader for the same M species. This clustering is a helpful feature for active learning since it becomes easier to choose new data which differ the most from the existing training set data.
(2) For all of the combinations of descriptors shown in Fig. 3a, KRR (0.05-0.37 eV) performs consistently better than ANN (0.21-0.45 eV) for the chemistry studied.
(3) The actively learned KRR enhances the accuracy of the model signicantly, lowering the RMSE by 0.13 eV compared to that of conventional KRR (from 0.18 to 0.05 eV) for the best case. A more detailed discussion on the effects of active learning on the RMSE variance and accuracy for both ANN and KRR will be given later.
Therefore, by combining all of these results, the best chemisorption model without any ab initio inputs is the active KRR model based on the pair of descriptors (W LMTO d and c P ) with an RMSE of 0.05 eV. This can be compared with previous results (0.13 eV) using ANN with ab initio based parameters and geometries.
There are many indications that the d-band center alone is not a sufficient descriptor for more complicated catalyst structures, [9][10][11][12] but the d-band center is still one of the most widelyused descriptors for chemisorption models on the catalyst surface, so we also considered models that included the conventional d-band center (Fig. 3b and c). As expected, for all of the machine learning methods and combinations of descriptors, the inclusion of the d-band center improves the accuracy signicantly, with the best model being the active KRR with any combination of descriptors with RMSE $ 0.02 eV. It is remarkable, however, to note that the difference between the cost-effective LMTO d-band width model (0.05 eV) and the expensive d-band center model is only 0.03 eV.
Because the current model is trained using a single type of active site (fcc(100)-terrace), we tested the extensibility of the approach by treating two different additional coordination environments, namely fcc(111)-terrace and fcc(211)-step sites. For each fcc surface type, 3 kinds of surface slab model (X@M, M-X@M and M 3 X@M where M, X ¼ Ni, Cu, Pd, Ag, Pt or Au) were considered. We constructed the machine based on the active KRR with labels model with a pair of descriptors (W LMTO d and c P ) and a total of 305 samples. We obtained an overall combined RMS error of 0.08 eV, quite close to the 0.05 eV Fig. 3 The performance of various machine learning models with different descriptors: (a) without an ab initio d-band center, and (b and c) with a d-band center. All RMSE values were calculated for the entire data set. obtained with the (100) facet only, as shown in Fig. 3a. Apparently, the latter subset is not an extensive test of the method, yet it clearly shows the reasonable potential of the d-band width as an efficient descriptor for the rst-hand screening of large candidates before DFT calculation on various coordination environments. Next, we systematically analyzed the results with and without active learning techniques, which are summarized in Fig. 5 with the raw RMSE data of 4800 independent trials for ANN and KRR methods. For ANN, it is clear that the widths of the distributions are substantially reduced by applying an active learning algorithm from 0.31 eV for DHL ANN to 0.07 eV for ensemble NN (w. labels). Similar behavior is also seen in KRR, in which the width of distribution decreased from 0.17 eV for non-active KRR to 0.01 eV for active learning with labels. Interestingly, in the absence of the labels, the width increased with active learning (albeit still with improved nal accuracy). Although, as shown in Fig. 5c, the effects of active learning are much more pronounced for KRR (0.18 / 0.05 eV) than for ANN (0.21 / 0.17 eV), it is possible that the use of different active learning algorithms for ANN other than the ensemble method used here may further enhance the resulting accuracy.
As a practical application of the actively learned chemisorption model with W LMTO d and c P as descriptors, we screened over 372 transition metal-based alloys (including 263 structures used in learning) with the structures shown in Fig. 1 to nd active CRR catalysts to produce CO. Particularly, it has been suggested based on DFT calculations that the *CO binding energy is a key descriptor for the catalytic activity of CO 2 reduction, 48 and the current density measurements on various metal catalysts indeed showed a volcano-shaped relation of activity with respect to *CO binding energy. 50 Currently, Au catalysts are reported to be the best single component catalyst for converting CO 2 into CO, but alternative cost-effective catalysts are needed due to the high cost and scarcity of Au. To replace Au catalysts, one thus needs to develop catalysts with strong E COOH to facilitate the activation of CO 2 , but not too strong E CO so as to easily remove the product. Considering that the optimal *CO binding energy to achieve facile *COOH formation and *CO desorption is approximately À0.5 eV based on the scaling relation and the volcano plot, 48 we selected candidates for which the *CO binding energies are in the range of À0.60 to À0.43 eV. 20,48 As shown in Fig. 6a, our actively learned chemisorption model yielded 36 candidates within the optimal target window (À0.60 to À0.43 eV). Among them, we chose the alloys in which the outermost surface layer is covered by Cu or Au and that are nearest to the top of the volcano plot, yielding 15 candidates for further validation with good agreement between E CO,DFT and E CO,ML (see Fig. 6b). However, under experimental conditions complicated segregation/dissolution processes and mixing can occur in nanoalloys suggesting the importance of stability in catalyst design. 51 In addition, the catalytic activity can even be enhanced by a change in surface metal composition, which could possibly be caused by favorable surface segregation under reaction conditions. For example, K. J. Andersson et al. 52 experimentally investigated the surface segregation of a CuPt near-surface alloy (NSA) under CO adsorbed conditions. This indicates two main points: rstly, under an adsorption environment the electronic structure differs from vacuum conditions, and secondly, counterintuitively one can design a new type of alloy by inducing adsorption-driven surface segregation. In this context, we explored the stability of all 15 candidates under vacuum and a CO adsorbed state (details are shown in Fig. S1 †) by changing the surface and subsurface layer. Herein, we use the symbol * for samples which are stable when the surface and subsurface layers are switched from the original structure shown in Fig. 1. Fig. 7 shows the free energy diagram for selected catalysts (full results are found in Fig. S2 †) including Au(100) and Cu(100) as references, and all of the selected catalysts show stabilized *COOH and *CO compared to Au(100) and Cu(100). Considering the limiting potential (U L ¼ ÀDG MAX /e) as a measure of CRR activity, the free energy diagram indicates  that the U L of Cu-Fe@Cu (À0.85 V) is less negative than that of Au(100) (À1.21 V) by 0.36 V. Furthermore, Cu 3 Y@Cu* and Cu 3 Sc@Cu* have a U L of À0.20 V and À0.35 V respectively, which is less negative than Au(100) by 1.01 V and 0.86 V respectively. The catalytic activity of Cu 3 Y@Cu* is expected to outperform various Au-based catalysts, such as a Au-Cu bifunctional interfacial catalyst (U L ¼ À0.60 V) 53 and Au NP corner site (U L ¼ À0.60 V). 54 These results should also be compared to the experimental potentials of the best performing Au-based catalysts that reach a current density of CO production of more than 5 mA cm À2 in literature; À0.40 V for oxide-derived Au nanoparticles, 55 À0.35 V for Au needles 56 and À0.35 V for Au nanowires. 57 All of these results imply that Cu 3 Y@Cu* could be highly active, comparable to the Au catalysts, and a costeffective alternative to the Au catalysts for the CO 2 reduction reaction.

Conclusions
We presented a machine learning model that can predict the binding energy of surface adsorbates on alloys using simple non-ab initio input features, namely, linear muffin-tin orbital theory (LMTO)-based d-band width and a geometric mean of electronegativity. By combining the aforementioned descriptors with the active learning algorithm, we obtained a high accuracy (RMSE ¼ 0.05 eV for active KRR with labels) to predict *CO binding energy. The use of the LMTO d-band width as a learning descriptor yielded a higher prediction accuracy than the DFT-based d-band width due to the local characteristics of the W LMTO d . The effects of active learning were signicant, lowering the RMSE for a neural network 0.21 / 0.17 eV, and for KRR 0.18 / 0.05 eV. The present d-band width model is also shown to work reasonably well for other facets such as (111) and (211) step sites to describe different coordination environments (with a combined error of 0.08 eV for all three facets). As an example of the practical application of the constructed KRR machine, we then screened the alloy catalysts for the CO 2 electro-chemical reduction reaction by estimating their *CO binding energies, and identied that Cu 3 Sc@Cu* and Cu 3 Y@Cu* have an overpotential of $1 V lower than Au(100). We expect that the non-ab initio descriptors proposed here can be easily applicable to other types of catalyst designs, such as understanding the statistical behaviour of realistic experimental nanoparticles or nanowires with long temporal and large spatial sampling aspects where there are thousands of possible reaction sites with different local environments. 58 Being able to rapidly estimate *CO binding energies using the easy-to-compute input features proposed here will undoubtedly be helpful to provide new insights for exciting experimental CO 2 reduction results on complex surfaces.

Conflicts of interest
There are no conicts to declare.