A data-driven high-throughput workflow applied to promoted In-oxide catalysts for CO 2 hydrogenation to methanol †

We propose a novel high-throughput workflow, combining DFT-derived atomic scale interaction parameters with experimental data to identify key performance-related descriptors in a CO 2 to methanol reaction, for In-based catalysts. Utilizing advanced machine learning algorithms suitable for small datasets, secondary descriptors with high predictive power for catalytic activity were constructed. These descriptors, which highlight the crucial role of hydroxyl sites, can be applied to designing new materials and to bringing them to the test with high-throughput screening, paving the path for accelerated catalyst design. The environmental impact of CO 2 emissions calls for urgent action towards converting this environmentally detrimental waste product into carbon feedstock and recycling it to valuable products. 1,2 Achieving this objective requires catalysts that are both inexpensive to produce and highly efficient in given process scenarios. The high thermodynamic stability of CO 2 presents a significant hurdle in its utilization, which can only be overcome by lowering the high reaction barrier through the use of suitable catalyst materials. 3 While alternative CO 2 activation methods do exist, 4,5 conventional, thermal heterogeneous catalysis is a preferred method in industrial applications. However, the process of designing new catalysts is traditionally decelerated by the circular dependency between catalyst testing, comprehensive characterization, and computational

The environmental impact of CO 2 emissions calls for urgent action towards converting this environmentally detrimental waste product into carbon feedstock and recycling it to valuable products. 1,2Achieving this objective requires catalysts that are both inexpensive to produce and highly efficient in given process scenarios.The high thermodynamic stability of CO 2 presents a significant hurdle in its utilization, which can only be overcome by lowering the high reaction barrier through the use of suitable catalyst materials. 3While alternative CO 2 activation methods do exist, 4,5 conventional, thermal heterogeneous catalysis is a preferred method in industrial applications.However, the process of designing new catalysts is traditionally decelerated by the circular dependency between catalyst testing, comprehensive characterization, and computational modelling.Experimentally observed performance trends from experimental data can provide insight into catalytic activity and allow for the traditional design of experiments based on simple composition changes and major reaction parameters.
But when it comes to designing new catalyst classes, data based on experiments alone may have certain limitations. 6,7n the other hand, in-depth characterization of individual catalysts requires a significant investment of resources, and may under certain conditions be limiting.The advent of quantum-mechanics (QM) methods such as density functional theory (DFT) in the late 1990s has helped to increase the speed of discovering new materials via revealing structure-activity relationships. 8However, one of the main challenges in employing QM methods in screening processes for new catalysts is the high computational cost. 9Traditional simulation protocols involve a systematic exploration of reaction mechanisms, which can be tedious and require experienced modellers' careful interventions, making them challenging for high-throughput screening applications.Widely studied reactions with community-accepted mechanisms, such as the one being discussed here, in principle, allow for identification of crucial rate-limiting steps referring to earlier work.However, it would be desirable to replace the need for very expensive, accurately predetermined reaction steps with a high-throughput simulation-friendly, data-driven protocol.Such a protocol should consist of programmatically generated, chemically relevant reaction intermediate adsorption energies on the desired surface-active site.These "generic" descriptors combined with relevant experiment parameters can then be correlated with target response parameters to identify the design rules with key important parameters. 10,11These design rules hold the promise of reducing the number of configurations that need to be considered, and in turn, reducing the computational cost and increasing the speed of prediction.
In this study, we propose a workflow that combines highthroughput computation and experimentation, along with machine learning algorithms (as shown in Fig. 1).With this workflow, we identify descriptors and rules that have the desired predictive power for the potential development of new catalytic materials.It is important to note that this protocol does not require prior knowledge of curated reaction steps; only a general knowledge of possible intermediates is sufficient.This makes the protocol suitable for new catalyst compositions where the reaction mechanisms have not been experimentally proven, and data-driven identification of important intermediates that can provide more insight quickly on the hypothesized mechanism.The desired target product in the study discussed here is methanol, which can be used as an alternative clean fuel and intermediate for the production of a variety of chemical commodities. 12,13e focus our attention on In 2 O 3 /ZrO 2 as a benchmark system for CO 2 hydrogenation.In 2 O 3 /ZrO 2 catalysts have gained attention in recent years due to the enhanced stability under CO 2 -rich conditions compared to extensively studied Cu-based catalysts, 14,15 making In 2 O 3 /ZrO 2 a promising candidate for industrial application.To diversify the chemical space sampled in this study, we employ a simple promotion strategy in which 13 different promoters from transition and post-transition metals were selected to synthesize Inmodified catalysts via co-impregnation (see Fig. 1).A molar ratio of promoter : In = 1 : 3 was used.To synthesize the catalysts, ZrO 2 was initially crushed and sieved to reach a target fraction of 250-315 μm.For impregnation of In, as well as co-impregnation of each of the 13 co-promoters, the crushed and sieved ZrO 2 was added to an aqueous nitrate solution of In and the respective co-promoter.Subsequently, the solvent was evaporated and the catalyst was calcined at 300 °C.
The catalytic test was carried out in the gas phase in a 16fold parallel reactor system at hte GmbH (Heidelberg, Germany).The reactors were filled with 0.5 ml of calcined catalysts, with a pre-/post-bed of corundum (Fig. 1).The catalysts in each reactor channel were reduced under H 2 : N 2 : Ar = 30 : 60 : 10, and afterwards tested for CO 2 hydrogenation to methanol.The reaction was conducted under a pressure = 80 bar, total flow = 48 L h −1 , CO : CO 2 : H 2 = 1.9 : 17.1 : 76 and three different temperatures: 225, 250 and 275 °C.The yields were measured using in-line TCD-GC.Only data in the kinetic regime outside of the thermodynamic equilibrium limits were considered, and therefore higher temperatures above 275 °C were not included in the screening.
To develop a descriptive model for catalytic performance, one should consider the behaviour of the activated catalyst, which is formed in situ and under reaction conditions.7][18] Therefore, obtaining a functional relation between performance and computed descriptors stemming from a simplistic yet representative DFT model is extremely valuable.To this extent, we construct an oxygen vacancy on the (110) surface of cubic In 2 O 3 and introduce the chemical impact of the promoter species by replacing one of the In atoms at the O vacancy by the promoter (Fig. 1 and S1 †).This site is probed by all intermediates, along two competing pathways for CO and methanol formation 15 (Fig. S2 †).In total, 1350 intermediate relaxations were performed on the DFT level (Fig. S3-S6 †).Thereby, adsorption energies of 90 unique adsorbates, as well as formation energies of oxygen vacancies on two different adjacent sites with respect to the promoter site, were computed for each of the 14 catalysts, resulting in a total of 92 DFT-obtained descriptors.The energies of many expected intermediates along the investigated CO or methanol formation pathways are correlated to each other (Fig. S7-S11 †).A bivariate correlation matrix was generated to detect descriptor pairs whose absolute value of the Pearson coefficient was above 0.9.Based on this correlation analysis, a reduced subset of 27 DFT descriptors (from the total of 92), which are not correlated with each other, were selected for the model generation.
Many of the DFT-derived descriptors are expected to have an impact on the catalyst's performance, some more others.These correlations may be linear, but are not limited to that.To identify the key descriptors and account for nonlinear correlations, we use the SISSO (sure independence screening and sparsifying operator) algorithm as introduced by Ouyang et al. 19 to develop a predictive model for the methanol yield.1][22][23] The employed primary features for our study consist of 27 DFTobtained descriptors and the reaction temperature T. We use the term secondary descriptor to refer to any non-linear combination of primary features, generated by SIS (sure independence screening).
The model complexity is determined by two parameters: the number of non-zero coefficients or dimension, and the number of allowed mathematical operations to create secondary descriptors.For the sake of brevity, we use D and O to refer to the dimension and number of allowed operations, respectively.Considering the high sensitivity of ML models to data scaling, prior to determining the optimum model complexity, various methods for preprocessing of the primary features were employed to obtain the most suitable scaling method in terms of accuracy.We report four different methods, which were used to generate SISSO models with a reference complexity of 3D,3O, and use the root mean square error (RMSE) for each method to assess their performance (see Table 1).For numerical stability, all methods employed a common step of applying a shift of E min − E 0 to all DFT-computed values, where E min is the lowest adsorption energy among calculated values, and E 0 is a small positive value (0.1 eV for all DFT-computed descriptors) to ensure that there are no zero values in the dataset.Afterwards, the following methods were employed one at a time to scale all primary features, i.e.DFT-obtained and T: (i) scale values for each primary feature between 0 and 1, (ii) divide values for each primary feature by the smallest value, (iii) replace T with log(T), (iv) remove T from primary features, and divide all DFT-obtained values by T/T 0 , where T 0 is the lowest reaction temperature, 225 °C.
As shown in Table 1, the first pre-processing method yields the lowest RMSE for both test and train sets, and is therefore selected for creating the SISSO model.After scaling the primary features, the optimum model complexity should be determined.The accuracy of the trained model is expected to increase with complexity.However, overfitting by highly complex models must be avoided to facilitate generalization of the model to validation data, i.e. data not used in training.To establish the optimum trade-off between model accuracy and generalization ability, we adapt the leave-one-group-out cross validation procedure. 24In our work, we consider 16 levels of model complexity; from 1D,1O to 4D,4O.For each complexity level, a leave-one-group-out cross validation 25 is carried out, where each group is represented by one of the 14 tested catalysts.This means that for each cross validation, a model is trained with data from 13 catalysts, and tested with the unseen data from the 14th catalyst.Thereby, the predictive power of each trained model is tested with unseen data from a catalyst with a completely different composition than that of the training set.The overall cross validation performance is evaluated by averaging the regression coefficient R 2 across the trained models, which is a widely accepted method for evaluating the accuracy of ML models with small datasets. 26,27he cross-validation results are shown in Fig. 2a.As expected, the highest complexity (4D,4O) gives the highest average R 2 of 0.95.But the respective RMSE values (Fig. S12 †) suggest that a model with D = 3 and O = 1 is optimal for avoiding overfitting and yields the lowest average RMSE for the test set, which is selected for generation of the SISSO model in this study, which is shown in eqn (1): where c 1 -c 4 are fitting constants, T is the reaction temperature, and A, B, C, D and E each represent a primary DFT feature, as shown in Table 2 and Fig. 2d.The scaling of primary features between 0 and 1 allows for linking the values of fitting constants c 1 -c 3 to the importance of the respective SIS-constructed descriptors.The first term in eqn (1) which includes the temperature, has the highest weight (c 1 = 49.48),meaning that the employed SISSO algorithm follows the general expectation of higher yield at elevated temperature.This highlights the relevance of test conditions to performance optimization.Despite having a significant impact on the performance, the reaction temperature should not be vastly varied to enhance the performance, because elevated temperatures, apart from Catalysis Science & Technology Communication reaching thermodynamic limits, may also increase the risk of structural changes that degrade the catalyst. 28A valid indicator of a potentially promising catalyst candidate may be the ability to perform well under industrially relevant conditions close to conventional methanol process conditions. 29,30In this study, the registered performance data were chosen in a way that the data used were not recorded under thermodynamic equilibrium conditions and test parameters are chosen accordingly.
From an atomistic perspective, eqn (1) shows us the desirable properties of a high-performing catalytic site.For the DFT-derived transformed and scaled descriptors, low values imply strong adsorption and vice versa.We notice that strongly bound but not activated linear CO 2 (Fig. 2d A) negatively impacts performance.The other contributing factor is the optimal balance between the propensity of the site to bind an OH group, next to a CO (CO 2 ) molecule, and a solely bound CO adsorbed at the active site.The key to improving performance is improving the affinity towards the substrate with an OH group in close proximity (Fig. 2d B and D) in relation to the substrate bound alone (Fig. 2d C and E).This can be interpreted as a leaving group from the substrate or H atom to be transferred to the substrate backbone.
The primary and secondary features from eqn (1) are used as potential descriptive parameters to establish design criteria for increased catalytic performance.To accomplish this, we used the subgroup-discovery (SGD) algorithm 6,31,32 which is designed to identify distinct subgroups in the data and describe most exceptional subgroup through the most relevant descriptive parameters from the provided set of parameters.The algorithm detects several competing subgroups, which are evaluated based on a quality function (see ESI † Methods-2 for more information).
The SGD results reconfirm that the reaction temperature and the SIS-constructed descriptors B/C and D/E are the key features for describing outstanding performance.2c).The catalysts that form the subgroup and are located in this area are In-Pt, In-Pd, and In-Rh, while the catalyst that is furthest away from the detected subgroup area is In-Sn,

Conclusions
We propose a workflow that utilizes atomic scale interaction parameters derived from high-throughput computational simulations to supplement the information available from high-throughput experimental data.This allows for a more comprehensive understanding of the catalyst optimization process going beyond traditional design of experiment methods.Our machine learning models trained with this enriched input descriptor space have the potential to predict the performance of new catalysts, such as the unseen Pt-In system, based on the training on other promoted systems (e.g.Rh, Pd, Nb, etc.) alone.Additionally, the workflow also opens prospects to further reduce future computational cost by identifying the key descriptors that have the highest predictive power for a given catalytic performance, and simultaneously construct design criteria for new catalysts.By applying this workflow to extensively studied indium based catalysts, we show that the design rules obtained align well with our understanding of chemical principles and highlight the influence of hydroxyl sites on the catalytic activity.The machine learning algorithms (SISSO and SGD) used in this study also offer the advantage of explainability, making them useful in catalyst applications, where experimental data are often limited.The results of this study demonstrate the potential of the workflow to identify descriptors and rules with sufficiently high predictive power for the potential development of new catalyst materials, making it a valuable tool for addressing the needs of a more modern and agile approach to industrial catalyst and process development.
Fig. 2c depicts the detected constraints for B/C ≤ 1.04 and D/E > 1.58 with hatched lines.The color map reflects the obtained values from eqn (1) for T = 250 °C and A = A In , i.e. the respective value for the In reference catalyst.The variation ranges for B/C and D/E cover the minimum and maximum for each descriptor across the entire dataset.The colour of each scatter point in this figure reflects the measured value for the respective catalyst at the same temperature.The prediction and measurement show good agreement, which is particularly relevant in the area detected by SGD (Fig.

Fig. 2
Fig. 2 a) Evaluation of trained models after cross validation for different model complexities.b) Model accuracy for a selected complexity of 3D and 1O.c) The hatched lines show the detected constraints on SIS-constructed descriptors for outstanding subgroups.The color bar on the right shows the respective values for Y methanol .d) Relaxed structures of intermediates identified by SISSO.A Pt promoted system is shown as an example.Carbon, oxygen and hydrogen atoms are shown as gray, red and white spheres respectively.Letters A-E are used to refer to each relaxed configuration (see Table2).
Fig. 2 a) Evaluation of trained models after cross validation for different model complexities.b) Model accuracy for a selected complexity of 3D and 1O.c) The hatched lines show the detected constraints on SIS-constructed descriptors for outstanding subgroups.The color bar on the right shows the respective values for Y methanol .d) Relaxed structures of intermediates identified by SISSO.A Pt promoted system is shown as an example.Carbon, oxygen and hydrogen atoms are shown as gray, red and white spheres respectively.Letters A-E are used to refer to each relaxed configuration (see Table2).

Table 1
Obtained RMSE for train and test datasets for each preprocessing method, as described in the text.All values are obtained for a SISSO model with 3D,3O

Table 2
Values for the respective constants and descriptors in eqn (1).E ads = adsorption energy.For parameters A-E, the respective configuration from Fig. 2d is mentioned in brackets This journal is © The Royal of Chemistry 2023 which in fact shows the lowest measured yield among tested catalysts.The overlay of SISSO-generated values (shown by the colour map) and SGD-detected area (shown by hatched lines) highlights the ability of the employed workflow in autonomous prediction of exceptional performance.The developed design criteria in Fig. 2c can be used to explore new catalytic materials by computing the values for SIS-