Amitabha
Das‡
a,
Diptendu
Roy‡
a,
Shyama Charan
Mandal§
ab and
Biswarup
Pathak
*a
aDepartment of Chemistry, Indian Institute of Technology Indore, Indore 453552, India. E-mail: biswarup@iiti.ac.in
bSUNCAT Center for Interface Science and Catalysis, SLAC National Accelerator Laboratory, 2575 Sand Hill Road, Menlo Park, CA 94025, USA
First published on 18th March 2024
The CO2 reduction reaction is a promising way to reduce the CO2 level in the environment and most importantly to produce C1-based chemicals (HCOOH and CH3OH) that can be used as liquid fuels and industrial chemicals. In this regard, homogeneous Mn-PNP based catalysts are found to be suitable for the reaction and show promising potential for enhancing activity through ligand modification. Herein, a novel ligand encoded feature matrix enabled machine learning (ML) model has been developed to screen efficient catalysts from a large search space of earth-abundant aromatic Mn-PNP catalysts using the effects of different ligands present on the different ligand sphere of the catalysts. The ML models based on gradient boosting (GBR and XGBR) were found to be the best performing ML models with a density functional theory (DFT) level of accuracy. Potential catalysts for HCOOH and CH3OH formation are identified based on the overall reaction free energy barrier through ML + DFT. The importance of different regions (R1 and R2) and the effect of ligand substituents (+I/−I) in the catalyst are unleashed. Furthermore, a favorable mechanism to produce HCOOH has been ascertained.
It was reported that a subtle change in the ligand sphere can change the overall activity of the catalyst.12,15,16 Similarly, the electronic environment of the ligand has a huge influence on the catalyst activity. Thus, the activity of any catalyst can be significantly enhanced through ligand substitution. The various possibilities in the unknown ligand space are then confronted with the challenge of choice from the vast structural possibilities. As a result, there has been a long-standing interest in predicting a given ligand's likely impact on the structure and reactivity of organometallic complexes through parameterization. In the absence of insight into the correlation between ligand and catalyst, new catalysts are developed mostly through a series of tedious trial and error cycles guided by chemical intuition either using density functional theory (DFT) methods or experiments. It takes a lot of time, money, and effort to complete the process. The possibilities of catalyst combinations are vast considering the available ligand space. Therefore, the existing strategies clearly fail to address the full scope of the challenge. This raises the importance of developing faster, efficient, and more reliable data-driven catalyst design techniques to guide future experiments. In this regard, machine learning (ML)-based regression techniques have emerged as a promising tool.17–19 Through specific strategies, ML algorithms are able to learn underlying principles behind vast amounts of data through mathematical theory and rigorous data analysis. These laws are used to achieve the purpose of prediction. Thus, the active catalyst search for the reaction can be accelerated using ML models. Furthermore, the activity of the catalyst or any chemical transformations are evaluated through the activation barriers. For the ML prediction of the activation barriers, the mostly used features are the DFT calculated features, which are in a way time consuming and computationally expensive.20–22 Moreover, some features suffer due to the lack of interpretation. Thus, it is necessary to develop a feature that does not require DFT calculations and can alone predict the activation energy barrier.
Herein, under the assumption of the great influence of the ligand in the aromatic Mn-PNP catalysts for CO2 conversion to C1 products (HCOOH and CH3OH), we set out to identify the best set of ligands and overall catalysts for the reactions. To accomplish this, we used regression models using ligand encoded matrix-based features for all the considered ligands and further tested the reliability of the best fitted ML models through cross-validation (CV) analysis and DFT calculated results.
Natural bond orbital (NBO) analysis is done to understand the charge distribution of different atoms.35
The consideration of the DFT calculated features would take a great deal of time and resources for such a large number of possible catalysts. Therefore, we introduced a ligand encoded matrix-based feature for simplicity and better adaptability of the model like the microstructural feature introduced in heterogeneous catalysis.44,45 The newly introduced matrix-based feature can speed up the process of training and prediction as the matrix contains only the information about variable parts i.e., ligand substituents. This feature is obtained for each catalyst through a ligand encoding technique considering the kind of ligand and the number of specific ligands present on different substitution sites R1 (L1 and L2) and R2 (L3) in the catalysts. These variants of substitution sites are decided according to the distance of the ligands from the metal centre. An illustration of the feature matrix with a few catalysts is shown in Fig. 2. Furthermore, each catalyst's individual matrices are combined into a large matrix capable of containing all the structural information. A total of 16 features (Fig. 2) have been considered and these features are the different ligands present in a specific region of the catalysts. In catalyst 1 (Fig. 2), two Et and two CF3 groups are present in the R1 region, and one H is present in the R2 region and therefore the feature values for these features are 2, 2 and 1, respectively and the rest of the feature values are 0 in the matrix for the catalyst. In this way, a feature matrix has been prepared for 330 catalysts for our machine learning study. The newly introduced feature matrix can reduce the time and cost of feature selection and feature engineering.
Two different reaction mechanisms such as Noyori and revised Noyori (RN) are reported for the conversion of CO2 to HCOOH (Scheme 1a).4,24 The subsequent step involves the reaction between the HCOOH and the co-catalyst (morpholine) to form N-formylmorpholine.10,24 There are two competing pathways for the hydrogenation of N-formylmorpholine and we have considered the most favorable mechanism for the formation of CH3OH (CO bond hydrogenation followed by C–N bond; Scheme 1b).24,46 There are three important transition states for the formation of HCOOH and CH3OH from CO2, namely hydride transfer to CO2 (TS1), heterolytic H2 cleavage to regenerate the active catalyst (TS2) and heterolytic H2 cleavage to transfer a proton (TS3) to HCOO− (Scheme 1a). In the RN mechanism, either TS1 or TS3 can contribute to the overall reaction free energy barrier (ΔΔG‡RN),8,24,47 whereas TS2 is the most important step that controls the overall reaction free energy barrier of the Noyori type mechanism (ΔΔG‡N).3,7 Similarly, CH3OH can be produced through the indirect mechanism (free energy barrier: ΔΔG‡M), where TS2 is the most important transition state.3,24 Therefore, in all the cases, the calculated ΔΔG‡ for (Fig. 1d) each individual mechanism using the transition state model is considered to be a suitable descriptor of the overall reactions (ΔΔG‡RN, ΔΔG‡N and ΔΔG‡M). Additionally, the activation barrier (ΔG‡) for each of these transition states (TS1: ΔG‡TS1; TS2: ΔG‡TS2; and TS3: ΔG‡TS3) can also be a suitable descriptor to identify the activity of the catalyst for a specific step of the reaction.
Scheme 1 Possible CO2 hydrogenation mechanisms to (a) produce HCOOH and (b) the hydrogenation of N-formylmorpholine to CH3OH with an aromatic Mn-PNP based catalyst. |
Through DFT calculations, a training dataset (44 catalysts) was generated with proper sampling of the catalysts to make sure all the ligand substituents are present in both the regions of the catalysts (Table S1, ESI†). Herein, several regression models were used to achieve all the considered output descriptors. Initially considered ML models were trained and evaluated using an 80:20 train-test split of the DFT calculated datasets.45,48 The workflow is summarized in Fig. 1b. The performance of each regression mode is estimated with train-test root-mean-square error (RMSE) to avoid the overfitting condition.
Five ML models were implemented, namely ordinary linear regressor, kernel ridge regressor (KRR: kernel-based algorithm), Random-forest regressor (RFR: tree-based algorithm made of bagging technique) implemented in the scikit-learn package49 and gradient boosting frameworks such as gradient boosting regressor (GBR) and eXtreme gradient boosting regressor (XGBR).50 The NumPy and Pandas libraries were used for the mathematical functions and data pre-processing, respectively, along with Matplotlib and Seaborn for the plotting of numerical data and visualization.51–53 Through the hyperparameter tuning of all the algorithms, we achieved the best possible values of the hyperparameters, resulting in optimized algorithms (Table S2, ESI†). The performance of all the considered models is given in Table S3 (ESI†). The best performing ML model for each descriptor was chosen based on the lowest train-test errors. We have found that the optimized XGBR has the lowest train-test RMSE for both ΔΔG‡RN and ΔΔG‡M and the GBR model for ΔΔG‡N (Table 1). The plots of DFT calculated versus predicted ΔΔG‡ are shown in Fig. 3. Similarly, for other descriptors (ΔG‡TS1, ΔG‡TS2 and ΔG‡TS3), either of the gradient boosting regressors (GBR or XGBR) performed well with the least test RMSE (Table 1 and Fig. S1, ESI†). All the train-test RMSEs for the best fitted models are reasonable enough to be considered for suitable predictions of unknown catalysts. In the case of ΔΔG‡RN, the calculated RMSE is slightly higher compared to the other cases, underlying the fact that either TS1 or TS3 can contribute in the ΔΔG‡RN and both of these transition states are chemically different from each other. In contrast, for the individual case of the ΔG‡TS1 and ΔG‡TS3 the test RMSE is significantly low (Table 1). Taking all these results into account, it appears that the learning complexity rises to the ML model for the training data of descriptor ΔΔG‡RN, resulting in a slightly higher RMSE.
Name | ΔΔG‡N | ΔΔG‡RN | ΔΔG‡M | ΔG‡TS1 | ΔG‡TS2 | ΔG‡TS3 |
---|---|---|---|---|---|---|
Method | GBR | XGBR | XGBR | XGBR | GBR | GBR |
Train (RMSE) | 0.01 | 0.08 | 0.03 | 0.04 | 0.01 | 0.01 |
Test (RMSE) | 0.09 | 0.14 | 0.12 | 0.07 | 0.10 | 0.08 |
Fig. 3 Plot of DFT calculated vs. ML predicted ΔΔG‡ and feature importance for (a) Noyori, (b) revised Noyori and (c) CH3OH formation mechanisms. |
To ensure the stability and generalizability of the best fitted ML models, we have performed 5-fold CV analysis (Fig. S2, ESI†). The average RMSE of CV in each case is close to the test RMSE (Fig. 4a and Table S4, ESI†). This reflects that all the best fitted ML models are suitable and generalizable even though the training dataset is small. The reliable performance of the ML models could be due to the proper sampling of all the data points to prepare the training dataset.54,55 Considering the best fitted ML models for individual descriptors with optimized hyperparameters, all the descriptors (overall reaction free energy barriers (ΔΔG‡) and activation barriers (ΔG‡) of individual transition states) were predicted for the rest of the 286 catalysts (Table S5, ESI†). Further validation of our predictions is also carried out with 5 arbitrary catalysts (Fig. S3, ESI†) from the predicted data points calculating all the considered output descriptors using the DFT method. Interestingly, it was observed that the predicted values closely matched with the DFT calculated values (Fig. 4b and Table S6, ESI†), which again suggests that all the best fitted models are stable and suitable for prediction.
Fig. 4 (a) 5-fold CV analysis with the best fitted ML models for all the descriptors and the Avg. is the mean RMSE of the CV, and (b) comparison of DFT calculated vs. ML predicted ΔΔG‡ of all the catalysts considered in Fig. S3 (ESI†). |
Based on DFT calculation and ML prediction, the RN type mechanism was found to be the favourable pathway over the Noyori type mechanism for aromatic Mn-PNP catalysts to produce HCOOH.3,24 The TS2 accounted for the overall high ΔΔG‡N for the Noyori-type mechanism, thus making the mechanism not suitable for HCOOH formation (Scheme 1a and Table S1, ESI†). In contrast, in the RN-type mechanism, the barriers for both TS1 and TS3 are low (Scheme 1a and Table S1, ESI†), thereby making the pathway more favourable for HCOOH formation.
The permutation feature importance analysis (Fig. 3) shows the contribution of different ligands present on different positions of the catalysts towards ΔΔG‡ for the considered reactions.54 The ligands present in the first region have higher contributions as compared to the second region, which shows that the former is the most important part to tune the activity of aromatic Mn-PNP for a specific reaction. Ligands having a stronger/moderate −I effect are found to adversely affect the activation barriers of the reactions (Table S1, ESI†). Conversely, groups with +I effects are observed to be beneficial for the reactions. Using DFT+ML, we have identified the top-performing catalysts for the formation of CH3OH and HCOOH from CO2 (Fig. 5 and Fig. S4, ESI†). For the HCOOH formation, along with the three top performing catalysts (1 to 3) (Fig. S4, ESI†), we have proposed two symmetric catalysts (4 and 5) from the ease of synthesis perspective, which are found to be comparable in terms of activity with top performing catalysts.
For the CH3OH formation, the suitable ligands (–iPr and –tBu) mostly have a strong/moderate +I affect, in agreement with the experimentally reported suitable catalysts for the reaction.10,12 The –iPr is the most suitable ligand substitution in both the regions though –tBu is the group having the strongest +I effect. The presence of a –tBu group in both the regions together creates steric hindrance, which reduces the available space for reactants to approach the active center. This, in turn, has an adverse effect on the overall ΔΔG‡ of the reactions. In the case of HCOOH formation, the two different transition states (TS1 and TS3) play an important role and both these steps are chemically different in nature. Thus, catalysts having moderate +I and weak −I effecting groups are found to be the most suitable.
The natural bond orbital (NBO) analysis of the two catalysts having strong +I/−I effecting (Fig. S5, ESI†) groups showed that the presence of such groups largely affects the overall charge distribution of the catalysts. In the case of TS2, the +I effecting groups shifted the electron density towards the aromatic ring, thus stabilizing the overall transition state. At the same time, the higher positive charges on the P-atoms helped the heterolytic hydrogen cleavage, whereas the −I effecting groups pull the electron density from the ring thereby destabilizing the overall transition states for CH3OH formation. On the other hand, good catalysts for HCOOH formation should have an overall balanced charge distribution, which can control both TS1 and TS3. Therefore, moderate +I and weak −I effecting groups are suitable for the HCOOH formation. This proof-of-concept model may be very helpful for experimentalists to design catalysts in no time for the conversion of CO2 to HCOOH and CH3OH.
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3ya00520h |
‡ A.D. and D.R. have contributed equally. |
§ Present address: SUNCAT Center for Interface Science and Catalysis, Department of Chemical Engineering, Stanford University, 443 Via Ortega, Stanford, CA 94305, USA. |
This journal is © The Royal Society of Chemistry 2024 |