Open Access Article
Hinata Sudo,a Yoshiki Hasukawa,a Rensuke Koiwai,a Fernando Garcia-Escobar,a Shun Nishimura,b Lauren Takahashi*a and Keisuke Takahashi*ac
aDepartment of Chemistry, Hokkaido University, North 10, West 8, Sapporo 060-0810, Japan. E-mail: lauren.takahashi@sci.hokudai.ac.jp; keisuke.takahashi@sci.hokudai.ac.jp
bGraduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa 923-1292, Japan
cList Sustainable Digital Transformation Catalyst Collaboration Research Platform, Institute for Chemical Reaction Design and Discovery, Hokkaido University, Sapporo 001-0021, Japan
First published on 13th April 2026
The role of highly uniform, diverse experimental data in catalyst informatics is examined using an oxidative coupling of methane dataset measured by a single researcher under consistent devices and conditions. Broad compositional coverage and minimized experimental variability enable machine learning to capture composition–performance relationships using simple one-hot encoding. Inverse analysis of the compositional space identifies promising catalysts for experimental validation. These results demonstrate that carefully curated, well-distributed datasets, even if relatively small, enable machine learning to effectively capture composition–performance relationships.
In this work, the impact of uniformly generated and well-dispersed catalyst datasets, produced by a single researcher using the same experimental devices and environment, is systematically investigated to elucidate how such data quality influences machine-learning performance. The oxidative coupling of methane (OCM) reaction is selected as a prototypical system. OCM aims at the direct conversion of methane into C2 hydrocarbons, primarily C2H4 and C2H6.19–21 The dataset used in this study is derived from previous work in which all catalysts are evaluated by a single researcher under strictly controlled experimental conditions, including identical experimental setups and standardized operating procedures.22 By minimizing experimental variability while maintaining broad compositional diversity, this dataset provides an ideal platform for catalyst informatics. Using this dataset, comprehensive data analysis and supervised machine-learning approaches are applied to uncover structure–performance relationships and to guide catalyst design for the OCM reaction.
:
1, and the furnace length is fixed at 270 mm. Furthermore, for comparison, preprocessed literature data are also utilized.22 The physical quantities used for the catalyst descriptors are from the XenonPy library.23
Data preprocessing is performed prior to data analysis and machine learning. Data points exhibiting negative O2 or CH4 conversion values are removed. In addition, data points with selectivities exceeding 100% for H2, CO, CO2, C2H4, or C2H6 are excluded due to physical inconsistency and experimental noise. To eliminate the influence of varying experimental conditions and to focus on catalyst and support effects, data collected at 700 °C with a CH4/O2 ratio of 3:1 and a furnace length of 270 mm are extracted for machine-learning analysis. Furthermore, only binary and ternary catalyst compositions are retained. As a result, the number of data points is reduced to 2124. The catalyst composition and support are represented using one-hot encoding.
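The filtering and encoding steps described above can be sketched with pandas as follows. This is a minimal illustration, not the authors' script: the column names (`CH4_conv`, `temp_C`, `furnace_mm`, `M1` to `M3`, `support`, and so on) are hypothetical, since the schema of the dataset is not given in the text, and element presence is encoded irrespective of the slot in which the element appears.

```python
import pandas as pd

# Hypothetical column names; the actual dataset schema is not given in the text.
SELECTIVITY_COLS = ["H2_sel", "CO_sel", "CO2_sel", "C2H4_sel", "C2H6_sel"]
ELEMENT_COLS = ["M1", "M2", "M3"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the filters described in the text and one-hot encode the composition."""
    # Drop physically inconsistent rows: negative conversion or selectivity > 100 %.
    ok = (df["CH4_conv"] >= 0) & (df["O2_conv"] >= 0)
    ok &= (df[SELECTIVITY_COLS] <= 100).all(axis=1)
    # Keep a single set of reaction conditions.
    ok &= (df["temp_C"] == 700) & (df["CH4_O2_ratio"] == 3) & (df["furnace_mm"] == 270)
    # Keep binary and ternary compositions (two or three named elements).
    ok &= df[ELEMENT_COLS].notna().sum(axis=1).isin([2, 3])
    kept = df.loc[ok]

    # One 0/1 presence column per element, regardless of the slot it occupies.
    elements = sorted({e for e in kept[ELEMENT_COLS].to_numpy().ravel() if pd.notna(e)})
    element_flags = pd.DataFrame(
        {f"el_{e}": kept[ELEMENT_COLS].eq(e).any(axis=1).astype(int) for e in elements},
        index=kept.index,
    )
    # One 0/1 column per support.
    support_flags = pd.get_dummies(kept["support"], prefix="sup").astype(int)
    return pd.concat([element_flags, support_flags, kept[["C2_yield"]]], axis=1)
```

The returned frame holds the descriptor columns followed by the target column, so it can be split into `X` and `y` directly before model fitting.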
Supervised machine learning is performed using random forest regression (RFR) as implemented in scikit-learn.24 The random state is fixed, and the number of trees is set to 100. Model performance is evaluated by repeated random cross-validation, in which the dataset is randomly split into 80% training and 20% test sets. The reported performance corresponds to the average R2 score on the test data obtained from 10 independent train–test splits.
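The evaluation protocol can be sketched with scikit-learn as below. The text states only that the random state is fixed, so the seeding of the individual splits (`seed + i`) and the helper name `repeated_holdout_r2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def repeated_holdout_r2(X, y, n_repeats=10, test_size=0.2, n_trees=100, seed=0):
    """Mean test R2 of a random forest over independent random train-test splits."""
    scores = []
    for i in range(n_repeats):
        # A different split seed per repeat (assumed); the forest seed stays fixed.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + i
        )
        model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(r2_score(y_te, model.predict(X_te)))
    return float(np.mean(scores)), scores
```

Averaging over several splits reduces the dependence of the reported score on any single partition, which matters for a dataset of roughly two thousand points.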
| Catalyst | Support | M1 | M2 | M3 |
|---|---|---|---|---|
| NaBaMg/La2O3 | La2O3 | NaNO3 | Mg(NO3)2·6H2O | Ba(NO3)2 |
| NaCaBa/La2O3 | La2O3 | NaNO3 | Ca(NO3)2·4H2O | Ba(NO3)2 |
| NaBaLa/La2O3 | La2O3 | NaNO3 | Ba(NO3)2 | La(NO3)3·6H2O |
For catalyst preparation, 2 g of La2O3 is added to 100 mL of deionized water. Metal nitrate salts are dissolved in 50 mL of deionized water so that the total molar fraction of the added metals is 3%. The metal precursor solution is added to the La2O3 suspension in the order M1, M2, and M3 at 5 min intervals under stirring, followed by stirring for 60 min and aging overnight at room temperature. The mixture is then heated under stirring to remove water. After drying at 80 °C for 8 h, the obtained solid is ground into a fine powder and calcined at 800 °C for 3 h, with heating and cooling rates of 800 °C h−1. A reference catalyst consisting of La2O3 only is prepared following the same procedure.
Catalytic performance is evaluated using a CH4/O2/N2 gas mixture with flow rates of 8.0/4.0/16.0 mL min−1 at 600, 650, 675, 700, 725, 750, 800, and 850 °C. At each temperature, the reaction is conducted for 10 min before gas sampling. Reaction products are analyzed using a Shimadzu GC-2014 gas chromatograph equipped with a SHINCARBON ST 50/80 column. Conversions of CH4 and O2, as well as yields of CO, CO2, C2H4, C2H6, and C2 products and C2 selectivity, are calculated using N2 as an internal standard according to eqn (1)–(4). In eqn (1) and (2), in and out represent the inlet feed and outlet effluent streams, respectively, which are used to calculate the conversion of reagents (R) such as CH4 and O2. In eqn (2), n is equal to 1 for CO and CO2 and 2 for C2H4 and C2H6.
![]() | (1) |
![]() | (2) |
| C2 yield = C2H4 yield + C2H6 yield | (3) |
![]() | (4) |
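Because eqn (1), (2), and (4) are not reproduced in this text, the following sketch uses the commonly employed internal-standard definitions as an assumption: conversion from the N2-normalized inlet and outlet amounts of the reagent, carbon-based yield with the factor n, and C2 selectivity as C2 yield divided by CH4 conversion. The function names are illustrative, and the exact forms used by the authors may differ.

```python
def conversion(r_in, r_out, n2_in, n2_out):
    """Conversion (%) of reagent R, normalized by N2 (assumed form of eqn (1))."""
    return (1.0 - (r_out / n2_out) / (r_in / n2_in)) * 100.0


def product_yield(p_out, n_carbon, ch4_in, n2_in, n2_out):
    """Carbon-based yield (%) of product P (assumed form of eqn (2)).

    n_carbon is n in the text: 1 for CO and CO2, 2 for C2H4 and C2H6.
    """
    return n_carbon * (p_out / n2_out) / (ch4_in / n2_in) * 100.0


def c2_selectivity(c2_yield, ch4_conversion):
    """C2 selectivity (%) as C2 yield over CH4 conversion (assumed form of eqn (4))."""
    return c2_yield / ch4_conversion * 100.0


# Inlet amounts follow the CH4/O2/N2 feed of 8.0/4.0/16.0 mL min-1 given in the text.
# The outlet amounts are made-up illustration values, not measured data.
ch4_conv = conversion(r_in=8.0, r_out=5.6, n2_in=16.0, n2_out=16.0)
c2_yield = (
    product_yield(p_out=0.4, n_carbon=2, ch4_in=8.0, n2_in=16.0, n2_out=16.0)    # C2H4
    + product_yield(p_out=0.2, n_carbon=2, ch4_in=8.0, n2_in=16.0, n2_out=16.0)  # C2H6, eqn (3)
)
c2_sel = c2_selectivity(c2_yield, ch4_conv)
```

Normalizing by N2 corrects for the change in total molar flow during the reaction, which is why the inert gas is included in the feed.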
As shown in Fig. 2(a), a pairwise correlation map is constructed between CH4 conversion, C2 yield, selectivity, and the presence of catalyst elements and supports at 700 °C. Fig. 2(a) indicates that the La2O3 support correlates positively with both CH4 conversion and C2 yield, suggesting that La2O3 is an active support under OCM reaction conditions. Similarly, certain alkaline earth elements, such as Sr and Ca, show weak but positive correlations with C2 yield, implying a potential promoting effect on C2 formation. Random forest importance analysis of elements and supports toward C2 yield is also performed, as shown in Fig. 2(b) and (c), respectively. Note that a high importance score does not necessarily imply a positive impact on the target variable, so the importance scores are interpreted together with the signs of the correlations in Fig. 2(a). Fig. 2(b) reveals that Ca has the highest importance for C2 yield, followed by other alkaline earth metals such as Sr and Ba. Combining these results with the pairwise correlations in Fig. 2(a), it can be inferred that these alkaline earth elements exert a beneficial effect on C2 formation. Furthermore, although W also exhibits relatively high importance, its negative correlation in Fig. 2(a) suggests a detrimental effect on C2 yield. Similarly, Fig. 2(c) shows that La2O3 has the highest importance among the supports, which, consistent with Fig. 2(a), indicates its significant role in enhancing catalytic performance. While MgO also shows relatively high importance and a certain trend in Fig. 2(a), the factors behind its contribution are discussed in the following section in conjunction with the results shown in Fig. 3.
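The distinction between importance and direction of effect can be illustrated on synthetic data. The sketch below does not use the OCM dataset; the three binary columns and their effect sizes are invented purely to show that a feature which lowers the target can still receive a high, unsigned importance score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in data (not the OCM dataset): three one-hot "element" columns.
X = rng.integers(0, 2, size=(400, 3))
# Feature 0 raises the target, feature 1 lowers it, feature 2 is irrelevant.
y = 5.0 * X[:, 0] - 5.0 * X[:, 1] + rng.normal(0.0, 0.5, size=400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importance = model.feature_importances_  # unsigned scores that sum to 1
# Pearson correlation of each presence flag with the target carries the sign.
correlation = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

for j in range(X.shape[1]):
    print(f"feature {j}: importance={importance[j]:.2f}, correlation={correlation[j]:+.2f}")
```

Reading both quantities side by side mirrors the combined use of the importance plots and the correlation map in the analysis above.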
To evaluate the overall influence of catalyst elements and supports on C2 yield, violin plots are constructed. The distributions of C2 yield as a function of individual elements and supports are shown in Fig. 3. Fig. 3 indicates that catalysts containing Sr and Ca tend to exhibit higher C2 yields, consistent with the positive correlations observed in the pairwise correlation map in Fig. 2. In addition, elements such as La and Mg are associated with relatively high C2 yields, whereas Cs, Bi, and Pb are generally linked to lower C2 yields. Notably, the interpretation is nontrivial because catalytic performance strongly depends on elemental combinations. As shown in Fig. 3(a), elements such as K, Na, Mn, and W are associated with both high and low C2 yields, depending on their pairing with other elements. This highlights the importance of combination effects rather than single-element contributions. Similarly, Fig. 3(b) shows that La2O3 supports tend to result in higher C2 yields, consistent with the analysis in Fig. 2. This catalytic performance may be attributed to the formation of subsurface peroxide species acting as active oxygen centers, which remain stable even at high temperatures.25 According to microkinetic analysis, these stable surface oxygen sites are suggested to promote methane dissociation and subsequent methyl radical formation, thereby facilitating gas-phase C2 coupling reactions.26 In a similar manner, Fig. 3(b) also indicates that MgO supports lead to enhanced C2 yields, which supports the trends observed in Fig. 2(a) and (c). This performance may be attributed to the presence of surface defects, particularly steps, which serve as active centers for the oxidative activation of methane. Kinetic studies suggest that these step sites facilitate a surface-mediated coupling process, contributing to high initial C2 selectivity.27 However, the support effect is also strongly coupled with the choice of active elements, as support performance varies substantially depending on elemental pairing.
Machine learning modeling is performed to predict highly active OCM catalysts. One-hot encoded representations of catalyst elements and supports are used as descriptor variables, and the objective variable is the C2 yield. Because experimental conditions strongly affect catalytic performance and can obscure composition–performance relationships in machine learning analysis, the reaction temperature and CH4/O2 ratio are fixed at 700 °C and 3, respectively. The comparison between predicted and experimentally measured C2 yields is shown in Fig. 4(a). The model achieves a cross-validated coefficient of determination (R2) of 0.74, indicating good predictive performance under these constrained conditions. For comparison, 58 physical quantities from XenonPy are used as descriptors, which also results in an R2 of 0.74, as shown in Fig. 4(b); however, the mean absolute error (MAE) is slightly better in the one-hot encoding case. Furthermore, literature data are evaluated with one-hot encoding at the same fixed temperature and CH4/O2 ratio of 700 °C and 3, respectively, as shown in Fig. 4(c). Fig. 4(c) shows that inconsistent data result in poor machine learning performance, underscoring the importance of consistent data.
It should be noted that the highly diverse dataset, generated by a single researcher under identical devices and conditions, carries rich information about the catalysts: the broad coverage of compositional space allows a model using even simple one-hot encoding to extract relationships between catalyst composition and performance directly from the data. This work demonstrates that the quality and consistency of the dataset are critical. Collecting such data is labor intensive but reduces experimental noise and variability. Carefully curated, well-dispersed datasets can be more effective for machine learning than larger but heterogeneous datasets compiled from multiple sources.
Fig. 4 True and predicted C2 yield using (a) one-hot encoding descriptors, (b) physical quantity descriptors, and (c) one-hot encoding with literature data.
An inverse analysis is performed to identify promising OCM catalysts. A total of 20 800 hypothetical binary and ternary catalyst combinations are generated from 25 elements (Na, Li, Mn, K, Sr, La, Ti, Ca, Ba, Mg, Rb, Y, Sm, Ce, Zn, Mo, Zr, Eu, Cs, Nd, Sn, W, Bi, Hf, and Pb) and 8 supports (BaO, CaO, La2O3, MgO, SiO2, TiO2, Y2O3, and ZnO). Each element–support combination is converted into a one-hot encoded representation and used as input for a trained random forest regression (RFR) model. The model is then applied to screen the full compositional space, and the top 3 predicted high C2 yield catalysts are summarized in Table 2.
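The size of the screened space can be checked by direct enumeration: 300 binary plus 2300 ternary unordered element sets, each paired with 8 supports, gives 20 800 candidates. The sketch below is an illustration under the assumption that candidates are unordered element sets (the order of addition is not encoded); the helper names and the feature layout are hypothetical, and `top_k` accepts any fitted regressor exposing `predict`.

```python
from itertools import combinations

ELEMENTS = ["Na", "Li", "Mn", "K", "Sr", "La", "Ti", "Ca", "Ba", "Mg", "Rb", "Y", "Sm",
            "Ce", "Zn", "Mo", "Zr", "Eu", "Cs", "Nd", "Sn", "W", "Bi", "Hf", "Pb"]
SUPPORTS = ["BaO", "CaO", "La2O3", "MgO", "SiO2", "TiO2", "Y2O3", "ZnO"]


def enumerate_candidates():
    """All unordered binary and ternary element sets, paired with every support."""
    element_sets = [c for k in (2, 3) for c in combinations(ELEMENTS, k)]
    return [(elems, sup) for elems in element_sets for sup in SUPPORTS]


def encode(elems, support):
    """One-hot vector: 25 element-presence flags followed by 8 support flags."""
    return [int(e in elems) for e in ELEMENTS] + [int(s == support) for s in SUPPORTS]


def top_k(model, k=3):
    """Rank every candidate by the prediction of a fitted regressor."""
    candidates = enumerate_candidates()
    preds = model.predict([encode(elems, sup) for elems, sup in candidates])
    ranked = sorted(zip(preds, candidates), key=lambda t: t[0], reverse=True)
    return ranked[:k]
```

For the ranking to be meaningful, the feature layout produced by `encode` must match the column order used when the model was trained.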
| Support | M1 | M2 | M3 | Predicted C2 yield (%) |
|---|---|---|---|---|
| La2O3 | Na | Ba | Mg | 15.914 |
| La2O3 | Na | Ca | Ba | 15.896 |
| La2O3 | Na | Ba | La | 15.869 |
Based on machine learning predictions, three catalyst compositions, Na–Ba–Mg/La2O3, Na–Ca–Ba/La2O3, and Na–Ba–La/La2O3, as shown in Table 2, are selected for experimental validation. The catalytic performances of these catalysts, together with La2O3 as a reference, are shown in Fig. 5. All experiments are independently repeated twice to confirm reproducibility. As shown in Fig. 5, all three machine-learning-guided catalysts exhibit substantially higher C2 yields than the La2O3 reference, demonstrating the effectiveness of the inverse design strategy. The maximum C2 yields are 19.2% at 750 °C for Na–Ba–Mg/La2O3, 19.2% at 750 °C for Na–Ca–Ba/La2O3, and 18.6% at 725 °C for Na–Ba–La/La2O3.
This journal is © The Royal Society of Chemistry 2026