Open Access Article
Xiao Niu,ab Zhiming Zhang,b Xiaoyu Wu,b Yan Liu,a Yong Cui*a and Jianwen Jiang*b
a State Key Laboratory of Synergistic Chem-Bio Synthesis, School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China. E-mail: yongcui@sjtu.edu.cn
b Department of Chemical and Biomolecular Engineering, National University of Singapore, 117576, Singapore. E-mail: chejj@nus.edu.sg
First published on 18th February 2026
With remarkably tunable porosity and modular chemistry, metal–organic frameworks (MOFs) present a versatile platform for photocatalytic hydrogen (H2) production. However, identifying high-performing and water stable MOFs from the vast design space is challenging. In this study, we develop a hierarchical screening strategy to accelerate the discovery of photocatalytically active MOFs with robust water stability. First, machine learning (ML) classifiers are trained on experimental H2 production data to predict photocatalytic performance, achieving high accuracy and excellent transferability. Then, starting from 11660 structures in the CoRE-MOF database, 1731 are shortlisted to be photocatalytically active. Detailed structure–performance analyses reveal that linker flexibility and aliphatic character positively correlate with H2 evolution activity, while excessive aromaticity and rigidity are detrimental. Finally, a water stability classifier is applied to further identify 419 MOFs to be simultaneously photocatalytically promising and water stable. The ML-guided strategy provides a quantitative and interpretable path toward the discovery of new MOFs as photocatalysts, and it would facilitate future experimental exploration for efficient photocatalytic H2 production.
With a remarkable tunability of building blocks and modular architectures, metal–organic frameworks (MOFs) have attracted significant attention as photocatalysts for hydrogen evolution reaction (HER).6,7 By independently modulating metal nodes and organic linkers, MOFs allow for precise tuning of electronic band gap, redox potential, and charge transport pathway, which are essential for achieving suitable band alignment and efficient carrier dynamics in photocatalytic HER.8 A handful of MOFs such as UiO-66(Zr),9 MIL-125(Ti)10 and MIL-53(Al)11 have been investigated for HER. However, their photocatalytic performance is significantly below the requirements for practical applications. This is primarily due to their low solar-to-hydrogen (STH) conversion efficiency,12,13 strong dependence on sacrificial agent, and severe electron–hole recombination. It is highly desired to develop new MOF-based photocatalysts that are highly efficient, water stable and readily responsive to visible light, thereby advancing solar-driven H2 production. At present, over 120 000 MOFs have been experimentally synthesized and thus it is not feasible to identify promising MOFs from such a large chemical space by conventional trial-and-error methods.
In the past several years, machine learning (ML) has emerged as a transformative data-driven tool, playing an increasingly important role in structure prediction, property evaluation, and materials discovery.14,15 By learning complex structure–property relationships from a data set, ML can accelerate the screening and design of new materials. In the field of MOFs, ML has been successfully applied to predict gas adsorption and diffusion, mechanical strength, and water stability.16–18 There were also a few ML studies for photocatalytic water splitting. By combining density functional theory (DFT) calculations and ML, Wang et al. screened over 20 375 MOFs in the Quantum-MOF (QMOF) database and identified 14 MOFs with superior overall water splitting potential.19 Similarly, Mourino et al. also applied DFT and ML to evaluate 314 MOFs, highlighted the role of band gap, band alignment and charge separation in photocatalytic activity, and identified promising structural motifs such as Ti clusters and rod-shaped metal nodes.20 Despite demonstrating the robustness of ML, these studies utilized the data sets from DFT calculations, primarily emphasized electronic descriptors, and did not take the stability of MOFs into account.
To bridge the gap, in this study, we curated an experimental data set of MOFs as photocatalysts for water splitting, featurized the MOFs with multi-level descriptors, developed ML models for photocatalytic performance prediction and, finally, incorporated water stability determination. This approach enables high-throughput screening of MOFs with both high photocatalytic performance and practical water stability, thus accelerating the discovery of MOFs for H2 production. Fig. 1 illustrates the overall workflow. Specifically, a data set with 92 MOFs was curated with experimentally measured H2 evolution rates. The MOFs were featurized using a combination of geometric descriptors, building units (metals and linkers), and atomic property-weighted radial distribution functions (AP-RDFs). Different ML classifiers were trained and interpreted through feature importance analysis. To evaluate transferability, the classifiers were tested on 12 recent data points of 10 MOFs, including both unseen structures and new operating conditions. Next, the best classifier was applied to predict the photocatalytic performance of 9603 MOFs from the CoRE MOF database, identifying a set of top-performing MOFs. Retrospective comparison with recent experimental data confirmed the predictive accuracy. Finally, water stability of top-performing MOFs was evaluated via a recently developed ML model, resulting in the identification of MOFs with both high potential in HER and robust water tolerance.
Fig. 1 Workflow to predict photocatalytic performance of MOFs. (a) Data curation, (b) featurization, (c) model training, and (d) prediction and screening.
Fig. S1a shows the correlation matrix of Pearson, Spearman and Kendall coefficients between geometric descriptors and H2 production rate. The geometric descriptors including the largest cavity diameter (LCD), pore limiting diameter (PLD), largest free path diameter (LFPD), density, volumetric surface area (VSA), gravimetric surface area (GSA), void fraction (VF) and pore volume (PV) were estimated by using Zeo++.22 Among these, VSA and GSA exhibit the most notable positive correlations with H2 production rate. This is further confirmed by the boxplot comparison between low- and high-performing MOFs in Fig. S1b, where high-performing MOFs generally possess larger surface areas. These findings suggest that surface accessibility and availability of active sites play a more significant role than pore diameter or volume alone in governing photocatalytic activity.8 A recent study highlighted that a material with a larger surface area and a higher porosity would tend to exhibit superior catalytic performance,23 which is consistent with our analysis based on the geometric descriptors.
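The three coefficients used in the correlation matrix can be illustrated with a minimal, standard-library sketch. The surface-area and rate values below are hypothetical placeholders, not entries from the curated data set, and the Kendall variant shown is tau-a (no tie correction), which is an assumption about the exact variant used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson coefficient: linear association between two samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman coefficient: Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# Hypothetical gravimetric surface areas and H2 production rates.
gsa = [450.0, 800.0, 1200.0, 1500.0, 2300.0, 3100.0]
h2_rate = [30.0, 25.0, 140.0, 400.0, 900.0, 2600.0]

r_pearson = pearson(gsa, h2_rate)
r_spearman = spearman(gsa, h2_rate)
r_kendall = kendall_tau_a(gsa, h2_rate)
```

The rank-based coefficients respond only to the ordering of the data, which is why they are useful alongside Pearson when rates span several orders of magnitude.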
It is also essential to evaluate structure–performance relationships at the level of building units. Fig. 2a illustrates the probability histogram of metal types in 92 collected MOFs. There is a significant imbalance, especially with Ti, Co and Zr dominating the data set. We observe considerable variation in the H2 production rate versus metal type. As shown in Fig. 2b, lanthanide-based MOFs (e.g., those containing Ho and Yb) exhibit a remarkably high median H2 production rate. This is attributed to the unique 4f electronic structures and large ionic radii of lanthanide metals, thus facilitating favorable photo-induced electron transfer. Although widely used in MOF synthesis, common transition metals such as Zn, Co and Zr are predominantly associated with low H2 production rates. Fig. 2c shows the count of metal types in high- and low-performing MOFs. Metals like Cu and Yb appear more frequently in high-performing MOFs, whereas Zr and Ti, despite their prevalence, primarily exist in low-performing MOFs. To further visualize the diversity of metal types, a periodic table heatmap is plotted in Fig. 2d, highlighting the localized clustering of catalytically active metals in the d-block and f-block regions. These findings underscore the importance of metal selection in the design of photocatalytic MOFs, suggesting that rare-earth and certain transition metals may offer enhanced activity due to their intrinsic electronic and coordination properties. However, the imbalance in metal types suggests a bias in the reported MOFs, with many underexplored metal candidates possibly offering untapped catalytic potential. Future research should therefore target less-explored metals, particularly those observed in high-performing outliers, to diversify the chemical space and improve the generalizability of predictive ML models.
To elucidate the underlying physicochemical factors, we further explored the correlation between key metal properties and photocatalytic performance. As shown in Fig. S2, high-performing MOFs generally contain metals with higher electronegativity and larger ionization energy, implying stronger electron-withdrawing ability and greater oxidation stability in promoting efficient charge separation. Metals with smaller atomic radii also tend to enhance activity due to a more compact coordination environment, whereas molecular weight shows a negligible correlation, confirming that electronic rather than mass-related factors dominate the photocatalytic behavior of MOFs.
Subsequently, we analyzed the relationship between organic linkers and photocatalytic performance. Representative linkers commonly employed in MOF construction were selected, mostly containing carboxylates as coordination sites, along with polar functional groups such as hydroxyl, amine and heterocycles (Fig. S3). These functional groups can enhance interactions with water through hydrogen bonding and facilitate visible-light absorption by modulating the local electronic environment.24,25 We calculated six RDKit-based molecular descriptors (RDKitDP), including molecular weight, partition coefficient (log P), topological polar surface area (TPSA), aromatic ring count, and hydrogen bond (H-bond) donor and acceptor counts, for organic linkers. As evidenced in Fig. S2, high-performing MOFs exhibit larger TPSA and a greater H-bond acceptor count, both of which are positively correlated with Pearson, Spearman and Kendall coefficients. These properties of organic linkers improve surface hydrophilicity and strengthen hydrogen bonding with water, which in turn facilitates proton-coupled electron transfer during HER. Meanwhile, the H-bond donor count also exhibits a moderate positive trend, highlighting the importance of hydrogen bonding in forming a catalytically active microenvironment. Conversely, aromatic ring count and log P display negative correlations with performance, as excessive aromaticity increases linker rigidity and promotes π–π stacking, thus hindering charge mobility and restricting reactant accessibility. Similarly, more hydrophobic linkers (i.e., higher log P) reduce surface wettability, impeding water adsorption and interaction with active sites. Among the top 20 Molecular ACCess System (MACCS) fingerprints most correlated with photocatalytic activity (Fig. S4), several fragments show positive correlations. In particular, alkyl and amine linkages and polar functional groups possess the strongest correlations, suggesting that linker flexibility and hydrogen-bonding capability favor efficient charge transfer and water interaction. In contrast, oxygen-bridged or highly aromatic motifs possess weaker correlations, implying that excessive rigidity or hydrophobicity may limit catalytic efficiency. Overall, high-performing MOFs tend to integrate metal nodes with strong electron-withdrawing ability and organic linkers with balanced polarity, moderate hydrogen-bonding capacity, and limited aromatic rigidity, which collectively enhance charge separation, water adsorption, and photocatalytic turnover.
Different algorithms, including LightGBM, Random Forest (RF), Gradient Boosting (GB), Extremely Randomized Trees (ET), eXtreme Gradient Boosting (XGBoost) and Categorical Boosting (CatBoost) were applied to train ML classifiers. Among them, LightGBM was found to achieve the highest accuracy in terms of receiver operating characteristic (ROC) curves (Fig. S6). Then with the LightGBM classifier, various descriptor sets were examined. As indicated in Table S2, the combination of metal-based descriptors and operating conditions turned out to show the best predictive performance. Therefore, subsequent analyses were conducted using the LightGBM classifier, unless otherwise stated.
To improve classifier performance and eliminate redundant features, recursive feature elimination (RFE) was applied, reducing the feature dimensionality from 768 to 21 (Table S3). Based on the RFE-processed set, the normalized confusion matrix was predicted by the LightGBM classifier. As shown in Fig. 3b, strong classification performance is observed, with a true positive rate of 83% for the high-performing class and a true negative rate of 91% for the low-performing class. The average accuracy (ACC), positive predictive value (PPV), true positive rate (TPR), and F1 score reach 0.88, 0.83, 0.83, and 0.83, respectively (Table S4). The learning curves of the LightGBM classifier from 10 random splits of training/test sets demonstrate that the metric difference is reduced (Fig. 3c and S7), suggesting improved classifier generalizability. In addition, ROC curves with five-fold cross validation are presented in Fig. 3d. The mean area under the ROC curve (i.e., the mean AUC) reaches 0.83, further validating the robustness and predictive reliability of the LightGBM classifier.
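The AUC quoted above has a simple probabilistic reading: it is the fraction of (high-performing, low-performing) pairs in which the classifier assigns the higher score to the high-performing sample. A minimal sketch of that pairwise definition follows; the labels and scores are hypothetical and are not taken from the cross-validation folds.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the high-performing class, 0 for the low-performing class.
    scores: predicted probability of the high-performing class.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions for seven samples.
labels = [1, 1, 1, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.70, 0.65, 0.40, 0.30, 0.10]
auc = roc_auc(labels, scores)
```

In a five-fold setting, this quantity would be computed on each held-out fold and then averaged; the quadratic loop is adequate for a data set of this size.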
It is instructive to quantify the effects of different features and interpret the learned classifier. Fig. 4a displays all the features based on average importance with five-fold cross validation. Among the features, operating conditions such as pH and C_Fermi_Level_aver emerge as the most influential ones, indicating that operating conditions play a dominant role in determining photocatalytic activity. As shown in Fig. 4b, operating conditions, AP-RDFs, linkers and metals cumulatively account for 51.8%, 30.7%, 10.4% and 7.1%, respectively, of the feature importance, suggesting that both operating conditions and structural descriptors jointly govern the prediction of the LightGBM classifier. This observation is further supported by SHapley Additive exPlanations (SHAP) analysis (Fig. S8), which reveals how individual descriptors influence the prediction direction and magnitude. For instance, a high log P negatively shifts the prediction, implying that a hydrophobic linker would reduce photocatalytic activity. Several AP-RDFs capture the radial distributions of atomic-pair reactivity, reflecting how spatial organization and electronic environment affect light harvesting, carrier transport, and exciton migration. Additionally, M_Ionization_Energy (first ionization energy of a metal), a representative metal descriptor, exhibits a moderate level of importance and provides insight into the intrinsic redox behavior of a metal node. A metal with a lower ionization energy is generally more prone to participate in redox-mediated catalytic cycles, such as single-electron transfer (SET) or coordination-driven hydrogen evolution process. This characteristic is also correlated with the ability of metal center to stabilize radical or charged intermediates, thus modulating the energy landscape of photocatalytic reaction. Fig. 4c integrates these insights into a hierarchical Sankey diagram that illustrates the breakdown of permutation feature importance of different feature types. The diagram visually highlights the dominant contributions of operating conditions and AP-RDF descriptors, while metal and linker descriptors provide complementary structural and electronic information. Overall, these results underscore that optimal photocatalysts tend to balance the electronic structure, surface polarity, and catalytic site exposure of MOFs, while avoiding excessive hydrophobicity or overly delocalized electron density. The combination of interpretable features with domain-relevant chemistry provides useful guidance for future rational design of MOFs for photocatalytic H2 production.
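The cumulative shares per feature type follow from summing per-feature importances within each group and normalizing by the total. A minimal sketch is given below; the importance values and the two AP-RDF feature names are hypothetical placeholders, and only the grouping of pH, C_Fermi_Level_aver, L_MolLogP and M_Ionization_Energy follows the text.

```python
from collections import defaultdict

# Hypothetical per-feature importances (arbitrary units), not the values in Fig. 4a.
feature_importance = {
    "pH": 0.30,
    "C_Fermi_Level_aver": 0.20,
    "APRDF_example_1": 0.18,   # hypothetical AP-RDF feature name
    "APRDF_example_2": 0.12,   # hypothetical AP-RDF feature name
    "L_MolLogP": 0.12,
    "M_Ionization_Energy": 0.08,
}

# Explicit feature-to-group mapping.
feature_group = {
    "pH": "operating conditions",
    "C_Fermi_Level_aver": "operating conditions",
    "APRDF_example_1": "AP-RDFs",
    "APRDF_example_2": "AP-RDFs",
    "L_MolLogP": "linkers",
    "M_Ionization_Energy": "metals",
}

def group_shares(importance, groups):
    """Return each group's share of the total importance, in percent."""
    total = sum(importance.values())
    summed = defaultdict(float)
    for name, value in importance.items():
        summed[groups[name]] += value
    return {group: 100.0 * value / total for group, value in summed.items()}

shares = group_shares(feature_importance, feature_group)
```

Because the shares are normalized by the total, they sum to 100% regardless of the scale of the underlying importance measure.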
The partial-dependence plots of H2 production rate on key descriptors are illustrated in Fig. S9. The rate increases sharply at a low C_Fermi_Level_aver or pH (Fig. S9a and b), suggesting strong enhancement of photocatalytic activity under acidic conditions and at a low Fermi level. Physicochemically, these descriptors promote water activation and charge separation at the MOF/solution interface. Similarly, Catalyst_content and Sa_Dipole_moment (i.e., the dipole moment of sacrificial agent) display strong positive effects (Fig. S9c and d), implying that increasing the density of active centers and the polarity of sacrificial agent facilitates reaction turnover. Additionally, there exists a distinct performance drop when L_MolLogP (log P of organic linker) exceeds ∼1.0 (Fig. S9e), indicating that an overly hydrophobic linker may impede water accessibility and interfacial proton transfer. Relatively, MACCSFP136 has a weaker effect on photocatalytic activity (Fig. S9f).
| Accuracy | Precision | TPR | F1 score | AUC |
|---|---|---|---|---|
| 0.83 | 1.00 | 0.78 | 0.88 | 0.85 |
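The threshold-based entries in the table above are linked through the confusion-matrix counts. As a worked check, the sketch below uses hypothetical counts (7 true positives, 0 false positives, 2 false negatives, 3 true negatives over 12 predictions); the underlying confusion matrix is not given here, so these counts are an assumption chosen only because they yield values close to the tabulated accuracy, precision, TPR and F1 score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fp + fn + tn),   # accuracy
        "PPV": tp / (tp + fp),                    # precision
        "TPR": tp / (tp + fn),                    # recall / sensitivity
        "TNR": tn / (tn + fp),                    # specificity
        "F1": 2 * tp / (2 * tp + fp + fn),        # harmonic mean of PPV and TPR
    }

# Hypothetical counts, not taken from the source.
metrics = classification_metrics(tp=7, fp=0, fn=2, tn=3)
```

With these counts the accuracy is 10/12, the precision is exactly 1, the TPR is 7/9 and the F1 score is 0.875, which illustrates how a perfect precision can coexist with a lower recall when the classifier is conservative about the high-performing class. The function does not guard against empty classes, so it assumes at least one positive and one negative prediction.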
(I) Starting from 11 660 structures in the CoRE-MOF database, we performed structural and descriptor-based filtering, resulting in a reduced set of 9603 MOFs. (II) The LightGBM classifier was employed to predict the photocatalytic performance of these MOFs and top-performing ones were shortlisted. (III) A recently developed water stability classifier for MOFs26 was subsequently utilized to identify stable structures among the top-performing MOFs. As summarized in Fig. 6, the number of MOFs was progressively reduced to 82.4%, 14.9%, and 3.6% of the initial set after steps I, II and III, respectively, ultimately yielding 419 water-stable top-performing MOFs. Notably, in step II, the share of retained MOFs dropped significantly from 82.4% to 14.9%, underscoring the strong capability of the LightGBM classifier in efficiently narrowing down the candidate space. Among the 419 MOFs, Table 2 lists the top five structures, which possess the highest priority for experimental exploration. To assess the robustness of screening, the consistency of top-performing MOFs was examined across different quantile thresholds (Table S7). For the top 5 MOFs, the overlap rate reaches as high as 0.80. Such a high consistency underscores that the top MOFs identified by the LightGBM classifier remain reliable regardless of the fine-tuning of performance boundaries, thereby ensuring the credibility of screened MOFs.
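The funnel statistics can be reproduced from the four structure counts given in the text. The sketch below computes both the share of the initial database remaining after each step and the step-to-step retention; the stage labels are shorthand introduced here for illustration.

```python
# Structure counts quoted in the text; the labels are illustrative shorthand.
stages = [
    ("initial CoRE-MOF set", 11660),
    ("step I: structural and descriptor-based filtering", 9603),
    ("step II: predicted photocatalytically active", 1731),
    ("step III: predicted water stable", 419),
]

initial_count = stages[0][1]

# Share of the initial database remaining after each step, in percent.
percent_of_initial = [100.0 * count / initial_count for _, count in stages[1:]]

# Fraction carried over from the previous stage, in percent.
step_retention = [
    100.0 * stages[i][1] / stages[i - 1][1] for i in range(1, len(stages))
]

for (label, count), share in zip(stages[1:], percent_of_initial):
    print(f"{label}: {count} structures, {share:.2f}% of initial set")
```

Direct division gives roughly 82.4%, 14.8% and 3.6% of the initial set, in line with the quoted shares up to rounding. The step-to-step view shows that the activity classifier keeps about 18% of the filtered structures and the water stability classifier keeps about 24% of those.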
Fig. 6 Statistics in step (I–III) of the screening workflow. The blue and gray regions denote the percentages of shortlisted and discarded MOFs, respectively.
To further evaluate the practical applicability of our ML classifier, we performed a retrospective validation by cross-referencing the predicted top MOFs with recent experimental data. As listed in Table 3, several MOFs including HAKSEU and HAKSOE were reported to exhibit excellent H2 evolution activity under visible light, with production rates exceeding 1000 µmol g−1 h−1.27 Although the operating conditions are different from those used in our predictions, the high predictive probability (0.92–0.93) highlights the robustness and generalizability of the LightGBM classifier across different structure-condition combinations. Notably, these MOFs also appear in our out-of-sample validation set but under different operating conditions. As the LightGBM classifier incorporates both structural and operating conditions as input features, the corresponding predictions reflect distinct structure-condition pairs and thus do not compromise the independence of validation process.
| MOF | Predictive probability | Experimental H2 rate (µmol g−1 h−1)27 |
|---|---|---|
| HAKSEU | 0.93 | 1005.40 |
| HAKSOE | 0.92 | 1478.75 |
Subsequently, we performed a comprehensive structural analysis of the 419 top-performing MOFs. As shown in Fig. 7a, the metal nodes in these MOFs are dominated by transition metals, such as Zn, Cd, Co and Cu. Fig. 7b displays the key geometric descriptors in these MOFs. Compared to the initial screened data set of 9603 MOFs, the top-performing MOFs tend to possess moderate pore sizes (LCD, PLD, and LFPD primarily in the range of 10–20 Å), higher void fractions, and larger accessible surface areas (VSA and GSA). These structural characteristics are favorable for enhancing mass transport, increasing the accessibility of catalytic sites, and creating a microenvironment conducive to photocatalytic reactions. For the organic linkers, representative structures shown in Fig. 7c reveal the prevalence of aromatic backbones (e.g., phenyl and pyridyl groups) and polar functional groups (e.g., carboxylates, amides and ethers). These functionalities not only facilitate stable coordination with metal nodes, but also enhance water adsorption and charge transfer by tuning the electronic and polar properties of MOFs. Additionally, the t-distributed stochastic neighbor embedding (t-SNE)28 map in Fig. S11 illustrates the distribution of top-performing MOFs in the overall feature space. While the top MOFs are broadly distributed, they exhibit discernible clustering patterns, suggesting that they share critical structural motifs despite their diversity. This reflects the ability of the LightGBM classifier to capture relevant features within a complex chemical landscape. The hierarchical screening significantly improves the efficiency of candidate selection while preserving structural diversity and chemical plausibility.
After structural and descriptor-based filtering, the CoRE-MOF set is reduced from 11 660 to 9603 structures. Subsequently, 1731 structures are identified to be photocatalytically active, of which 419 are predicted to be water stable. Retrospective validation against experiments further supports the accuracy of our ML model. Overall, this study highlights the practical utilization of ML in discovering high-performing photocatalytic MOFs and provides a solid data-driven foundation for interpreting structure–performance relationships, and it would be insightful to guide the future rational design of MOFs for efficient photocatalytic H2 production.
It is worthwhile to note that the encoded structural and chemical descriptors in our ML model, such as metal node properties, linker polarity and pore architecture, may also hold relevance beyond photocatalysis, e.g., for electrocatalytic H2 production. However, inherent differences between photocatalytic and electrocatalytic reactions, including the presence of electrolytes, applied potentials and electrode interfaces, could affect model performance and limit direct transferability. Therefore, while our model may serve as a preliminary predictive tool for electrocatalytic H2 production, experimental validation and model refinement under relevant reaction conditions are essential to ensure predictive accuracy.
The CoRE-MOF database contains 11 660 computation-ready CIF files. To ensure data quality, we first performed curation and structural filtering to remove entries with missing atoms or unreasonable geometries. Given that most of the 92 collected MOFs were sourced from the CoRE 2019 database and the CSD database, duplicate entries were identified and excluded. Finally, 9603 structures (∼82%) were retained, featurized using the same procedure described above, and used for predictions. Although our collected data set accounts for only a small proportion of the 9603 CoRE MOFs, it samples the chemical space of CoRE MOFs fairly well, as shown in the t-SNE map (Fig. S14). All the descriptors were normalized and standardized by using the same parameters used to train ML classifiers. To ensure consistency, a standardized set of experimental operating conditions was applied, including representative values for catalyst concentration, cocatalyst loading, temperature, pH, and other key parameters (Table S12). By using these values, we aimed to ensure consistent and interpretable inputs when applying the classifier for new predictions.
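Reusing the training-set scaling parameters for the screening set can be sketched as follows. This is a minimal illustration assuming z-score standardization with the population standard deviation; the exact scaler used in the study is not specified here, and the descriptor values are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    return [(mean(column), pstdev(column)) for column in columns]

def transform(rows, params):
    """Apply previously fitted (mean, std) pairs; constant columns map to 0."""
    return [
        [(value - m) / s if s > 0 else 0.0 for value, (m, s) in zip(row, params)]
        for row in rows
    ]

# Hypothetical descriptor rows: [pH, gravimetric surface area].
training_rows = [[5.0, 1000.0], [7.0, 2000.0], [9.0, 3000.0]]
screening_rows = [[7.0, 2500.0]]

# Parameters are fitted on the training data only and then reused unchanged.
scaler_params = fit_scaler(training_rows)
scaled_training = transform(training_rows, scaler_params)
scaled_screening = transform(screening_rows, scaler_params)
```

Fitting the parameters on the training data alone keeps the screening inputs on the same scale the classifier saw during training, so a descriptor value means the same thing in both sets.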
Supplementary information (SI): structure-performance relationships; ML classifier performance; predictions for CoRE-MOFs; featurization. See DOI: https://doi.org/10.1039/d5sc08277c.
| This journal is © The Royal Society of Chemistry 2026 |