Nour Alkhatib†ab, Saifeldeen Abed Alrhman†c and Serdal Kirmizialtin*abd
aChemistry Program, Science Division, New York University Abu Dhabi, Abu Dhabi 129188, United Arab Emirates
bDepartment of Chemistry, New York University, New York, NY 10003, USA. E-mail: serdal@nyu.edu
cDepartment of Chemical and Biomolecular Engineering, New York University Tandon School of Engineering, New York, USA
dCenter for Smart Engineering Materials, New York University Abu Dhabi, Abu Dhabi 129188, United Arab Emirates
First published on 7th October 2025
Atmospheric water harvesting (AWH) using metal–organic frameworks (MOFs) offers a promising route to address freshwater scarcity in arid and off-grid environments. Yet, the structural and chemical factors that govern MOF performance remain insufficiently understood. Here, we combine high-throughput Grand Canonical Monte Carlo (GCMC) simulations with interpretable machine learning to study the structure–property relationships driving water uptake in MOFs. A chemically and structurally diverse set of 2600 frameworks was selected from the ARC-MOF database, and water uptake capacities were computed at 100% and 30% relative humidity. Among several regression models, Light Gradient Boosting Machine (LGBM) achieved the highest predictive accuracy. SHapley Additive exPlanations (SHAP) and correlation analyses identified adsorption energetics, local electrostatics (oxygen and hydrogen partial charges, metal electronegativity), and framework density as the dominant factors, with geometry acting as a secondary modulator. To provide an explicit analytical form for rapid screening and hypothesis generation, we constructed a second-order polynomial regression model using the top SHAP-ranked features. These results advance the fundamental understanding of water adsorption in MOFs and establish a scalable, data-driven framework for the rational design of high-performance materials for AWH.
Design, System, Application
This work aims to establish a molecular design optimization strategy for atmospheric water harvesting using metal–organic frameworks (MOFs). The primary goal is to enhance the water uptake capacity of MOFs; however, the features responsible for higher performance remain unknown. In this study, we developed a robust methodology to identify the key physical and chemical features contributing to high uptake performance. This approach enables high-throughput screening and targeted functionalization of MOFs, facilitating the discovery and development of more efficient materials for atmospheric water harvesting.
Metal–Organic Frameworks (MOFs) have emerged as promising candidates for sorbent-based AWH due to their high surface area, tunable porosity, and customizable surface chemistry.10–15 These features enable water capture at low relative humidity and regeneration at low temperatures, often using ambient heat or sunlight.16–18
Several MOFs have demonstrated outstanding AWH performance. For instance, MOF-303 exhibits a hydrophilic–hydrophobic pore environment and reaches a water uptake of 0.7 L kg−1 per day under desert conditions.19,20 MOF-333 achieves sharp uptake at 22% RH through moderate hydrogen bonding at μ2-OH sites, while balancing hydrophilic and hydrophobic domains.19 Derivatives such as MOF-LA2-1 enhance uptake by extending linker length, increasing pore volume without compromising binding affinity.21 Other innovations, such as the Mix-MOF hybrid,22 Co-MOF-31,23 and Cr-Spiro-5 (ref. 24), further illustrate how pore architecture, metal accessibility, linker chirality, and local polarity can be tuned to optimize performance and achieve record working capacity.
These examples demonstrate the design potential of MOFs, but they also reveal a broader challenge: the structure–function relationships that govern water uptake are complex and multifactorial. With an expansive design space spanning metal centers, organic linkers, functional groups, and topologies,25 it remains difficult to determine which features most strongly influence adsorption.
Recent studies have addressed this gap using machine learning (ML). Li et al.26 predicted uptake using six descriptors focused on pore geometry and energetics, identifying isosteric heat as most influential. Zhang et al.27 expanded the feature set to include metal identity and structural factors, but were limited by small datasets and coarse chemical resolution. While these models demonstrated predictive power, they lacked sufficient treatment of chemical features such as linker polarity, functional group charge, and local electrostatics, factors known to influence water cluster formation within confined pore environments.
In this study, we present a novel ML framework to investigate the structure–property relationships for water uptake in MOFs. For our model, we selected 2600 chemically and structurally diverse frameworks from the ARC-MOF database.28 Grand Canonical Monte Carlo (GCMC) simulations were performed to generate synthetic data at 100% and 30% relative humidities, mimicking saturated and low-humidity conditions. A comprehensive set of geometric, chemical, and energetic descriptors was extracted for those MOFs. By explicitly combining chemical and geometric features with interpretable machine learning through SHAP analysis, we identify factors controlling water uptake.
000 MOF architectures, ensuring broad chemical and structural diversity through cluster-based sampling.28 For each MOF, a wide range of features capturing both structural and chemical properties was selected. Geometric descriptors include metrics such as surface area and void fraction. For chemical characteristics, we include atomic property-weighted radial distribution functions (AP-RDF) to encode spatial correlations between atoms, revised autocorrelation (RAC) functions to capture local chemical environments and functional group chemistry, and the MAGPIE framework to describe elemental statistics relevant to hydrophilicity, electronegativity, and bonding character.28,29 Next, we conducted water adsorption simulations for each MOF using Grand Canonical Monte Carlo (GCMC) simulations under fixed temperature and pressure conditions. The simulated uptake capacities served as target values for supervised learning. We trained multiple regression models, such as random forest and gradient boosting, and selected the best-performing model based on cross-validation metrics and prediction accuracy. To interpret the model's behavior and identify the key factors driving adsorption, we applied SHAP (SHapley Additive exPlanations) analysis. SHAP allowed us to distinguish the most impactful geometric and chemical features governing water uptake. Below, we provide a detailed description of our pipeline.
000 MOF structures, including around 12 310 experimentally reported MOFs, with the remainder being hypothetical structures.28 In addition, the dataset covers over 1000 features encompassing structural, chemical, and topological descriptors.28 Structural features include key parameters such as largest cavity diameter (LCD), pore limiting diameter (PLD), surface area, and void fraction, which are critical for characterizing the porosity, accessibility, and storage potential of MOFs.28,30,31
In addition to the structural descriptors, the database includes chemical descriptors relevant to the host–guest interactions.32 In particular, the dataset includes atomic property-weighted radial distribution function (AP-RDF) descriptors and revised autocorrelation (RAC) descriptors.28 AP-RDF descriptors incorporate atomic properties such as electronegativity, atomic hardness, van der Waals volume, dipole polarizability, and atomic mass. RAC descriptors are difference-based measures computed over a chemical graph and account for four primary chemical environments: metal-center chemistry, ligand chemistry, functional group chemistry, and linker connectivity.28 Here, a functional group is defined as any heteroatom (neither carbon nor hydrogen) that is part of the linker but not part of the metal coordination. The combined chemical and geometric descriptors enable a comprehensive description of the MOF dataset.
In addition, we generated composition-based features using the Materials Agnostic Platform for Informatics and Exploration (MAGPIE) framework.29 The MAGPIE descriptors encode chemically meaningful statistics derived from the elemental composition for materials informatics applications. We also constructed a stoichiometric descriptor set consisting of features representing the relative atomic fractions of each element within a MOF structure. Additionally, we include the maximum, minimum, and average atomic charges across each framework.
In addition to these features, we incorporated descriptors designed to capture local water–MOF interactions. Specifically, we introduced a solvent-accessible surface area (SASA)-weighted charge descriptor:
![]() | (1) |
533 MOFs from the full database. Rather than selecting one representative per cluster, we sampled representatives in proportion to each cluster's size to avoid bias. The MOFs from each cluster were selected at random. We then generated the features for each selected MOF.
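The size-proportional sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `proportional_sample`, the cluster labels, and the largest-remainder rounding rule are assumptions introduced here for the example.

```python
import random

def proportional_sample(clusters, n_total, seed=0):
    """Pick n_total items, giving each cluster a share proportional to its size.

    clusters: dict mapping a cluster id to a list of member ids (hypothetical layout).
    Rounding uses the largest-remainder rule, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in clusters.values())
    # Ideal fractional share of the sample for every cluster.
    shares = {cid: n_total * len(m) / total for cid, m in clusters.items()}
    quota = {cid: int(share) for cid, share in shares.items()}
    # Distribute the draws lost to rounding down, largest fractional part first.
    leftover = n_total - sum(quota.values())
    by_remainder = sorted(shares, key=lambda cid: shares[cid] - quota[cid], reverse=True)
    for cid in by_remainder[:leftover]:
        quota[cid] += 1
    # Members within each cluster are drawn uniformly at random.
    return {cid: rng.sample(clusters[cid], quota[cid]) for cid in clusters}

if __name__ == "__main__":
    toy = {"A": list(range(60)), "B": list(range(100, 130)), "C": list(range(200, 210))}
    picked = proportional_sample(toy, 10)
    print({cid: len(members) for cid, members in picked.items()})
```

Compared with taking one representative per cluster, this allocation keeps densely populated regions of the descriptor space represented in proportion to how common they are in the database.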
Before model training, we examined the distribution of features and quantified their skewness using the measure:
![]() | (2) |
the sample mean, n the number of observations in the sample, and σ the sample standard deviation. Features with |s| ≥ 1 were classified as skewed and subjected to logarithmic transformation to avoid bias in the learning algorithm. Following this step, we applied min–max scaling to each feature.
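A minimal sketch of this preprocessing step is given below. Several details are assumptions made for illustration only: the skewness is computed as the third standardized moment (the exact estimator in eqn (2) may differ), the logarithm is applied as log(1 + x) and therefore presumes non-negative feature values, and the function names are hypothetical.

```python
import math

def skewness(values):
    """Third standardized moment (population form); an assumed stand-in for eqn (2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]

def preprocess(values, threshold=1.0):
    """Log-transform a feature when |skewness| >= threshold, then min-max scale it."""
    if abs(skewness(values)) >= threshold:
        # log1p assumes non-negative inputs; this choice is part of the sketch.
        values = [math.log1p(v) for v in values]
    return min_max(values)

if __name__ == "__main__":
    feature = [1.0, 1.0, 1.0, 1.0, 100.0]
    print(skewness(feature), preprocess(feature))
```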
A supercell of at least 2 × 2 × 2 unit cells was generated for each structure. In cases where the primitive cell was too small, larger supercells were constructed to ensure that the simulation box exceeded twice the Lennard-Jones cutoff distance of 12.0 Å. van der Waals interactions between guest and host atoms were determined using the Lorentz–Berthelot mixing rules.
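The two rules in this paragraph can be illustrated with a short sketch. It assumes that "box size" refers to the perpendicular width of the cell along each lattice direction (which equals the edge length for orthogonal cells) and that the minimum replication is 2 per axis; the function names and the example Lennard-Jones parameters are hypothetical.

```python
import math

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def supercell_replicas(a, b, c, cutoff=12.0, minimum=2):
    """Replications per axis so every perpendicular width is at least 2 * cutoff."""
    volume = abs(dot(a, cross(b, c)))
    # Perpendicular width along each lattice vector = volume / area of the opposite face.
    widths = (volume / norm(cross(b, c)),
              volume / norm(cross(c, a)),
              volume / norm(cross(a, b)))
    return tuple(max(minimum, math.ceil(2.0 * cutoff / w)) for w in widths)

def lorentz_berthelot(sigma_i, eps_i, sigma_j, eps_j):
    """Arithmetic mean for sigma (Lorentz), geometric mean for epsilon (Berthelot)."""
    return 0.5 * (sigma_i + sigma_j), math.sqrt(eps_i * eps_j)

if __name__ == "__main__":
    # A 10 Angstrom cubic cell needs 3 replicas per axis for a 12 Angstrom cutoff.
    print(supercell_replicas((10, 0, 0), (0, 10, 0), (0, 0, 10)))
    # Illustrative (made-up) host and guest parameters.
    print(lorentz_berthelot(3.0, 0.1, 4.0, 0.4))
```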
To determine topological properties, we computed the helium void fraction using GCMC simulations. Water uptake capacities were calculated at 298 K and water vapor pressures of 699.8 Pa and 209.4 Pa, corresponding to 100% and 30% relative humidity (RH), respectively. Each GCMC simulation began with an initialization stage of 5000 cycles, followed by an equilibration and production phase of up to 1 000 000 cycles. Convergence was carefully monitored, and simulations were considered equilibrated when the average water loading stabilized within a 5% error margin, consistent with our earlier work.20 The resulting loading values were then used as labels for subsequent machine learning modeling and structure–function relationship analysis.
Multiple regression models were explored using the scikit-learn library, including Random Forest Regressor, Gradient Boosting Model, and Extra Trees Regressor, among others.38 Each model was optimized through hyperparameter tuning to achieve reliable predictive accuracy with minimal deviation from the GCMC calculated uptake values.
The distribution of simulated uptake capacities generated from GCMC is shown in Fig. 2, where water uptake was studied at two relative humidities. At 100% RH (Fig. 2A), uptake values are broadly distributed, with several frameworks exceeding 800 mg g−1, reflecting their ability to store large amounts of water under saturated conditions. At 30% RH (Fig. 2B), uptake values are lower and concentrated below 200 mg g−1, indicating reduced adsorption. The correlation between uptake at 100% and 30% RH (Fig. 2C) reveals that, although some high-capacity materials perform well at both high and low humidity, the overall correlation is weak, suggesting the need for a separate ML model for each condition.
The ML models were trained with the GCMC data as labels and features detailed in the Methods section. The dataset was used to train and evaluate regression models both with and without energetic descriptors. The best-performing models for each case are presented in Fig. 3, while results for additional models are provided in Fig. S1–S4.
At 100% RH, the inclusion of energetic descriptors led to the Light Gradient Boosting Machine (LGBM) achieving the highest predictive accuracy (Fig. 3A). The model yielded R2 values of 0.967 (train) and 0.848 (test), with MAE values of 17.97 (train) and 50.59 (test), and RMSE values of 41.90 (train) and 93.62 (test). In contrast, when energetic descriptors were excluded, the Multilayer Perceptron (MLP) was the best model (Fig. 3B), but with reduced performance: R2 values of 0.897 (train) and 0.720 (test), MAE values of 35.98 (train) and 65.13 (test), and RMSE values of 74.20 (train) and 126.01 (test).
A similar trend was observed at 30% RH. With energetic descriptors, the LGBM provided the best predictions (Fig. 3C), achieving R2 values of 0.997 (train) and 0.791 (test), with MAE values of 1.37 (train) and 10.65 (test), and RMSE values of 2.24 (train) and 20.79 (test). When energetic descriptors were excluded, the LGBM remained the best model but with lower accuracy (Fig. 3D), producing R2 values of 0.953 (train) and 0.679 (test), MAE values of 4.14 (train) and 12.39 (test), and RMSE values of 9.71 (train) and 25.74 (test). Additional models, including Extreme Gradient Boosting, Random Forest, Gradient Boosting Regressor, and Extra Trees Regressor, are presented in Fig. S1–S4 under the same conditions. These results confirm that energetic descriptors consistently enhance predictive power across the two humidity conditions, with the improvement being especially pronounced at 100% RH.
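For reference, the three metrics quoted above follow their standard definitions, sketched here in plain Python. The function name and the toy numbers are illustrative only and are not taken from the study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return the coefficient of determination, mean absolute error, and RMSE."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }

if __name__ == "__main__":
    simulated = [100.0, 200.0, 300.0, 400.0]   # toy "GCMC" labels
    predicted = [110.0, 190.0, 310.0, 390.0]   # toy model predictions
    print(regression_metrics(simulated, predicted))
```

A large gap between train and test values of these metrics, as seen for several of the models above, is the usual signature of overfitting.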
To interpret the predictions of the machine learning models and identify the key structural and chemical features of water uptake, we employed SHAP analysis combined with correlation analysis (Fig. 4). This approach highlights which descriptors most strongly influence model predictions and how they correlate with water uptake under different humidity conditions and descriptor sets.
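To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy model by brute force over all feature orderings. This is not the algorithm used by the SHAP library for tree ensembles, and it replaces "absent" features with a single baseline vector rather than a background dataset; both simplifications, as well as the function name and the toy model, are assumptions for illustration.

```python
from itertools import permutations

def exact_shapley(model, x, baseline):
    """Average marginal contribution of each feature over all orderings.

    model: callable taking a list of feature values and returning a number.
    x: the instance to explain; baseline: values standing in for absent features.
    Cost grows factorially with the number of features, so this is only a teaching aid.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]            # switch feature i from baseline to its actual value
            value = model(current)
            phi[i] += value - previous   # marginal contribution in this ordering
            previous = value
    return [p / len(orderings) for p in phi]

if __name__ == "__main__":
    # Toy model: one independent feature and one pairwise interaction.
    toy = lambda v: 2.0 * v[0] + 3.0 * v[1] * v[2]
    print(exact_shapley(toy, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split equally between the two features that share it.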
At 100% RH with energetic descriptors (Fig. 4A), the most important features are the Henry coefficient, framework density, and the most negative oxygen charge. Additional descriptors such as the oxygen-to-metal ratio and the oxygen charge–weighted SASA ratio further emphasize the role of charge localization and accessible surface area in saturated adsorption.
When energetic descriptors are excluded (Fig. 4B), the top descriptors are accessible void fraction, linker electronegativity, and accessible surface area, which capture pore volume and polarity. Other high-ranking descriptors, including linker polarizability and donor–donor distance, highlight the role of framework chemistry and topology when energetic terms are absent.
At 30% RH with energetic descriptors (Fig. 4C), the Henry coefficient again dominates, followed by the most negative oxygen charge and the hydrogen charge–weighted SASA ratio. Their positive correlations with uptake indicate that localized electrostatics and charge distributions strongly influence adsorption under unsaturated conditions. Additional descriptors such as metal center electronegativity and oxygen content reflect the importance of the framework's electronic environment at lower humidity.
In contrast, at 30% RH without energetic descriptors (Fig. 4D), the model depends primarily on chemical descriptors. Interestingly, no geometric features appear among the top-ranked descriptors. The most important features are metal center electronegativity, linker group nuclear charge, and the most negative oxygen charge. The framework volume contributes only secondarily to uptake under low-humidity conditions.
The correlation heatmaps (Fig. 4E–H) complement the SHAP results by quantifying direct relationships among descriptors and uptake. At both humidities, energetic descriptors (Henry coefficient) display the strongest correlations with uptake, whereas purely geometric or chemical descriptors show weaker or more variable relationships. These patterns reinforce the dominant role of adsorption energetics, supported by chemical functionality as the primary enablers of water harvesting in MOFs.
Furthermore, to visualize how individual features influence water uptake, we plotted the top-ranked descriptors against uptake capacity at 100% and 30% RH when energetic descriptors are included (Fig. 5). At 100% RH (Fig. 5A–D), the Henry coefficient shows a strong correlation with uptake for MOFs with water uptake >500 mg g−1: the lower the Henry coefficient, the higher the uptake. For low-uptake MOFs, we observe no correlation between uptake capacity and Henry coefficient (Fig. 5A). Framework density gives rise to a nonmonotonic dependence of water uptake (Fig. 5B); materials with densities of 0.8–1 g cm−3 are the best-performing MOFs. The most negative oxygen charge also shows a distinct optimal region (Fig. 5C), indicating that a framework oxygen charge of around −0.7 to −0.9 is optimal for water adsorption. Interestingly, this range matches the charge of the water oxygen, suggesting that the more water-like the framework, the better its uptake capacity at high humidity. Lastly, the near-metal neighbor size shows an inverse relationship with capacity (Fig. 5D): the smaller the neighboring atom size, the higher the uptake, suggesting that metal site size plays a role by controlling the distance at which water molecules approach the linkers.
At 30% RH (Fig. 5E–H), similar trends are observed, although chemical descriptors play a more pronounced role. The Henry coefficient continues to differentiate high- and low-uptake MOFs (Fig. 5E), but the most negative oxygen charge emerges as particularly influential (Fig. 5F), reinforcing the importance of charge localization for adsorption at lower humidity. The hydrogen charge–weighted SASA ratio exhibits an optimal range where uptake is maximized (Fig. 5G). This descriptor captures the interplay between accessible surface area and favorable hydrogen bonding environments, suggesting that frameworks with balanced electrostatics and geometry are most effective. Finally, metal center electronegativity also shows a narrow window for best performance (Fig. 5H), indicating that a balance between the electronegativity of the metal nodes and the water interaction is needed to promote water clustering at low humidity.
Finally, to quantify the influence of key structural and chemical features on water uptake, we constructed an interpretable polynomial regression model using the top five SHAP-ranked descriptors: X0 (Henry coefficient), X1 (framework density), X2 (most negative oxygen charge), X3 (near-metal neighbor size), and X4 (oxygen-to-metal ratio). A second-order polynomial expansion was employed to capture both nonlinear dependencies and pairwise interactions among these variables. The resulting analytical expression, presented in eqn (3), offers a closed-form mapping from the descriptor space to predicted uptake capacity, enabling transparent interpretation of feature contributions.
$$
\begin{aligned}
y ={}& 1519.83 - 17749.21X_0 + 2216.44X_1 - 1684.98X_2 - 5894.77X_3 + 1329.73X_4 \\
&+ 44326.62X_0^2 - 47304.39X_0X_1 + 23780.40X_0X_2 + 39182.92X_0X_3 - 23218.41X_0X_4 \\
&- 1984.60X_1^2 - 436.80X_1X_2 - 1559.90X_1X_3 + 1077.38X_1X_4 \\
&- 993.78X_2^2 + 5939.34X_2X_3 - 363.51X_2X_4 \\
&+ 3874.08X_3^2 - 1479.58X_3X_4 - 666.10X_4^2
\end{aligned}
\tag{3}
$$
The polynomial model was trained on the same dataset at 100% RH with energetic descriptors and evaluated using a parity plot comparing predicted and actual water uptake values (Fig. 6). The model achieves an R2 of 0.58, which is lower than that of the LGBM model (R2 = 0.85). Although this reduced-order model does not match the predictive accuracy of the LGBM model, it serves as a transparent and interpretable surrogate. Its simplicity makes it especially useful for rapid screening, hypothesis generation, and informing the rational design of new frameworks. Similar training at low humidity yielded a correlation score of only 0.3, so that model was not considered reliable and is not included.
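Because eqn (3) is a closed-form expression, it can be evaluated directly, as in the sketch below. The coefficients are copied from eqn (3); the function name is hypothetical, and the assumption that the inputs X0 to X4 are supplied in the same preprocessed (transformed and scaled) form used for training is ours, since the text does not state the input units explicitly.

```python
INTERCEPT = 1519.83

# First-order coefficients for X0..X4 (Henry coefficient, framework density,
# most negative oxygen charge, near-metal neighbor size, oxygen-to-metal ratio).
LINEAR = [-17749.21, 2216.44, -1684.98, -5894.77, 1329.73]

# Second-order coefficients: QUADRATIC[(i, j)] multiplies X_i * X_j with i <= j.
QUADRATIC = {
    (0, 0): 44326.62, (0, 1): -47304.39, (0, 2): 23780.40, (0, 3): 39182.92, (0, 4): -23218.41,
    (1, 1): -1984.60, (1, 2): -436.80, (1, 3): -1559.90, (1, 4): 1077.38,
    (2, 2): -993.78, (2, 3): 5939.34, (2, 4): -363.51,
    (3, 3): 3874.08, (3, 4): -1479.58,
    (4, 4): -666.10,
}

def predict_uptake(x):
    """Evaluate the second-order polynomial of eqn (3) for descriptors x = [X0, ..., X4]."""
    if len(x) != 5:
        raise ValueError("expected five descriptor values X0..X4")
    y = INTERCEPT
    y += sum(coef * xi for coef, xi in zip(LINEAR, x))
    y += sum(coef * x[i] * x[j] for (i, j), coef in QUADRATIC.items())
    return y

if __name__ == "__main__":
    # Assumed to be preprocessed descriptor values; the numbers are illustrative only.
    print(predict_uptake([0.0, 1.0, 0.0, 0.0, 0.0]))
```

Given the reported R2 of 0.58, such an evaluation is best read as a coarse ranking aid rather than a quantitative prediction.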
To address this challenge, we developed a data-driven framework that combines Grand Canonical Monte Carlo (GCMC) simulations with interpretable machine learning to analyze structure–function relationships in a chemically and structurally diverse set of 2600 MOFs. The GCMC-derived uptake distributions reveal a pronounced contrast between performance at 100% and 30% relative humidity. While some frameworks achieve capacities exceeding 800 mg g−1 under saturation, adsorption at 30% RH is markedly lower and shows weak correlation with high-humidity performance. These results demonstrate that saturation capacity alone is not a reliable screening metric and underscore the importance of evaluating materials under application-relevant low-humidity conditions.
Machine learning analysis reveals that water uptake capacity is controlled primarily by adsorption energetics and the chemical environment. At both humidity levels, the Light Gradient Boosting Machine (LGBM) performed best when energetic descriptors such as the Henry coefficient were included, yielding a test R2 of 0.85 at 100% RH and 0.79 at 30% RH. Removing these descriptors led to a marked drop in accuracy, showing that geometry alone cannot explain uptake behavior. Instead, capacity arises from the coupled effects of adsorption energetics and chemically driven interactions, with geometry playing a secondary but complementary role. This distinction reflects a broader difference between water and nonpolar gases. While traditional MOF design strategies for gas storage have emphasized geometric optimization, including large pore volume, high surface area, and tailored pore size distributions, such approaches are less effective for polar adsorbates. For nonpolar gases like CO2 and CH4, uptake scales predictably with accessible volume and surface area, as molecules behave as nearly hard spheres that pack into available pore space.40,41 By contrast, water adsorption depends not only on confinement but also on hydrogen bonding, electrostatics, and cooperative cluster growth, making geometric expansion alone insufficient to guarantee higher uptake.
Our SHAP and correlation analyses clarify the multivariate nature of water adsorption. At 100% RH, adsorption energetics (Henry coefficient) dominate, complemented by structural packing (framework density) and charge localization (oxygen partial charges). At 30% RH, electrostatics become even more critical, with oxygen charge, hydrogen charge–weighted SASA ratios, and metal electronegativity shaping performance. Geometric features such as void fraction and accessible surface area act as secondary modulators, refining uptake capacity but rarely dictating performance alone. Henry coefficient and electronegative oxygen atoms correlate positively with uptake, while higher density and larger metal neighbors suppress adsorption. These insights align with recent experimental studies. For example, Hanikel et al.21 showed that MOF-LA2-1 outperforms MOF-303 due to a 40% increase in pore volume and surface area from extended linkers, yet our analysis indicates that such geometric expansion alone cannot explain high capacity. Instead, strong linker electronegativity and balanced electrostatic environments are the critical factors.18 Likewise, recent work on Zn5(OAc)4(TBTT)2 demonstrated that Ni exchange enhanced uptake to 0.98 g g−1 not only through increased porosity but also via open metal sites and greater hydrophilicity, strengthening water–framework interactions.42 Similarly, Ni2Br2BTDD outperformed its chlorinated analogue despite a smaller pore size, highlighting how subtle variations in electronic environment and confinement create liquid-like hydrogen bonding networks that favor adsorption.43 These findings reinforce that water uptake in MOFs arises from the coupled effects of energetics, chemistry, and structure, with structure playing a minor role.
The novelty of this study lies in combining chemically rich descriptors with SHAP interpretation to study the drivers of water uptake in a diverse set of MOFs. Unlike prior ML studies that focused on geometric factors and investigated only the best-performing subset of MOFs, our analysis takes a broader view of the problem. It demonstrates that adsorption capacity is primarily controlled by adsorption energetics and chemically driven interactions, with geometry playing a secondary role. Features such as linker electronegativity, oxygen and hydrogen partial charges, and metal coordination environment consistently emerged as key predictors across humidity conditions, highlighting the central role of local electrostatics in governing hydrogen bonding and cluster nucleation.
To translate these insights into an explicit and interpretable form, we constructed a second-order polynomial regression model using the top SHAP-ranked descriptors. Although its predictive accuracy (R2 = 0.58) is lower than that of the LGBM model, the closed-form expression captures nonlinear dependencies and interaction effects, with relatively large coefficients assigned to terms that couple the Henry coefficient with oxygen charge and with metal neighbor size. These coupled terms align with SHAP-derived trends, confirming that water uptake arises not from isolated features but from their interplay. Despite reduced accuracy, the polynomial model provides a transparent surrogate that enables hypothesis generation, rapid screening, and rational design.
Additionally, we note that neural network models were not included in this study. While such models can achieve strong predictive performance, they generally require larger datasets and more complex architectures than were investigated here. In addition, neural networks are computationally more demanding and provide limited interpretability, as feature attribution methods are less transparent. In contrast, tree-based ensemble methods such as LGBM offer both accuracy on small-to-moderate datasets and straightforward interpretability through SHAP analysis. This combination allowed us to identify structural and chemical factors dictating water uptake, an objective that would have been more challenging to achieve with neural network approaches.
Despite these advances, several limitations should be noted. First, framework flexibility and potential hysteresis in adsorption/desorption were not modeled, which may affect real-world behavior. Second, while our feature set is extensive, it may still omit descriptors related to linker dynamics or long-range interactions. Future work should integrate experimental validation of predicted high-performance candidates and extend the modeling framework to capture kinetic effects and multi-cycle stability.
Our focus in this study was limited to investigating the structure–function relationship for atmospheric water harvesting. The 2600 MOFs we analyzed comprise a small subset of the ARC-MOF database. We did not aim to screen the entire database; therefore, the top performers in our set may not be the best overall. That said, we identified one cluster with water uptake around 1621 mg g−1 and another around 1444 mg g−1. These values are on par with the best-performing MOFs reported for water uptake.17,44 Further research is needed to screen the full database and later to assess stability and costs in order to propose materials suitable for real-world applications.45
In conclusion, this study provides a scalable predictive platform and mechanistic insight into the factors that control water adsorption in MOFs. By combining molecular simulation, advanced feature engineering, and interpretable machine learning, we move beyond surface-area-based heuristics and offer chemically informed design rules. These findings support a shift toward rational, data-driven discovery of MOFs tailored for atmospheric water harvesting, an essential step toward sustainable water access in resource-constrained environments.
Supplementary information: parity plots and performance metrics of additional machine learning models at 30% and 100% relative humidity, with and without energetic descriptors. See DOI: https://doi.org/10.1039/d5me00101c.
Footnote
† Equal contribution.
This journal is © The Royal Society of Chemistry 2026