Machine learning highlights chemistry as the key factor in metal–organic frameworks for atmospheric water harvesting

Nour Alkhatib; Saifeldeen Abed Alrhman; Serdal Kirmizialtin

doi:10.1039/D5ME00101C


DOI: 10.1039/D5ME00101C (Paper) Mol. Syst. Des. Eng., 2026, 11, 50-61


Nour Alkhatib† ^ab, Saifeldeen Abed Alrhman† ^c and Serdal Kirmizialtin *^abd
^aChemistry Program, Science Division, New York University Abu Dhabi, Abu Dhabi 129188, United Arab Emirates
^bDepartment of Chemistry, New York University, New York, NY 10003, USA. E-mail: serdal@nyu.edu
^cDepartment of Chemical and Biomolecular Engineering, New York University Tandon School of Engineering, New York, USA
^dCenter for Smart Engineering Materials, New York University Abu Dhabi, Abu Dhabi 129188, United Arab Emirates

Received 20th June 2025 , Accepted 6th October 2025

First published on 7th October 2025

Abstract

Atmospheric water harvesting (AWH) using metal–organic frameworks (MOFs) offers a promising route to address freshwater scarcity in arid and off-grid environments. Yet, the structural and chemical factors that govern MOF performance remain insufficiently understood. Here, we combine high-throughput Grand Canonical Monte Carlo (GCMC) simulations with interpretable machine learning to study the structure–property relationships driving water uptake in MOFs. A chemically and structurally diverse set of 2600 frameworks was selected from the ARC-MOF database, and water uptake capacities were computed at 100% and 30% relative humidity. Among several regression models, Light Gradient Boosting Machine (LGBM) achieved the highest predictive accuracy. SHapley Additive exPlanations (SHAP) and correlation analyses identified adsorption energetics, local electrostatics (oxygen and hydrogen partial charges, metal electronegativity), and framework density as the dominant factors, with geometry acting as a secondary modulator. To provide an explicit analytical form for rapid screening and hypothesis generation, we constructed a second-order polynomial regression model using the top SHAP-ranked features. These results advance the fundamental understanding of water adsorption in MOFs and establish a scalable, data-driven framework for the rational design of high-performance materials for AWH.

Design, System, Application

This work aims to establish a molecular design optimization strategy for atmospheric water harvesting using metal–organic frameworks (MOFs). The primary goal is to enhance the water uptake capacity of MOFs; however, the features responsible for higher performance remain unknown. In this study, we developed a robust methodology to identify the key physical and chemical features contributing to high uptake performance. This approach enables high-throughput screening and targeted functionalization of MOFs, facilitating the discovery and development of more efficient materials for atmospheric water harvesting.

Introduction

Access to clean water is an urgent global challenge, exacerbated by climate change, population growth, and pollution-induced stress on freshwater systems.^1,2 Existing solutions, such as desalination, are often energy-intensive and poorly suited to off-grid or inland regions, underscoring the need for decentralized and sustainable water generation technologies.³ Atmospheric water harvesting (AWH), which captures moisture directly from the air, offers a compelling alternative. Among various AWH strategies, including fog collection and dewing, sorbent-based systems stand out for their low energy requirements and ability to operate across a wide range of climates without dependence on large infrastructure.^4–9

Metal–Organic Frameworks (MOFs) have emerged as promising candidates for sorbent-based AWH due to their high surface area, tunable porosity, and customizable surface chemistry.^10–15 These features enable water capture at low relative humidity and regeneration at low temperatures, often using ambient heat or sunlight.^16–18

Several MOFs have demonstrated outstanding AWH performance. For instance, MOF-303 exhibits a hydrophilic–hydrophobic pore environment and reaches a water uptake of 0.7 L kg⁻¹ per day under desert conditions.^19,20 MOF-333 achieves sharp uptake at 22% RH through moderate hydrogen bonding at μ₂-OH sites, while balancing hydrophilic and hydrophobic domains.¹⁹ Derivatives such as MOF-LA2-1 enhance uptake by extending linker length, increasing pore volume without compromising binding affinity.²¹ Other innovations, such as the Mix-MOF hybrid,²² Co-MOF-31,²³ and Cr-Spiro-5 (ref. 24), further illustrate how pore architecture, metal accessibility, linker chirality, and local polarity can be tuned to optimize performance and achieve record working capacity.

These examples demonstrate the design potential of MOFs, but they also reveal a broader challenge: the structure–function relationships that govern water uptake are complex and multifactorial. With an expansive design space spanning metal centers, organic linkers, functional groups, and topologies,²⁵ it is difficult to determine which features most strongly influence adsorption.

Recent studies have addressed this gap using machine learning (ML). Li et al.²⁶ predicted uptake using six descriptors focused on pore geometry and energetics, identifying isosteric heat as most influential. Zhang et al.²⁷ expanded the feature set to include metal identity and structural factors, but were limited by small datasets and coarse chemical resolution. While these models demonstrated predictive power, they lacked sufficient treatment of chemical features such as linker polarity, functional group charge, and local electrostatics, factors known to influence water cluster formation within confined pore environments.

In this study, we present a novel ML framework to investigate the structure–property relationships for water uptake in MOFs. For our model, we selected 2600 chemically and structurally diverse frameworks from the ARC-MOF database.²⁸ Grand Canonical Monte Carlo (GCMC) simulations were performed to generate synthetic data at 100% and 30% relative humidities, mimicking saturated and low-humidity conditions. A comprehensive set of geometric, chemical, and energetic descriptors was extracted for those MOFs. By explicitly combining chemical and geometric features with interpretable machine learning through SHAP analysis, we identify factors controlling water uptake.

Methods

Our analysis integrates molecular simulations, feature engineering, and supervised learning to uncover the structure–function relationship governing water uptake in MOFs. A schematic overview of the pipeline is shown in Fig. 1. The workflow begins with the selection of a representative subset of MOFs from the ARC-MOF database, which comprises about 280 000 MOF architectures, ensuring broad chemical and structural diversity through cluster-based sampling.²⁸ For each MOF, a wide range of features capturing both structural and chemical properties was selected. Geometric descriptors include metrics such as surface area and void fraction. As chemical descriptors, we include atomic property-weighted radial distribution functions (AP-RDF) to encode spatial correlations between atoms, revised autocorrelation (RAC) functions to capture local chemical environments and functional group chemistry, and the MAGPIE framework to describe elemental statistics relevant to hydrophilicity, electronegativity, and bonding character.^28,29 Next, we computed water adsorption for each MOF using Grand Canonical Monte Carlo (GCMC) simulations under fixed temperature and pressure conditions. The simulated uptake capacities served as target values for supervised learning. We trained multiple regression models, such as random forest and gradient boosting, and selected the best-performing model based on cross-validation metrics and prediction accuracy. To interpret the model's behavior and identify the key factors driving adsorption, we applied SHAP (SHapley Additive exPlanations) analysis. SHAP allowed us to distinguish the most impactful geometric and chemical features governing water uptake. Below, we provide a detailed description of our pipeline.


	Fig. 1 Schematic overview of the structure–function investigation of water uptake in metal–organic frameworks (MOFs). The workflow begins with clustering the ARC-MOF database based on structural and chemical features. Grand Canonical Monte Carlo (GCMC) simulations are used to compute the water uptake of each MOF. The data are used to train machine learning (ML) models, the best of which is interpreted through SHAP analysis.

MOF database and feature descriptors

A diverse set of MOFs was retrieved from the ab initio REPEAT charge MOF (ARC-MOF) database.²⁸ The ARC-MOF database comprises approximately 280 000 MOF structures, including around 12 310 experimentally reported MOFs, with the remainder being hypothetical structures.²⁸ In addition, the dataset covers over 1000 features encompassing structural, chemical, and topological descriptors.²⁸ Structural features include key parameters such as largest cavity diameter (LCD), pore limiting diameter (PLD), surface area, and void fraction, which are critical for characterizing the porosity, accessibility, and storage potential of MOFs.^28,30,31

In addition to the structural descriptors, the database includes chemical descriptors relevant to host–guest interactions.³² In particular, the dataset includes atomic property-weighted radial distribution function (AP-RDF) descriptors and revised autocorrelation (RAC) descriptors.²⁸ AP-RDF descriptors incorporate atomic properties such as electronegativity, atomic hardness, van der Waals volume, dipole polarizability, and atomic mass. RAC descriptors are difference-based measures computed over a chemical graph and account for four primary chemical environments: metal-center chemistry, ligand chemistry, functional group chemistry, and linker connectivity.²⁸ Here, a functional group is defined as any heteroatom (neither carbon nor hydrogen) that is part of the linker but does not take part in metal coordination. The combined chemical and geometric descriptors enable a comprehensive description of the MOF dataset.

In addition, we generated composition-based features using the Materials Agnostic Platform for Informatics and Exploration (MAGPIE) framework.²⁹ The MAGPIE descriptors encode chemically meaningful statistics derived from the elemental composition for materials informatics applications. We also constructed a stoichiometric descriptor set consisting of features representing the relative atomic fractions of each element within a MOF structure. Additionally, we include the maximum, minimum, and average atomic charges across each framework.

Beyond these features, we incorporated descriptors designed to capture local water–MOF interactions. Specifically, we introduced a solvent-accessible surface area (SASA)-weighted charge descriptor:


	Q_SASA = Σ_i (SASA_i · q_i) / ASA	(1)

where SASA_i is the solvent-accessible surface area contribution of atom i, q_i is the partial atomic charge, and ASA is the total accessible surface area of the framework. This descriptor reflects the role of charge distribution for water accessibility. Furthermore, we include nearest-neighbor distances between donor–acceptor, donor–donor, and acceptor–acceptor sites within the framework. These descriptors provide a direct quantification of hydrogen-bonding environments and local geometric interactions beyond conventional geometric and compositional descriptors.
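The SASA-weighted charge descriptor can be illustrated with a short sketch. It assumes the descriptor is the sum of per-atom SASA contributions multiplied by the partial charges, normalized by the total accessible surface area, as the definitions above suggest. The function name, the fallback for a missing ASA value, and the toy numbers are hypothetical and are not taken from the original workflow.

```python
def sasa_weighted_charge(sasa, charges, total_asa=None):
    """Return sum_i(SASA_i * q_i) / ASA for one framework.

    sasa    : per-atom solvent-accessible surface area contributions
    charges : per-atom partial charges, in the same atom order
    """
    if len(sasa) != len(charges):
        raise ValueError("sasa and charges must have the same length")
    # Assumption: if the total ASA is not supplied, use the sum of the
    # per-atom contributions.
    asa = sum(sasa) if total_asa is None else total_asa
    if asa <= 0:
        return 0.0  # no accessible surface, so the descriptor is set to zero
    return sum(s * q for s, q in zip(sasa, charges)) / asa


# Toy framework with four atoms (illustrative values only).
sasa = [10.0, 5.0, 0.0, 5.0]
charges = [-0.8, 0.4, 1.2, -0.2]
value = sasa_weighted_charge(sasa, charges)
```

Restricting the sum to atoms of one element (for example, only oxygen or only hydrogen) would give element-specific variants of the same quantity. Whether the original descriptors were built this way is an assumption.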

Exploratory data analysis (EDA)

We utilized the ARC-MOF database, which categorizes MOFs into clusters based on metal node chemistry, organic linker type, framework topology, and functionalization/charge features.²⁸ From these pre-defined clustering schemes, we selected the metal-centers scheme and focused on the 500 largest clusters, which together include 260 533 MOFs from the full database. Rather than selecting one representative per cluster, we sampled representatives in proportion to each cluster's size to avoid bias. Within each cluster, MOFs were selected at random. We then generated the features for each selected MOF.
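A minimal sketch of size-proportional random sampling from clusters is given below. The function name, the rounding rule, the minimum of one representative per cluster, and the toy cluster labels are assumptions made for illustration; the source does not specify these details.

```python
import random


def proportional_sample(clusters, n_total, seed=0):
    """Pick random members from each cluster in proportion to its size.

    clusters : dict mapping a cluster id to a list of MOF identifiers
    n_total  : approximate total number of MOFs to select
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in clusters.values())
    picked = {}
    for cluster_id, members in clusters.items():
        # Quota proportional to cluster size. Assumption: every cluster
        # contributes at least one MOF and never more than it contains.
        quota = max(1, round(n_total * len(members) / total))
        quota = min(quota, len(members))
        picked[cluster_id] = rng.sample(members, quota)
    return picked


# Toy clusters keyed by metal center (hypothetical identifiers).
clusters = {
    "Zn": [f"Zn-{i}" for i in range(60)],
    "Cu": [f"Cu-{i}" for i in range(30)],
    "Zr": [f"Zr-{i}" for i in range(10)],
}
picked = proportional_sample(clusters, n_total=10)
```

Because each quota is rounded independently, the number of selected MOFs can differ slightly from `n_total` when many clusters are involved.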

Before model training, we examined the distribution of features and quantified their skewness using the measure:


	s = (1/n) Σ_{i=1}^{n} [(x_i − x̄)/σ]³	(2)

where x_i is an individual data point, x̄ is the sample mean, n is the number of observations in the sample, and σ is the sample standard deviation. Features with |s| ≥ 1 were classified as skewed and subjected to logarithmic transformation to avoid bias in the learning algorithm. Following this step, we applied min–max scaling to each feature.
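The preprocessing described here can be sketched as follows, assuming the standard third-moment definition of skewness with the sample standard deviation. The shift applied before the logarithm (so that non-positive values are handled) and the function names are assumptions for illustration, not details given in the source.

```python
import math
import statistics


def skewness(values):
    """Third standardized moment: (1/n) * sum(((x - mean) / sigma) ** 3)."""
    n = len(values)
    mean = sum(values) / n
    sigma = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sigma == 0:
        return 0.0
    return sum(((x - mean) / sigma) ** 3 for x in values) / n


def preprocess(values, threshold=1.0):
    """Log-transform a skewed feature, then min-max scale it to [0, 1]."""
    if abs(skewness(values)) >= threshold:
        # Assumption: shift so the minimum is zero, then apply log(1 + x).
        lowest = min(values)
        values = [math.log1p(x - lowest) for x in values]
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(x - lowest) / (highest - lowest) for x in values]


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
skewed = [1.0] * 9 + [100.0]
scaled = preprocess(skewed)
```

In this toy case the symmetric feature is only rescaled, whereas the long-tailed feature exceeds the |s| ≥ 1 threshold and is log-transformed before scaling.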

Grand Canonical Monte Carlo simulations

Grand canonical Monte Carlo (GCMC) simulations were carried out to evaluate water adsorption properties under controlled thermodynamic conditions using the RASPA simulation package.³³ Water molecules were modeled using the TIP4P/2005 force field, which has been shown in previous studies to accurately reproduce the thermodynamic behavior of water vapor.^20,34 MOF frameworks were treated as rigid during the simulations, and their intramolecular interactions were modeled using the Universal Force Field (UFF).³⁵ Atomic partial charges were directly obtained from the ARC-MOF database, where REPEAT-derived charges were assigned based on periodic DFT calculations.^28,36

A supercell of at least 2 × 2 × 2 unit cells was generated for each structure. In cases where the primitive cell was too small, larger supercells were constructed to ensure that the simulation box exceeded twice the Lennard-Jones cutoff distance of 12.0 Å. van der Waals interactions between guest and host atoms were determined using the Lorentz–Berthelot mixing rules.
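The supercell rule can be sketched as below: the number of repeats along each lattice direction is chosen so that the perpendicular width of the box exceeds twice the 12.0 Å cutoff, with a minimum of two repeats. Using perpendicular widths (cell volume divided by the area of the opposite face) for non-orthogonal cells is an assumption of this sketch; the helper names are hypothetical.

```python
import math


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def supercell_repeats(a, b, c, cutoff=12.0, min_repeat=2):
    """Repeats along a, b, c so each box width exceeds 2 * cutoff."""
    volume = abs(dot(a, cross(b, c)))
    # Perpendicular width along each lattice vector.
    widths = (volume / norm(cross(b, c)),
              volume / norm(cross(c, a)),
              volume / norm(cross(a, b)))
    # floor(...) + 1 guarantees the width is strictly larger than 2 * cutoff.
    return tuple(max(min_repeat, math.floor(2 * cutoff / w) + 1)
                 for w in widths)


# Orthorhombic toy cell with edges of 7, 15, and 40 angstrom.
repeats = supercell_repeats((7.0, 0.0, 0.0), (0.0, 15.0, 0.0), (0.0, 0.0, 40.0))
```

For this toy cell the short 7 Å edge needs four repeats, while the longer edges fall back to the minimum of two.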

To determine topological properties, we computed the helium void fraction using GCMC simulations. Water uptake capacities were calculated at 298 K and water vapor pressures of 699.8 Pa and 209.4 Pa, corresponding to 100% and 30% RH, respectively. Each GCMC simulation began with an initialization stage of 5000 cycles, followed by an equilibration and production phase of up to 1 000 000 cycles. Convergence was carefully monitored, and simulations were considered equilibrated when the average water loading stabilized within a 5% error margin, consistent with our earlier work.²⁰ The resulting loading values were then used as labels for subsequent machine learning modeling and structure–function relationship analysis.

Machine learning model development and testing

To train and evaluate the models, the dataset was randomly split into 95% training data and 5% testing data. We formulate the problem as a regression model, with the adsorption uptake value serving as the continuous target variable. Model performance was primarily assessed using the coefficient of determination (R² score), a metric that quantifies how well the predicted values approximate the true values.³⁷ While R² was chosen as the primary criterion due to its robustness against variations in the scale and distribution of the target values, we also report mean absolute error (MAE) and root mean squared error (RMSE) to provide a more comprehensive evaluation of model accuracy.
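The evaluation protocol can be sketched with plain Python as follows: a random 95/5 split followed by the three reported metrics. The function names, the fixed seed, and the toy uptake values are illustrative assumptions; the original study used scikit-learn utilities whose exact settings are not given here.

```python
import math
import random


def train_test_split(items, test_fraction=0.05, seed=0):
    """Randomly split a list into (train, test) parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, and RMSE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


train, test = train_test_split(list(range(2600)))
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0],
                             [110.0, 190.0, 310.0, 390.0])
```

With 2600 frameworks, a 5% hold-out corresponds to 130 test structures, so the test-set metrics are computed on a comparatively small sample.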

Multiple regression models were explored using the scikit-learn library, including Random Forest Regressor, Gradient Boosting Model, and Extra Trees Regressor, among others.³⁸ Each model was optimized through hyperparameter tuning to achieve reliable predictive accuracy with minimal deviation from the GCMC calculated uptake values.

Model interpretation and feature importance analysis

After training the machine learning models, feature importance analysis was conducted to interpret the structure–property relationships governing water adsorption in MOFs. The SHapley Additive exPlanations (SHAP) framework was employed to quantify the contribution of each input feature to the model predictions.³⁹ SHAP values provide a unified measure of feature importance by attributing changes in the predicted output to changes in individual input features. Furthermore, the directionality of feature impacts was analyzed by examining whether higher or lower feature values enhance adsorption performance.
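To make the attribution idea concrete, the sketch below computes exact Shapley values for a small toy model by enumerating all feature subsets against a single baseline point. This is a brute-force illustration of the underlying concept and is not the algorithm used by the SHAP library or in this study; the toy model and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in the subset take their actual value, others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # One additive term plus one pairwise interaction.
    return 3.0 * z[0] + 2.0 * z[1] * z[2]


phi = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

The attributions sum to the difference between the prediction and the baseline output, and the interaction term is shared equally between the two features involved. The cost grows exponentially with the number of features, which is why approximate or model-specific algorithms are used in practice.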

Feature correlation via polynomial expression

To obtain an explicit expression correlating the input features with the output, multivariate polynomial regression was performed using the PolynomialFeatures transformer in conjunction with a linear regression model, as implemented in scikit-learn. This approach expands the original feature set to include all interactions and self-product terms up to a specified polynomial degree, allowing for the modeling of nonlinear relationships using a linear estimator. The transformed feature matrix was then used to fit a least-squares regression model, from which the analytical form of the polynomial was extracted. We used the top five features as per the SHAP analysis provided. The quality of the resulting expression is assessed by a parity plot. While black-box models (e.g., Random Forest or LightGBM) generally achieve higher predictive accuracy, they lack the transparency needed to guide experimental design in an interconnected manner. By using a polynomial model, we strike a balance between interpretability and predictive power, enabling experimentalists to extract actionable relationships from the data while still maintaining reasonable accuracy. We emphasize that polynomial models are not intended to replace high-performance machine learning models but to complement them by providing transparent guidance.
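The feature expansion behind such a polynomial model can be sketched without any library: every product of up to `degree` features is enumerated, including the constant term. The function names are hypothetical, and the term ordering shown here is a choice of this sketch rather than a statement about the original implementation.

```python
from itertools import combinations_with_replacement


def polynomial_terms(n_features, degree=2):
    """Index tuples for all monomials up to the given degree."""
    terms = []
    for d in range(degree + 1):
        terms.extend(combinations_with_replacement(range(n_features), d))
    return terms


def expand(x, degree=2):
    """Evaluate every monomial for one sample x."""
    row = []
    for term in polynomial_terms(len(x), degree):
        product = 1.0
        for index in term:
            product *= x[index]
        row.append(product)
    return row


n_terms = len(polynomial_terms(5, degree=2))
example = expand([2.0, 3.0])  # [1, x0, x1, x0^2, x0*x1, x1^2]
```

For five input features and a second-order expansion this gives 21 terms: one constant, five linear terms, and fifteen quadratic or pairwise-interaction terms. A linear least-squares fit on these columns then yields the polynomial coefficients.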

Results

The study integrates database screening, feature extraction, molecular simulations, machine learning modeling, and model interpretation to establish the structure–function relationship of metal–organic frameworks for atmospheric water harvesting (AWH) applications. To represent the MOF chemical space, we selected about 2600 distinct MOFs from the ARC-MOF database. For those structures, we performed GCMC simulations to compute water adsorption at 100% and 30% relative humidity (see Grand Canonical Monte Carlo simulations in the Methods section for details). Following data generation, we used ML to identify the structure–function relationships for AWH. A schematic overview of the methodology is shown in Fig. 1. Further details of our approach can be found in the Methods section.

The distribution of simulated uptake capacities generated from GCMC is shown in Fig. 2, where water uptake was studied at two different relative humidities. At 100% RH (Fig. 2A), uptake values are broadly distributed, with several frameworks exceeding 800 mg g⁻¹, reflecting their ability to store large amounts of water under saturated conditions. At 30% RH (Fig. 2B), uptake values are lower and concentrated below 200 mg g⁻¹, indicating reduced adsorption. The comparison between 100% and 30% RH (Fig. 2C) reveals that, although some high-capacity materials perform well at both high and low humidity, the overall correlation is weak, suggesting the need for a separate ML model at each humidity.


	Fig. 2 Distributions of water uptake capacities obtained from GCMC simulations. A. Distribution of uptake capacities at 100% RH across the dataset. B. Distribution of uptake capacities at 30% RH across the dataset. C. Comparison of uptake capacities at 100% RH versus 30% RH, showing the relationship between low- and high-humidity performance. The dashed line indicates equal uptake at both conditions.

The ML models were trained with the GCMC data as labels and features detailed in the Methods section. The dataset was used to train and evaluate regression models both with and without energetic descriptors. The best-performing models for each case are presented in Fig. 3, while results for additional models are provided in Fig. S1–S4.


	Fig. 3 Machine learning performance for predicting water uptake capacity at different relative humidities. A. Parity plot at 100% RH with energetic descriptors using Light Gradient Boosting Machine (LGBM) as the best model. B. Parity plot at 100% RH without energetic descriptors using multilayer perceptron (MLP) as the best model. C. Parity plot at 30% RH with energetic descriptors using LGBM as the best model. D. Parity plot at 30% RH without energetic descriptors using LGBM as the best model. Red points represent training data, and blue points represent test data. The dashed line indicates perfect prediction.

At 100% RH, the inclusion of energetic descriptors led to the Light Gradient Boosting Machine (LGBM) achieving the highest predictive accuracy (Fig. 3A). The model yielded R_train² = 0.967 and R_test² = 0.848, with MAE values of 17.97 (train) and 50.59 (test), and RMSE values of 41.90 (train) and 93.62 (test). In contrast, when energetic descriptors were excluded, the Multilayer Perceptron (MLP) was the best model (Fig. 3B), but with reduced performance where R_train² = 0.897 and R_test² = 0.720, MAE values of 35.98 (train) and 65.13 (test), and RMSE values of 74.20 (train) and 126.01 (test).

A similar trend was observed at 30% RH. With energetic descriptors, the LGBM provided the best predictions (Fig. 3C), achieving R_train² = 0.997 and R_test² = 0.791, with MAE values of 1.37 (train) and 10.65 (test), and RMSE values of 2.24 (train) and 20.79 (test). When energetic descriptors were excluded, the LGBM remained the best model but with lower accuracy (Fig. 3D), producing R_train² = 0.953 and R_test² = 0.679, MAE values of 4.14 (train) and 12.39 (test), and RMSE values of 9.71 (train) and 25.74 (test). Additional models, including Extreme Gradient Boosting, Random Forest, Gradient Boosting Regressor, and Extra Trees Regressor, are presented in Fig. S1–S4 under the same conditions. These results confirm that energetic descriptors consistently enhance predictive power across the two humidity conditions, with the improvement being especially pronounced at 100% RH.

To interpret the predictions of the machine learning models and identify the key structural and chemical features of water uptake, we employed SHAP analysis combined with correlation analysis (Fig. 4). This approach highlights which descriptors most strongly influence model predictions and how they correlate with water uptake under different humidity conditions and descriptor sets.


	Fig. 4 Feature importance and inter-descriptor correlations at different humidities and descriptor sets. A–D. SHAP summary plots ranking the ten most important descriptors by mean absolute SHAP value, quantifying their contributions to water uptake prediction. Feature values are color-coded from low (blue) to high (red). A. 100% RH with energetic descriptors. B. 100% RH without energetic descriptors. C. 30% RH with energetic descriptors. D. 30% RH without energetic descriptors. E–H. Pearson correlation heatmaps of the same descriptor sets, illustrating linear correlations among top features and their relationship to uptake capacity. E. 100% RH with energetic descriptors. F. 100% RH without energetic descriptors. G. 30% RH with energetic descriptors. H. 30% RH without energetic descriptors.

At 100% RH with energetic descriptors (Fig. 4A), the most important features are the Henry coefficient, framework density, and the most negative oxygen charge. Additional descriptors such as the oxygen-to-metal ratio and the oxygen charge–weighted SASA ratio further emphasize the role of charge localization and accessible surface area in saturated adsorption.

When energetic descriptors are excluded (Fig. 4B), the top descriptors are accessible void fraction, linker electronegativity, and accessible surface area, which capture pore volume and polarity. Other high-ranking descriptors, including linker polarizability and donor–donor distance, highlight the role of framework chemistry and topology when energetic terms are absent.

At 30% RH with energetic descriptors (Fig. 4C), the Henry coefficient again dominates, followed by the most negative oxygen charge and the hydrogen charge–weighted SASA ratio. Their positive correlations with uptake indicate that localized electrostatics and charge distributions strongly influence adsorption under unsaturated conditions. Additional descriptors such as metal center electronegativity and oxygen content reflect the importance of the framework's electronic environment at lower humidity.

In contrast, at 30% RH without energetic descriptors (Fig. 4D), the model depends primarily on chemical descriptors. Interestingly, no geometric feature ranks among the top descriptors. The most important features are metal center electronegativity, linker group nuclear charge, and the most negative oxygen charge. The framework volume contributes only secondarily to uptake under low-humidity conditions.

The correlation heatmaps (Fig. 4E–H) complement the SHAP results by quantifying direct relationships among descriptors and uptake. At both humidities, the energetic descriptor (Henry coefficient) displays the strongest correlation with uptake, whereas purely geometric or chemical descriptors show weaker or more variable relationships. These patterns reinforce the dominant role of adsorption energetics, supported by chemical functionality, as the primary enablers of water harvesting in MOFs.

Furthermore, to visualize how individual features influence water uptake, we plotted the top-ranked descriptors against uptake capacity at 100% and 30% RH when energetic descriptors are included (Fig. 5). At 100% RH (Fig. 5A–D), the Henry coefficient shows a strong correlation with uptake for MOFs with water uptake above 500 mg g⁻¹: the lower the Henry coefficient, the higher the uptake. For low-capacity MOFs, we observe no correlation between uptake capacity and the Henry coefficient (Fig. 5A). Framework density gives rise to a nonmonotonic dependence of water uptake (Fig. 5B): materials with densities of 0.8–1 g cm⁻³ are the best-performing MOFs. The most negative oxygen charge also shows a distinct optimal region (Fig. 5C), indicating that framework oxygen charges of about −0.7 to −0.9 are optimal for water adsorption. Interestingly, this range corresponds to the charge of the water oxygen, suggesting that the more water-like the framework, the better its uptake capacity at high humidity. Lastly, the near-metal neighbor size shows an inverse relationship with capacity (Fig. 5D). The smaller the neighboring atom, the higher the uptake, suggesting that the size of the metal site plays a role by controlling the distance at which water molecules approach the linkers.


	Fig. 5 Relationship between water uptake capacity and top-ranked SHAP descriptors. Scatter plots show uptake capacity as a function of the four most important features at 100% RH and 30% RH when energetic descriptors are included. Data points are color-coded by uptake capacity (mg g⁻¹). A–D. Top four descriptors at 100% RH with energetic descriptors. A. Henry coefficient. B. Framework density. C. Most negative oxygen charge. D. Near-metal neighbor size. E–H. Top four descriptors at 30% RH with energetic descriptors. E. Henry coefficient. F. Most negative oxygen charge. G. Hydrogen charge–weighted SASA ratio. H. Metal center electronegativity.

At 30% RH (Fig. 5E–H), similar trends are observed, although chemical descriptors play a more pronounced role. The Henry coefficient continues to differentiate high- and low-uptake MOFs (Fig. 5E), but the most negative oxygen charge emerges as particularly influential (Fig. 5F), reinforcing the importance of charge localization for adsorption at lower humidity. The hydrogen charge–weighted SASA ratio exhibits an optimal range where uptake is maximized (Fig. 5G). This descriptor captures the interplay between accessible surface area and favorable hydrogen bonding environments, suggesting that frameworks with balanced electrostatics and geometry are most effective. Finally, metal center electronegativity also shows a narrow window of best performance (Fig. 5H), indicating that the electronegativity of the metal nodes must be balanced against the water interaction to promote water clustering at low humidity.

Finally, to quantify the influence of key structural and chemical features on water uptake, we constructed an interpretable polynomial regression model using the top five SHAP-ranked descriptors: X₀ (Henry coefficient), X₁ (framework density), X₂ (most negative oxygen charge), X₃ (near-metal neighbor size), and X₄ (oxygen-to-metal ratio). A second-order polynomial expansion was employed to capture both nonlinear dependencies and pairwise interactions among these variables. The resulting analytical expression, presented in eqn (3), offers a closed-form mapping from the descriptor space to predicted uptake capacity, enabling transparent interpretation of feature contributions.


y = 1519.83 − 17749.21X₀ + 2216.44X₁ − 1684.98X₂ − 5894.77X₃ + 1329.73X₄ + 44326.62X²₀ − 47304.39X₀X₁ + 23780.40X₀X₂ + 39182.92X₀X₃ − 23218.41X₀X₄ − 1984.60X²₁ − 436.80X₁X₂ − 1559.90X₁X₃ + 1077.38X₁X₄ − 993.78X²₂ + 5939.34X₂X₃ − 363.51X₂X₄ + 3874.08X²₃ − 1479.58X₃X₄ − 666.10X²₄	(3)
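As a convenience for rapid screening, eqn (3) can be evaluated with a few lines of Python. The coefficients below are copied from eqn (3). The function name is hypothetical, and the sketch assumes that the five inputs are supplied in the order X₀ to X₄ and have been preprocessed in the same way as the training features (which is not restated alongside the equation).

```python
from itertools import combinations_with_replacement

# Coefficients of eqn (3), ordered as: constant, X0..X4, then the
# second-order terms X0^2, X0X1, X0X2, X0X3, X0X4, X1^2, ..., X4^2.
COEFFS = [
    1519.83,
    -17749.21, 2216.44, -1684.98, -5894.77, 1329.73,
    44326.62, -47304.39, 23780.40, 39182.92, -23218.41,
    -1984.60, -436.80, -1559.90, 1077.38,
    -993.78, 5939.34, -363.51,
    3874.08, -1479.58,
    -666.10,
]


def predict_uptake(x):
    """Evaluate the second-order polynomial of eqn (3) for five descriptors."""
    if len(x) != 5:
        raise ValueError("expected five descriptors X0..X4")
    # Same ordering as COEFFS: constant, linear terms, then pairs.
    terms = [()] + [(i,) for i in range(5)]
    terms += list(combinations_with_replacement(range(5), 2))
    y = 0.0
    for coefficient, term in zip(COEFFS, terms):
        product = 1.0
        for index in term:
            product *= x[index]
        y += coefficient * product
    return y


intercept = predict_uptake([0.0, 0.0, 0.0, 0.0, 0.0])
density_only = predict_uptake([0.0, 1.0, 0.0, 0.0, 0.0])
```

Setting all descriptors to zero returns the constant term, and varying one descriptor at a time exposes its linear and quadratic contributions. Given the modest fit quality of the polynomial model, such predictions are best treated as qualitative guidance.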

The polynomial model was trained on the same dataset at 100% RH with energetic descriptors and evaluated using a parity plot comparing predicted and actual water uptake values (Fig. 6). The model achieves an R² of 0.58, which is lower than that of the LGBM model (R² = 0.85). Although this reduced-order model does not match the predictive accuracy of the LGBM model, it serves as a transparent and interpretable surrogate. Its simplicity makes it especially useful for rapid screening, hypothesis generation, and informing the rational design of new frameworks. Similar training at low humidity yielded an R² of only 0.3; that model is therefore not reported as reliable.


	Fig. 6 Parity plot and polynomial regression model for water uptake from top five SHAP-ranked descriptors. The model incorporates linear, quadratic, and pairwise interaction terms of Henry coefficient (X₀), framework density (X₁), most negative oxygen charge (X₂), near metal neighbor size (X₃), and oxygen-to-metal ratio (X₄). Blue circles denote model predictions and the dashed line corresponds to perfect parity (y = x).

Discussion and conclusion

Metal–organic frameworks (MOFs) offer considerable promise for atmospheric water harvesting (AWH) due to their high surface area, tunable porosity, and chemically versatile structures. These features enable efficient water adsorption even under low-humidity conditions and allow for water release at low regeneration energies, making MOFs well-suited for off-grid and sustainable water capture systems. However, a fundamental understanding of the structural and chemical factors governing water uptake remains limited, posing a barrier to the rational design of next-generation AWH materials.

To address this challenge, we developed a data-driven framework that combines Grand Canonical Monte Carlo (GCMC) simulations with interpretable machine learning to analyze structure–function relationships in a chemically and structurally diverse set of 2600 MOFs. The GCMC-derived uptake distributions reveal a pronounced contrast between performance at 100% and 30% relative humidity. While some frameworks achieve capacities exceeding 800 mg g⁻¹ under saturation, adsorption at 30% RH is markedly lower and shows weak correlation with high-humidity performance. These results demonstrate that saturation capacity alone is not a reliable screening metric and underscore the importance of evaluating materials under application-relevant low-humidity conditions.

Machine learning analysis reveals that water uptake capacity is controlled primarily by adsorption energetics and the chemical environment. At both humidity levels, the Light Gradient Boosting Machine (LGBM) performed best when energetic descriptors such as the Henry coefficient were included, yielding R_test² = 0.85 at 100% RH and 0.79 at 30% RH. Removing these descriptors led to a marked drop in accuracy, showing that geometry alone cannot explain uptake behavior. Instead, capacity arises from the coupled effects of adsorption energetics and chemically driven interactions, with geometry playing a secondary but complementary role. This distinction reflects a broader difference between water and nonpolar gases. Traditional MOF design strategies for gas storage have emphasized geometric optimization, including large pore volume, high surface area, and tailored pore size distributions, but such approaches are less effective for polar adsorbates. For nonpolar gases like CO₂ and CH₄, uptake scales predictably with accessible volume and surface area, as molecules behave as nearly hard spheres that pack into available pore space.^40,41 By contrast, water adsorption depends not only on confinement but also on hydrogen bonding, electrostatics, and cooperative cluster growth, making geometric expansion alone insufficient to guarantee higher uptake.

Our SHAP and correlation analyses clarify the multivariate nature of water adsorption. At 100% RH, adsorption energetics (Henry coefficient) dominate, complemented by structural packing (framework density) and charge localization (oxygen partial charges). At 30% RH, electrostatics become even more critical, with oxygen charge, hydrogen charge–weighted SASA ratios, and metal electronegativity shaping performance. Geometric features such as void fraction and accessible surface area act as secondary modulators, refining uptake capacity but rarely dictating performance alone. A larger Henry coefficient and more electronegative oxygen atoms correlate positively with uptake, while higher density and larger metal neighbors suppress adsorption. These insights align with recent experimental studies. For example, Hanikel et al.²¹ showed that MOF-LA2-1 outperforms MOF-303 due to a 40% increase in pore volume and surface area from extended linkers, yet our analysis indicates that such geometric expansion alone cannot explain high capacity. Instead, strong linker electronegativity and balanced electrostatic environments are the critical factors.¹⁸ Likewise, recent work on Zn₅(OAc)₄(TBTT)₂ demonstrated that Ni exchange enhanced uptake to 0.98 g g⁻¹ not only through increased porosity but also via open metal sites and greater hydrophilicity, which strengthen water–framework interactions.⁴² Similarly, Ni₂Br₂BTDD outperformed its chlorinated analogue despite a smaller pore size, highlighting how subtle variations in electronic environment and confinement create liquid-like hydrogen bonding networks that favor adsorption.⁴³ These findings reinforce that water uptake in MOFs arises from the coupled effects of energetics, chemistry, and structure, with geometric structure playing a secondary role.
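
To make the attribution idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a small toy function. It is not the procedure used in this study, which applies SHAP to the trained LGBM model. The function `toy_uptake`, its coefficients, the descriptor names, and the single all-zero baseline are hypothetical choices made only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attributions of model(x) relative to model(baseline).

    Features outside a coalition are held at their baseline value.
    Cost grows as 2**n, so this is only practical for a few features.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(point)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_uptake(f):
    """Hypothetical surrogate with one coupled term; NOT the fitted model."""
    return (100.0 * f["henry"]
            + 50.0 * f["henry"] * f["o_charge"]
            - 80.0 * f["density"])


# Hypothetical, dimensionless descriptor values for one made-up framework.
x = {"henry": 2.0, "o_charge": 1.0, "density": 0.5}
baseline = {"henry": 0.0, "o_charge": 0.0, "density": 0.0}

phi = shapley_values(toy_uptake, x, baseline)
```

Two properties are visible in this toy case: the attributions sum to the difference between the prediction and the baseline prediction, and the contribution of the coupled `henry * o_charge` term is split equally between the two descriptors involved, which is why a descriptor can rank highly even when much of its effect comes through interactions.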

The novelty of this study lies in combining chemically rich descriptors with SHAP interpretation to study the drivers of water uptake in a diverse set of MOFs. Unlike prior ML studies that focused on geometric factors and investigated only the best-performing subset of MOFs, our analysis takes a broader view of the problem. It demonstrates that adsorption capacity is primarily controlled by adsorption energetics and chemically driven interactions, with geometry playing a secondary role. Features such as linker electronegativity, oxygen and hydrogen partial charges, and metal coordination environment consistently emerged as key predictors across humidity conditions, highlighting the central role of local electrostatics in governing hydrogen bonding and cluster nucleation.

To translate these insights into an explicit and interpretable form, we constructed a second-order polynomial regression model using the top SHAP-ranked descriptors. Although its predictive accuracy (R² = 0.58) is lower than that of the LGBM model, the closed-form expression captures nonlinear dependencies and interaction effects, with relatively large coefficients assigned to terms coupling the Henry coefficient, oxygen charge, and metal neighbor size. These coupled terms align with SHAP-derived trends, confirming that water uptake arises not from isolated features but from their interplay. Despite reduced accuracy, the polynomial model provides a transparent surrogate that enables hypothesis generation, rapid screening, and rational design.
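
The structure of such a second-order model can be sketched as follows. This is a minimal illustration that fits all constant, linear, squared, and pairwise terms by ordinary least squares on synthetic data; the two descriptors, their values, and the generating coefficients are hypothetical, and the code is not the fitting procedure or the fitted expression from this study.

```python
from itertools import combinations_with_replacement


def quadratic_terms(row):
    """Expand [x1, ..., xn] into [1, x_i ..., x_i * x_j for i <= j]."""
    terms = [1.0] + list(row)
    terms += [row[i] * row[j]
              for i, j in combinations_with_replacement(range(len(row)), 2)]
    return terms


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_quadratic(rows, targets):
    """Least-squares fit via the normal equations (X^T X) beta = X^T y."""
    design = [quadratic_terms(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(p)]
    return solve(xtx, xty)


# Synthetic data from a known quadratic in two hypothetical descriptors,
# h (Henry-like) and q (charge-like): y = 10 + 4h - 2q + 3hq.
rows = [[h, q] for h in (0.0, 1.0, 2.0, 3.0) for q in (0.0, 0.5, 1.0)]
targets = [10.0 + 4.0 * h - 2.0 * q + 3.0 * h * q for h, q in rows]

# Coefficient order: [1, h, q, h*h, h*q, q*q]
beta = fit_quadratic(rows, targets)
```

Because the synthetic targets contain no noise, the fit recovers the generating coefficients, including the h*q interaction term; with real data the coefficients would only approximate the underlying trend. The normal equations are used here to keep the example dependency-free, and a library least-squares routine would normally be preferred for numerical stability.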

Neural network models were not included in this study. While such models can achieve strong predictive performance, they generally require larger datasets and more complex architectures than were investigated here. They are also computationally more demanding and provide limited interpretability, as feature attribution methods are less transparent. In contrast, tree-based ensemble methods such as LGBM offer both accuracy on small-to-moderate datasets and straightforward interpretability through SHAP analysis. This combination allowed us to identify the structural and chemical factors dictating water uptake, an objective that would have been more challenging to achieve with neural network approaches.

Despite these advances, several limitations should be noted. First, framework flexibility and potential hysteresis in adsorption/desorption were not modeled, which may affect real-world behavior. Second, while our feature set is extensive, it may still omit descriptors related to linker dynamics or long-range interactions. Future work should integrate experimental validation of predicted high-performance candidates and extend the modeling framework to capture kinetic effects and multi-cycle stability.

Our focus in this study was limited to investigating the structure–function relationship for atmospheric water harvesting. The 2600 MOFs we analyzed comprise a small subset of the ARC-MOF database. We did not aim to screen the entire database; therefore, the top performers in our set may not be the best overall. That said, we identified one cluster with water uptake around 1621 mg g⁻¹ and another around 1444 mg g⁻¹. These values are on par with the best-performing MOFs reported for water uptake.^17,44 Further research is needed to screen the full database and later to assess stability and costs in order to propose materials suitable for real-world applications.⁴⁵

In conclusion, this study provides a scalable predictive platform and mechanistic insight into the factors that control water adsorption in MOFs. By combining molecular simulation, advanced feature engineering, and interpretable machine learning, we move beyond surface-area-based heuristics and offer chemically informed design rules. These findings support a shift toward rational, data-driven discovery of MOFs tailored for atmospheric water harvesting, an essential step toward sustainable water access in resource-constrained environments.

Author contributions

Conceptualization: S. K.; methodology: N. K., S. A. A., S. K.; data curation: N. K., S. A. A.; validation: N. K., S. A. A.; formal analysis: N. K., S. A. A.; investigation: N. K., S. A. A., S. K.; funding acquisition: S. K.; project administration: S. K.; resources: S. K.; supervision: S. K.; visualization: N. K., S. K.; writing – original draft: N. K., S. K.; writing – review & editing: N. K., S. A. A., S. K.

Conflicts of interest

There are no conflicts to declare.

Data availability

This study was carried out using publicly available data from the ARC-MOF database at https://pubs.acs.org/doi/10.1021/acs.chemmater.2c02485.

Supplementary information: parity plots and performance metrics of additional machine learning models at 30% and 100% relative humidity, with and without energetic descriptors. See DOI: https://doi.org/10.1039/d5me00101c.

Acknowledgements

Computational research was carried out on High-Performance Computing resources at New York University Abu Dhabi. This research is supported by the AD181 faculty research grant. This material is based upon works supported by Tamkeen under NYUAD RRC Grant No. CG011.

References

1. M. M. Mekonnen and A. Y. Hoekstra, Four billion people facing severe water scarcity, Sci. Adv., 2016, 2, e1500323.
2. A. du Plessis, Current and future water scarcity and stress, in Water as an inescapable risk: current global water availability, quality and risks with a specific focus on South Africa, 2019, pp. 13–25.
3. V. G. Gude, Desalination and sustainability–an appraisal and current perspective, Water Res., 2016, 89, 87–106.
4. I. Borne and A. I. Cooper, Sorbent-based atmospheric water harvesting: engineering challenges from the process to molecular scale, J. Mater. Chem. A, 2025, 13, 4838–4850.
5. T. Xiang, S. Xie, G. Chen, C. Zhang and Z. Guo, Recent advances in atmospheric water harvesting technology and its development, Mater. Horiz., 2025, 12, 1084–1105.
6. A. Lee, M.-W. Moon, H. Lim, W.-D. Kim and H.-Y. Kim, Water harvest via dewing, Langmuir, 2012, 28, 10183–10191.
7. M. Qadir, G. C. Jiménez, R. L. Farnum and P. Trautwein, Research history and functional systems of fog water harvesting, Front. Water, 2021, 3, 675269.
8. W. Shi, W. Guan, C. Lei and G. Yu, Sorbents for atmospheric water harvesting: from design principles to applications, Angew. Chem., 2022, 134, e202211267.
9. M. Bilal, M. Sultan, T. Morosuk, W. Den, U. Sajjad, M. M. Aslam, M. W. Shahzad and M. Farooq, Adsorption-based atmospheric water harvesting: A review of adsorbents and systems, Int. Commun. Heat Mass Transfer, 2022, 133, 105961.
10. J. Wang, W. Ying, L. Hua, H. Zhang and R. Wang, Global water yield strategy for metal-organic-framework-assisted atmospheric water harvesting, Cell Rep. Phys. Sci., 2023, 4(12), 101742.
11. H. Lu, W. Shi, Y. Guo, W. Guan, C. Lei and G. Yu, Materials engineering for atmospheric water harvesting: progress and perspectives, Adv. Mater., 2022, 34, 2110079.
12. T. Z. Wasti, M. Sultan, M. Aleem, U. Sajjad, M. Farooq, H. M. Raza, M. U. Khan and S. Noor, An overview of solid and liquid materials for adsorption-based atmospheric water harvesting, Adv. Mech. Eng., 2022, 14, 16878132221082768.
13. Y. Meng, Y. Dang and S. L. Suib, Materials and devices for atmospheric water harvesting, Cell Rep. Phys. Sci., 2022, 3(7), 100976.
14. M. O. Alhurmuzi, M. Danışmaz and O. A. Zainal, Investigation of silica gel performance on potable water harvesting from ambient air using a rotatable apparatus with a solar tracking system, Desalin. Water Treat., 2023, 304, 12–24.
15. A. LaPotin, H. Kim, S. R. Rao and E. N. Wang, Adsorption-based atmospheric water harvesting: impact of material and component properties on system-level performance, Acc. Chem. Res., 2019, 52, 1588–1597.
16. Z. Zheng, H. L. Nguyen, N. Hanikel, K. K.-Y. Li, Z. Zhou, T. Ma and O. M. Yaghi, High-yield, green and scalable methods for producing MOF-303 for water harvesting from desert air, Nat. Protoc., 2023, 18, 136–156.
17. W. Xu and O. M. Yaghi, Metal–organic frameworks for water harvesting from air, anywhere, anytime, ACS Cent. Sci., 2020, 6, 1348–1354.
18. N. Hanikel, M. S. Prévot and O. M. Yaghi, MOF water harvesters, Nat. Nanotechnol., 2020, 15, 348–355.
19. N. Hanikel, X. Pei, S. Chheda, H. Lyu, W. Jeong, J. Sauer, L. Gagliardi and O. M. Yaghi, Evolution of water structures in metal-organic frameworks for improved atmospheric water harvesting, Science, 2021, 374, 454–459.
20. N. Alkhatib, N. Naleem and S. Kirmizialtin, How Does MOF-303 Achieve High Water Uptake and Facile Release Capacity?, J. Phys. Chem. C, 2024, 128, 8384–8394.
21. N. Hanikel, D. V. Kurandina, S. Chheda, Z. Zheng, Z. Rong, S. E. Neumann, J. Sauer, J. I. Siepmann, L. Gagliardi and O. M. Yaghi, MOF Linker Extension Strategy for Enhanced Atmospheric Water Harvesting, ACS Cent. Sci., 2023, 9, 551–557.
22. F. Luo, X. Liang, W. Chen, S. Wang, X. Gao, Z. Zhang and Y. Fang, High-efficient and scalable solar-driven MOF-based water collection unit: From module design to concrete implementation, Chem. Eng. J., 2023, 465, 142891.
23. S. Li, P. Wu, L. Chen, Y. Tang, Y. Zhang, L. Qin, X. Qin and H. Li, Enhanced atmospheric water harvesting (AWH) by Co-based MOF with abundant hydrophilic groups and open metal sites, J. Water Process Eng., 2024, 58, 104899.
24. W. Gong, X. Chen, M. Wahiduzzaman, H. Xie, K. O. Kirlikovali, J. Dong, G. Maurin, O. K. Farha and Y. Cui, Chiral Reticular Chemistry: A Tailored Approach Crafting Highly Porous and Hydrolytically Robust Metal–Organic Frameworks for Intelligent Humidity Control, J. Am. Chem. Soc., 2024, 146, 2141–2150.
25. G. Hu, Q. Liu and H. Deng, Space Exploration of Metal–Organic Frameworks in the Mesopore Regime, Acc. Chem. Res., 2024, 58, 73–86.
26. L. Li, Z. Shi, H. Liang, J. Liu and Z. Qiao, Machine Learning-Assisted Computational Screening of Metal-Organic Frameworks for Atmospheric Water Harvesting, Nanomaterials, 2022, 12, 159.
27. Z. Zhang, H. Tang, M. Wang, B. Lyu, Z. Jiang and J. Jiang, Metal–organic frameworks for water harvesting: machine learning-based prediction and rapid screening, ACS Sustainable Chem. Eng., 2023, 11, 8148–8160.
28. J. Burner, J. Luo, A. White, A. Mirmiran, O. Kwon, P. G. Boyd, S. Maley, M. Gibaldi, S. Simrod, V. Ogden and T. K. Woo, ARC–MOF: A Diverse Database of Metal-Organic Frameworks with DFT-Derived Partial Atomic Charges and Descriptors for Machine Learning, Chem. Mater., 2023, 35, 900–916.
29. L. Ward, A. Agrawal, A. Choudhary and C. Wolverton, A General-Purpose Machine Learning Framework for Predicting Properties of Inorganic Materials, npj Comput. Mater., 2016, 2, 16028.
30. H. Furukawa, N. Ko, Y. B. Go, N. Aratani, S. B. Choi, E. Choi, A. Ö. Yazaydin, R. Q. Snurr, M. O'Keeffe, J. Kim, et al., Ultrahigh porosity in metal-organic frameworks, Science, 2010, 329, 424–428.
31. S. Suginome, H. Sato, A. Hori, A. Mishima, Y. Harada, S. Kusaka, R. Matsuda, J. Pirillo, Y. Hijikata and T. Aida, One-step synthesis of an adaptive nanographene MOF: adsorbed gas-dependent geometrical diversity, J. Am. Chem. Soc., 2019, 141, 15649–15655.
32. M. Pardakhti, E. Moharreri, D. Wanik, S. L. Suib and R. Srivastava, Machine learning using combined structural and chemical descriptors for prediction of methane adsorption performance of metal organic frameworks (MOFs), ACS Comb. Sci., 2017, 19, 640–645.
33. D. Dubbeldam, S. Calero, D. E. Ellis and R. Q. Snurr, RASPA: molecular simulation software for adsorption and diffusion in flexible nanoporous materials, Mol. Simul., 2016, 42, 101–181.
34. J. L. F. Abascal and C. Vega, A general purpose model for the condensed phases of water: TIP4P/2005, J. Chem. Phys., 2005, 123, 234505.
35. A. K. Rappé, C. Casewit, K. S. Colwell, W. A. Goddard and W. M. Skiff, UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations, J. Am. Chem. Soc., 1992, 114, 10024–10035.
36. C. Campañá, B. Mussard and T. K. Woo, Electrostatic Potential Derived Atomic Charges for Periodic Systems Using a Modified Error Functional, J. Chem. Theory Comput., 2009, 5(10), 2866–2878.
37. L. J. Saunders, R. A. Russell and D. P. Crabb, The coefficient of determination: what determines a useful R² statistic?, Invest. Ophthalmol. Visual Sci., 2012, 53, 6830–6832.
38. F. Pedregosa, et al., Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res., 2011, 12, 2825–2830.
39. S. M. Lundberg and S.-I. Lee, A unified approach to interpreting model predictions, in Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2017, pp. 4768–4777.
40. J. A. Mason, J. Oktawiec, M. K. Taylor, M. R. Hudson, J. Rodriguez, J. E. Bachman, M. I. Gonzalez, A. Cervellino, A. Guagliardi, C. M. Brown, et al., Methane storage in flexible metal–organic frameworks with intrinsic thermal management, Nature, 2015, 527, 357–361.
41. L. Li, H. S. Jung, J. W. Lee and Y. T. Kang, Review on applications of metal–organic frameworks for CO₂ capture and the performance enhancement mechanisms, Renewable Sustainable Energy Rev., 2022, 162, 112441.
42. K. Ravin, P. Sarver, B. Dinakar, L. Palatinus, P. Müller, J. Oppenheim and M. Dincă, High-Connectivity Triazolate-Based Metal–Organic Framework for Water Harvesting, J. Am. Chem. Soc., 2025, 147, 11407–11411.
43. A. J. Rieth, A. M. Wright, G. Skorupskii, J. L. Mancuso, C. H. Hendon and M. Dincă, Record-Setting Sorbents for Reversible Water Uptake by Systematic Anion Exchanges in Metal–Organic Frameworks, J. Am. Chem. Soc., 2019, 141, 13858–13866.
44. N. C. Burtch, H. Jasuja and K. S. Walton, Water stability and adsorption in metal–organic frameworks, Chem. Rev., 2014, 114, 10575–10612.
45. H. Tan and G. Shan, Computational screening and functional tuning of chemically stable metal organic frameworks for I₂/CH₃I capture in humid environments, iScience, 2024, 27(3), 109096.

Footnote

† Equal contribution.
