Xiaoyu Wu and Jianwen Jiang*
Department of Chemical and Biomolecular Engineering, National University of Singapore, Singapore, 117576, Singapore. E-mail: chejj@nus.edu.sg
First published on 4th April 2025
Recent advances in large language models (LLMs) offer a transformative paradigm for data-driven materials discovery. Herein, we exploit the potential of LLMs in predicting the hydrophobicity of metal–organic frameworks (MOFs). By fine-tuning the state-of-the-art Gemini-1.5 model exclusively on the chemical language of MOFs, we demonstrate its capacity to deliver weighted accuracies that surpass those of traditional machine learning approaches based on sophisticated descriptors. To further interpret the chemical “understanding” embedded within the Gemini model, we conduct systematic moiety masking experiments, where our fine-tuned Gemini model consistently retains robust predictive performance even with partial information loss. Finally, we show the practical applicability of the Gemini model via a blind test on solvent- and ion-containing MOFs. The results illustrate that Gemini, combined with lightweight fine-tuning on chemically annotated texts, can serve as a powerful tool for rapidly screening MOFs in pursuit of hydrophobic candidates. Looking ahead, our work underscores the potential of LLMs in offering robust and data-efficient approaches to accelerate the discovery of functional materials.
Over 100,000 MOFs have been experimentally produced; however, many of them were synthesized without reporting their hydrophobicity.4 To streamline the identification of hydrophobic MOFs, the Henry constant of water (KH) was adopted as a metric in a computational workflow5 and benchmarked against representative hydrophobic structures (e.g., ZIF-8, ref. 6). This approach has been integrated with high-throughput computational screening to shortlist promising candidates for various applications (e.g., CO2 capture and removal of toxic gases).7,8 Nevertheless, the combinatorial design space of MOFs is, in principle, infinite owing to the myriad coupling chemistries of building units as well as their underlying topologies. As such, KH calculations to identify hydrophobic MOFs on a trial-and-error basis are exceedingly laborious.9
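For context, KH in such workflows is typically estimated by Widom test-particle insertion (as also noted later in the text). One commonly used form of the estimator, written here for a rigid water probe as a sketch rather than the authors' exact protocol, is

$$K_\mathrm{H} \;=\; \frac{\big\langle \exp\!\left(-\Delta U/k_\mathrm{B}T\right)\big\rangle}{\rho_\mathrm{f}\,R\,T},$$

where ΔU is the interaction energy of a randomly inserted water molecule with the empty framework, ρf is the framework mass density, and the angular brackets denote an average over many trial insertions; a smaller KH signals weaker water affinity and hence stronger hydrophobicity.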
Machine learning (ML) has emerged as a powerful alternative in this regard, as it has already proven indispensable in the design of functional materials.10 One of the most compelling advances in ML is the advent of large language models (LLMs) such as ChatGPT11 and Gemini,12 which have been trained on massive text corpora spanning diverse disciplines.13 These foundation LLMs excel at generating textual responses from simple prompts that, in many instances, are indistinguishable from human writing. Such generative capabilities have unleashed exciting opportunities for digital chemistry including chemical synthesis,14–16 dataset mining,17–19 and pattern recognition.20
One of the fascinating aspects of LLMs lies in their predictive capacity in both forward and inverse chemical discovery, relying solely on chemical language instead of engineered molecular descriptors.21 Although typically pretrained for general purposes, LLMs can markedly enhance their predictive accuracy on chemistry-specific tasks through fine-tuning with domain knowledge, even when starting from lightweight LLMs (i.e., models pretrained with fewer parameters, e.g., 8B or 70B).22 Molecular representations including the simplified molecular-input line-entry system (SMILES)23 and self-referencing embedded strings (SELFIES)24 have facilitated chemical language modeling.25,26 These chemical notations can be further augmented with metal information, thereby capturing both the inorganic and organic constituents of MOFs.27,28 As exemplified in Scheme 1, a typical MOF named Cu-BTC29 can be rendered in augmented SMILES and SELFIES notations, each tokenized into smaller units carrying unique IDs for LLM ingestion. Notably, differently pretrained LLMs may adopt different tokenization schemes; the default tokenizer of Gemini is employed here.
Scheme 1 Tokenized SMILES and SELFIES strings encoding Cu-BTC for Gemini. The tokenization is visualized via LLM-text-tokenization: https://github.com/glaforge/llm-text-tokenization.
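To make the notation in Scheme 1 concrete, the sketch below converts a simplified, hypothetical BTC-linker SMILES into SELFIES and splits both strings into tokens using the open-source selfies package and a common regex-based SMILES tokenizer; the actual augmented strings and the default Gemini tokenizer used in this work differ, so the splits shown are only illustrative.

```python
# Illustrative only: the augmented Cu-BTC SMILES used in the paper differs, and
# Gemini applies its own default tokenizer rather than the splits shown here.
import re
import selfies as sf  # pip install selfies

# Hypothetical, simplified linker-style SMILES (benzene-1,3,5-tricarboxylic acid)
smiles = "O=C(O)c1cc(C(=O)O)cc(C(=O)O)c1"

# SELFIES encoding of the same fragment and its symbol-wise tokens
selfies_str = sf.encoder(smiles)
selfies_tokens = list(sf.split_selfies(selfies_str))

# A common regex-based SMILES tokenizer (bracket atoms, bonds, ring closures, ...)
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|@|=|#|\(|\)|\.|\+|-|/|\\|%\d{2}|\d|[A-Za-z])"
)
smiles_tokens = SMILES_REGEX.findall(smiles)

print("SMILES tokens :", smiles_tokens)
print("SELFIES tokens:", selfies_tokens)
```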
In this work, we embark on the fine-tuning of a cutting-edge LLM, Gemini-1.5,30 leveraging the latest CoRE-2024 database with thousands of “computation-ready” experimental MOFs.28 Particularly, we focus on MOFs from the all-solvents-removed (ASR) subset of single metal and single linker type to fine-tune the base model. For both binary and quaternary classifications of hydrophobicity, the fine-tuned Gemini achieves comparable overall accuracy and notably excels in weighted accuracy—a critical advantage given the class imbalance inherent in the large dataset. Furthermore, we assess the robustness and transferability of the model through moiety masking experiments and a rigorous blind test on distinct MOFs. These findings demonstrate the synergy between LLMs and digital chemistry, shedding light on how Gemini can be harnessed as a useful agent for open questions in the broader physical sciences.
The labeled dataset was split in an 80:20 ratio, with the former as a training set for model development and the latter as a hold-out test set for model evaluation. While the distributions of the individual labels in the training and test sets show no significant bias relative to each other, we noticed a notable imbalance among the labels themselves, with SS being the least populated (Fig. 1b). This is not unexpected given the challenging water-repelling nature of MOFs,32 and it may pose difficulty for model prediction. Generally, class imbalance remains a core challenge in applying ML to diverse research topics in the physical sciences, including MOF discovery33 and photocatalyst design,34 as such tasks are not always exhaustively addressed with solely hand-crafted descriptors.35
Fig. 1 (a) Kernel density estimations of quaternary classification (SS, S, W and SW) versus density. The vertical axis denotes probability. (b) Data distribution in training and test sets.
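As a rough sketch of this data preparation, an 80:20 split and a label-count check might look as follows; the file and column names are placeholders rather than those of the actual CoRE-2024 workflow, and whether the original split was stratified is not stated.

```python
# Sketch only: 'mof_hydrophobicity.csv' and the 'label' column are hypothetical
# stand-ins for a table of hydrophobicity classes derived from K_H. Stratification
# is used here simply to keep the train/test label distributions comparable (cf. Fig. 1b).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("mof_hydrophobicity.csv")  # hypothetical file

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

print(train_df["label"].value_counts())  # inspect class balance in each split
print(test_df["label"].value_counts())
```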
In this work, we fine-tuned the “gemini-1.5-flash-001-tuning” base model in Google AI Studio. The training dataset comprised labeled MOFs, alongside two molecular representations: SMILES and SELFIES, both augmented with inorganic motifs (hereafter referred to simply as SMILES and SELFIES). During fine-tuning, the training data were structured as prompt and response pairs: (“Representation”, “Label”). In this format, SMILES or SELFIES served as prompts and were completed with the corresponding hydrophobicity labels, categorized as [0, 1] and [0, 1, 2, 3] for binary and quaternary classifications, respectively. For computational feasibility and model stability, the model was tuned for 3 epochs with a batch size of 16 and a learning rate of 2 × 10−4 to reach a minimum loss. Owing to the general knowledge stored in Gemini, a response to an out-of-sample MOF prompt may not always yield a clean label. For these MOFs, augmented prompts were utilized, as detailed in Section S2 in the ESI.†
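A minimal sketch of how such prompt–response pairs could be assembled and submitted for tuning with the google-generativeai Python SDK is given below; the exact SDK calls and argument names may differ across versions, the augmented-SMILES format shown is only a stand-in, and the authors performed tuning through the Google AI Studio interface rather than this script.

```python
# Sketch under assumptions: tuning is shown via the google-generativeai SDK's
# create_tuned_model(); argument names may differ between SDK versions, and the
# example SMILES/labels are illustrative placeholders, not the paper's data.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

train_pairs = [
    # ("Representation", "Label") pairs; the augmented-SMILES format is illustrative
    ("[Cu].O=C([O-])c1cc(C(=O)[O-])cc(C(=O)[O-])c1", 0),
]
training_data = [{"text_input": s, "output": str(lbl)} for s, lbl in train_pairs]

operation = genai.create_tuned_model(
    source_model="models/gemini-1.5-flash-001-tuning",
    training_data=training_data,
    epoch_count=3,           # settings reported in the text
    batch_size=16,
    learning_rate=2e-4,
)
tuned = operation.result()   # blocks until tuning finishes

# Query the tuned model with a new, out-of-sample representation
model = genai.GenerativeModel(model_name=tuned.name)
print(model.generate_content("[Zn].c1ccncc1").text)  # hypothetical prompt
```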
The fine-tuned Gemini for the prediction of MOF hydrophobicity was benchmarked against descriptor-based supervised ML models. Global pore descriptors (e.g., pore size) computed via Zeo++37 and the revised autocorrelations (RACs)38,39 were adopted for featurizing MOFs (detailed in Tables S1 and S2†). All baseline models were trained using support vector machine (SVM) classifiers, with hyperparameters {‘C’: [0.1, 1, 10], ‘kernel’: [‘linear’, ‘rbf’]} tuned through 10-fold cross-validation on the same training set used for the fine-tuned Gemini, with the identical hold-out test set and blind test set to ensure a fair comparison.
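The baseline just described can be reproduced in outline with scikit-learn; in the sketch below, X_train/X_test stand in for precomputed Zeo++ pore descriptors (and RACs) and y_train/y_test for the class labels, and the feature scaling step is added as common practice rather than stated in the text.

```python
# Sketch of the descriptor-based baseline: an SVM with the hyperparameter grid
# quoted above, tuned by 10-fold cross-validation. X_train/X_test are assumed
# precomputed Pore (+ RACs) feature matrices; y_train/y_test are the labels.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
pipe = make_pipeline(StandardScaler(), SVC())

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)            # hypothetical, precomputed features/labels
print("best params  :", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```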
Table 1

| Model | Binary classification | | Quaternary classification | |
|---|---|---|---|---|
| | Accuracy^a | F1-score^b | Accuracy | F1-score |
| Gemini (SMILES) | 0.78 | 0.74 | 0.73 | 0.70 |
| Gemini (SELFIES) | 0.73 | 0.73 | 0.71 | 0.67 |
| Pore^c | 0.76 | 0.66 | 0.72 | 0.62 |
| Pore + RACs | 0.77 | 0.71 | 0.75 | 0.64 |

^a For binary classification, the accuracy metric reflects the model's ability to correctly classify instances into two labels (Strong and Weak). For quaternary classification, the accuracy metric aggregates the overall correct four labels (SS, S, W and SW).
^b The weighted F1-score combines precision and recall into a single metric, accounting for class imbalance by weighting each class's F1-score by the proportion of instances in that class.
^c The descriptor-based models were trained using the SVM classifier in scikit-learn, with hyperparameters {‘C’: [0.1, 1, 10], ‘kernel’: [‘linear’, ‘rbf’]} tuned through 10-fold cross-validation.
Descriptor-based ML models were used to benchmark the performance of the fine-tuned Gemini. For effective featurization, we adopted pore descriptors and RACs. Leveraging non-weighted molecular graphs to derive the products or differences of atomic heuristics, RACs have been used to effectively map the chemical space of MOFs, encompassing both linker and metal chemistry.43,44 Combined with pore descriptors, RACs have proven effective and robust in predicting various properties of MOFs.45–49 As illustrated in Table 1, despite being trained on a simple text-based representation such as SMILES, the fine-tuned Gemini has predictive capability comparable with that of the sophisticated descriptor-based models, with slightly lower overall accuracy for quaternary classification. This is expected to a certain extent, as RACs embed both similarities and dissimilarities of atomic properties, thereby better encapsulating chemical topology and connectivity.47 Significantly, our fine-tuned Gemini exhibits commendable weighted accuracy, maintaining a weighted F1-score of 0.70 even as the classification complexity increases from binary to quaternary.

As depicted in Fig. 2a, the fine-tuned Gemini correctly distinguishes most of the “Strong” labels (376 in the bottom-left cell), slightly fewer than Pore + RACs (Fig. S5a†). Conversely, the Pore + RACs model largely fails on the counterparts labelled “Weak”, with only 15 correct predictions and 111 samples overpredicted as “Strong”. Such mislabeling of “Weak” hydrophobic MOFs is more pronounced for quaternary classification (Fig. S5b†). Although quaternary classification offers a finer-grained picture of misclassification across the four labels, the fine-tuned Gemini outperforms in predicting “SS”, “W” and “SW” (Fig. 2b), leading to a higher weighted F1-score. To further evaluate the consistency of the fine-tuning results, we examined the effect of the random state used in the training/test split of the dataset; all splits yielded stable model performance (±0.01), as summarized in Table S3.†

These results position the fine-tuned Gemini as an efficient tool for sorting a large number of MOFs prior to subsequent, time-consuming computations (e.g., first-principles molecular dynamics simulations50) that determine hydrophobicity more precisely. While descriptor-based ML models demonstrate strong predictive capability through advanced feature engineering, they typically necessitate extensive preprocessing or the derivation of specialized, property-specific descriptors that rely heavily on precise atomic positions. In contrast, our LLM-based method exclusively leverages text-based chemical representations (i.e., SMILES/SELFIES) that encode the building units of MOFs. These engineering-free, string-based representations require no structural optimization, thus facilitating rapid screening across a diverse and potentially unexplored topology space without exhaustive structural validation. We should note that our method is not intended to replace direct KH calculations (e.g., via the Widom insertion method); rather, it serves as a rapid surrogate screening tool capable of efficiently narrowing down a large candidate pool. Nevertheless, such an LLM-based model may not be ideal for near-quantitative predictions, where regression models might deliver higher accuracy.51,52
Fig. 2 Confusion matrices on the test set by the fine-tuned Gemini: (a) binary classification and (b) quaternary classification.
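The metrics behind Table 1 and Fig. 2 correspond to standard scikit-learn routines; a minimal sketch, assuming y_true and y_pred hold integer-encoded test-set labels and parsed model responses, is given below.

```python
# Sketch: overall accuracy, class-weighted F1-score and confusion matrix.
# y_true / y_pred are assumed integer labels (0/1 for binary, 0-3 for quaternary)
# parsed from the fine-tuned model's responses.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

overall_acc = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weights by class support
cm = confusion_matrix(y_true, y_pred)                       # rows: true, cols: predicted

print(f"overall accuracy = {overall_acc:.2f}")
print(f"weighted F1      = {weighted_f1:.2f}")
print(cm)
```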
Acquiring high-fidelity data through an experimental or computational approach can be time-consuming and costly. Ideally, a model should maintain data efficiency even when trained on a limited budget of data.53,54 In this context, we fine-tuned Gemini using various training set ratios, ranging from 0.2 to 0.8 of the 2112 total training data points, to assess its data efficiency. For a fair comparison, the ML model with Pore + RACs was also trained on the same classification tasks using the same training data as Gemini. The learning curves on the test set for the fine-tuned Gemini and the ML model are presented in Fig. 3a. Intriguingly, both models demonstrate similar accuracy across a wide range of training set ratios, with predictive performance significantly compromised when trained on fewer than 845 data points (i.e., a 0.4 training set ratio). This effect is markedly amplified for quaternary classification, where the accuracy drops below 0.5. The accuracy of both models saturates at a training set ratio of 0.6, with marginal improvement afterwards. The optimal accuracy scores of 0.78 and 0.73 are achieved by the fine-tuned Gemini for binary and quaternary classifications, respectively. Such an early performance ceiling likely stems from the complexity of hydrophobicity in reticular chemistry, a subtle property governed by an interplay of factors that is difficult to fully encapsulate. Discretizing hydrophobicity into binary or quaternary classification may also introduce discontinuous boundaries for KH that challenge the model's predictive capability.55
Prompted by this data efficiency, we examined the dissimilarity of the feature spaces encoded by Pore + RACs and SMILES, respectively, through t-distributed stochastic neighbor embedding (t-SNE).56 In a t-SNE map, points are spatially arranged such that the closer two points lie, the more similar the corresponding structures are, as described by the encoding fingerprints. The SMILES representation, as text-based input, was tokenized and embedded with the BERT model57 to emulate the latent dimensions captured by the fine-tuned Gemini. As evidenced in Fig. 3b and c, increasing the training set ratio leads to a progressively denser and more saturated feature space. The dense pattern stabilizes beyond a training set ratio of 0.6, highlighting an optimal balance between training set size and predictive performance.

We acknowledge the challenge posed by the SS label in quaternary classification: even with the full training set (ratio of 1.0), it yields several misclassifications in the fine-tuned Gemini predictions. To interpret this difficulty, we examined the chemical space of the trained and tested MOFs labeled as SS and S. As clearly illustrated in Fig. S6,† the chemical space coverage for SS in the training set is significantly diluted compared with that for S. This limited chemical diversity makes it difficult for the fine-tuned Gemini to accurately distinguish closely overlapping chemical syntax, leading to multiple SS-labeled MOFs being misclassified as S in the test set. We anticipate that enriching the chemical diversity, particularly for the underrepresented SS label, could substantially enhance the predictive accuracy of the fine-tuned Gemini for these challenging instances. As such, the fine-tuned Gemini appears to be an effective approach for data-driven discovery of MOFs, as it requires minimal effort in preparing the input (as simple as a string-based representation like SMILES), compared with the laborious feature engineering required for descriptor-based ML models. Despite this simplicity, the fine-tuned Gemini is on par with descriptor-based ML in terms of overall accuracy while achieving higher weighted accuracy, underscoring its potential to balance predictive power with practical implementation.
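For reference, a sketch of how a t-SNE map of BERT-embedded SMILES (as in Fig. 3b and c) can be constructed is shown below; the specific checkpoint, mean pooling, and perplexity are assumptions, as the text states only that a BERT model was used to process the SMILES inputs.

```python
# Sketch: embed SMILES strings with a generic BERT encoder and project to 2D
# with t-SNE. Checkpoint and pooling are assumptions; smiles_list is a
# hypothetical list of the (augmented) SMILES strings under study.
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(strings):
    """Mean-pooled last-hidden-state embedding for each input string."""
    with torch.no_grad():
        batch = tokenizer(strings, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state       # (n, tokens, 768)
        mask = batch["attention_mask"].unsqueeze(-1)    # (n, tokens, 1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

X = embed(smiles_list)                                  # hypothetical SMILES list
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```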
Fig. 4 Schematic illustration of a moiety masking experiment, where Cu is replaced with a <missing> annotation. Color scheme: Cu, brown; O, red; C, grey; H, white.
Unlike a metal identity, a linker chemical moiety may exhibit varied forms in a SMILES representation; we thus adopted the SMILES arbitrary target specification (SMARTS)58 to locate and identify the substructures of the intended linker chemical moieties in SMILES strings. As illustrated in Table 2 and Fig. 5, we considered a list of 9 linker chemical moieties, identified by SMARTS substructure matching using the RDKit package.59 Each of these linker chemical moieties was subsequently subjected to a moiety masking experiment to assess model robustness. As previously discussed, the fine-tuned Gemini maintains both data efficiency and accuracy at a training set ratio of 0.6. To ensure diverse coverage of linker chemistry, the base model was fine-tuned anew using 1268 data points from the full training set (i.e., a training set ratio of 0.6), with the remaining 845 data points combined with the 529 data points from the original test set for out-of-sample predictions. The moiety masking experiments were conducted on the data points correctly predicted in this new out-of-sample test, resulting in 1019 and 962 SMILES representations to be tested for binary and quaternary classifications, respectively.
Table 2

| Moiety | SMARTS | Binary | | Quaternary | |
|---|---|---|---|---|---|
| | | No. of tests | Accuracy | No. of tests | Accuracy |
| Metal | — | 1019 | 0.96 | 962 | 0.92 |
| Aromatic N | [n;a] | 317 | 0.97 | 299 | 0.94 |
| Alkyne | [C#C] | 10 | 1.00 | 10 | 1.00 |
| Imine | [$([CX3]([#6])[#6]),$([CX3H][#6])]=[$([NX2][#6]),$([NX2H])] | 18 | 1.00 | 14 | 1.00 |
| Halogen | [F,Cl,Br,I] | 69 | 0.99 | 63 | 0.95 |
| Ether | [$([CX4]),$([cX3])]O[$([CX4]),$([cX3])] | 75 | 0.97 | 70 | 0.92 |
| Amino | [OX1]=CN | 30 | 1.00 | 22 | 0.87 |
| Enamine | [NX3][$(C=C),$(cc)] | 76 | 0.93 | 71 | 0.89 |
| Ketone | [#6][CX3](=O)[#6] | 16 | 1.00 | 15 | 1.00 |
| Alkene | [C=C] | 59 | 0.92 | 49 | 0.88 |
| Total | — | 1689 | 0.96 | 1576 | 0.92 |
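A minimal sketch of this SMARTS-based masking step with RDKit is given below, using the halogen pattern from Table 2; the string-level masking shown here (replacing matched atoms with a <missing> annotation, as in Fig. 4) is illustrative and not necessarily the authors' exact procedure for augmented SMILES.

```python
# Sketch: locate a moiety via SMARTS substructure matching (RDKit) and mask it
# with a "<missing>" annotation in the prompt string. The halogen pattern is
# taken from Table 2; the example linker is hypothetical.
from rdkit import Chem
from rdkit.Chem import AllChem

halogen = Chem.MolFromSmarts("[F,Cl,Br,I]")
dummy = Chem.MolFromSmiles("*")  # wildcard atom standing in for the masked moiety

def mask_moiety(smiles: str, pattern: Chem.Mol) -> str:
    """Return the SMILES with all atoms matching `pattern` replaced by <missing>."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or not mol.HasSubstructMatch(pattern):
        return smiles
    masked = AllChem.ReplaceSubstructs(mol, pattern, dummy, replaceAll=True)[0]
    Chem.SanitizeMol(masked)
    # RDKit writes the wildcard as '*', which we annotate for the LLM prompt
    return Chem.MolToSmiles(masked).replace("*", "<missing>")

linker = "O=C(O)c1cc(Cl)cc(C(=O)O)c1"  # hypothetical chlorinated linker
print(mask_moiety(linker, halogen))     # the Cl site now reads '<missing>'
```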
As captured in Table 2, across the 10 probed chemical moieties, the fine-tuned Gemini retains the majority of its predictive capacity, achieving a total accuracy of 0.96 for binary classification. For quaternary classification, a slightly lower accuracy of 0.92 is observed, which aligns with the challenging nature of skewed data distributions as discussed in Section 2.1. Among the chemical moieties, masking amino and alkene groups results in the greatest prediction disagreement for quaternary classification, indicating that the fine-tuned Gemini attributes relatively high importance to these groups. However, these moieties have relatively small sample sizes (22 and 49 for amino and alkene groups, respectively), suggesting that the low agreement could partially stem from the limited data diversity of the test set. One plausible interpretation is that functionalities such as amino groups exert a notable influence on MOF hydrophobicity. It is essential to acknowledge that the outcome of moiety masking experiments might vary depending on the dataset utilized for fine-tuning, thus making our conclusions dataset-specific. Nevertheless, these findings imply that the fine-tuned Gemini is capable of withstanding minor information disruptions in text prompts, reflecting its robustness.
Fig. 6 Confusion matrices by the fine-tuned Gemini on 300 randomly selected MOFs from the (a) FSR subset and (b) ION subset.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5ta01139f
This journal is © The Royal Society of Chemistry 2025