Open Access Article
Seunghee Han, Taeun Bae, Junho Kim, Younghun Kim and Jihan Kim*
Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea. E-mail: jihankim@kaist.ac.kr
First published on 7th April 2026
Porous materials such as metal–organic frameworks (MOFs), covalent organic frameworks (COFs), zeolites, and porous carbons play central roles in gas storage, separation, catalysis, and environmental technologies. However, their design and discovery remain resource-intensive, relying heavily on expert intuition and fragmented knowledge distributed across the literature. Recent advances in large language models (LLMs) present new opportunities to accelerate these workflows by integrating scientific text mining, domain reasoning, and experimental planning. In this review, we outline the emerging role of LLMs across the porous materials research ecosystem. We first introduce the foundations of LLMs, followed by a discussion of NLP-based text mining for literature analysis. We then examine LLM adaptation including prompt engineering and fine-tuning, and autonomous research systems from human-in-the-loop to self-driving laboratories. For each domain, we summarize how LLM architectures are integrated with research systems, highlighting their applications, advantages, and limitations. Additionally, we discuss the current challenges of applying LLMs to porous materials, trade-offs between prompt engineering and fine-tuning, the influence of generation parameters such as temperature, and safety considerations in autonomous laboratory systems. Finally, we expect LLMs to advance toward multimodal reasoning, tighter integration with structured knowledge bases, and safer autonomous experimental workflows. Together, these developments suggest emerging LLM-driven paradigms that could transform the conceptualization, design, and synthesis of porous materials.
While traditional ML models excel at numerical property prediction,17,18 they do not directly leverage the vast amount of textual knowledge embedded in the literature. Recent advances in natural language processing (NLP) have enabled the extraction of synthesis procedures, properties, and reaction conditions from unstructured scientific text, forming the basis for text-driven knowledge discovery.19,20 More recently, large language models (LLMs) have expanded this direction by enabling not only information extraction but also reasoning, summarization, decision-making, and multi-step workflow design.21 Unlike previous approaches limited to quantitative structure–property mapping, LLMs enable the integration of unstructured scientific knowledge with formal data representations. This capability expands their role from simple predictive modeling to hypothesis generation and knowledge synthesis in materials discovery.
These methodological advances are now beginning to reshape porous materials research specifically, where LLMs have been applied to tasks ranging from text mining to autonomous experimentation. Notably, most reported applications to date have focused on MOF systems, with more limited exploration of COFs, porous carbons, and zeolites, reflecting the current distribution of available studies. Early applications of LLMs in porous materials primarily focused on text mining, enabling the construction of synthesis databases and automated entity extraction.22 As the field progressed, LLMs increasingly shifted toward generative and decision-making roles, including inverse design, synthesis strategy recommendation, RAG-assisted reasoning, and multi-agent orchestration.23 More recent work integrates LLMs with simulation tools and robotic experimentation, demonstrating early examples of closed-loop discovery pipelines and autonomous laboratory systems.24 Collectively, these developments reflect a shift from passive language comprehension toward active participation in scientific reasoning and experimental planning.25
In this work, we provide a structured overview of how LLMs are being used across the porous materials research ecosystem. We first introduce core concepts and reasoning frameworks including the evolution of LLMs, prompt engineering,26 chain-of-thought reasoning (CoT),27 and retrieval augmented generation (RAG).28 We then examine three major application domains: (1) NLP-based text mining, (2) LLM adaptation through prompt engineering and fine-tuning, and (3) autonomous systems progressing from human-in-the-loop frameworks toward self-driving laboratories (Fig. 1). Furthermore, we discuss current limitations, the trade-offs between prompt engineering and fine-tuning approaches, the role of temperature in the reliability of LLM-driven workflows, and safety considerations in autonomous laboratory systems. Finally, we outline future research directions. Together, these perspectives position LLMs not merely as computational tools but as emerging cognitive systems capable of connecting language, domain reasoning, and experimental execution. As multimodal modeling, structured knowledge representation, and autonomous experimentation advance, LLM-enabled workflows are expected to play an increasingly central role in how porous materials are conceptualized, designed, and synthesized.
This shift opened the door for large-scale models and enabled their rapid expansion into diverse domains,30–32 where their ability to interpret, predict, and generate text offers significant advantages. A key reason for this advancement is the Transformer's self-attention mechanism, which allows the model to examine all tokens in a sentence simultaneously rather than sequentially.29 Through attention scores, Transformers capture long-range context, resolve complex dependencies, and represent nuanced relationships that earlier architectures struggled to encode. In particular, the Transformer's capacity to recognize and encode complex linguistic patterns provides the foundation for how LLMs internalize scientific knowledge from literature and capture the relationships inherent in sequential, symbolic, and domain-specific data.
These Transformer-based models are pre-trained on massive amounts of text to learn linguistic structure and contextual relationships. After broad pretraining, they can be adapted to specialized downstream tasks through several complementary strategies. One such method is fine-tuning, in which a pre-trained model is trained further using curated literature and datasets relevant to specific scientific objectives.33–35 Through this refinement, the model transforms a general-purpose architecture into a task-specialized assistant.
As model scale increased with systems such as GPT-2 and GPT-3, researchers discovered that LLMs could perform new tasks simply by conditioning them with natural-language instructions. This instruction-following capability is known as prompt engineering, a method that guides model behavior through carefully structured prompts without modifying model parameters. Because large pretrained models already encode broad scientific priors, prompt engineering often achieves strong performance without the computational cost of fine-tuning.
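As a concrete illustration, prompt engineering for extraction tasks often amounts to composing a structured instruction around the source text. The sketch below assembles such a prompt in Python; the schema fields, wording, and example paragraph are hypothetical illustrations rather than a published protocol, and a real workflow would send the resulting string to an LLM API rather than printing it.

```python
# Minimal sketch of a prompt-only structured extraction setup.
# All field names and the example paragraph are illustrative assumptions.
def build_extraction_prompt(paragraph: str) -> str:
    """Compose an instruction-style prompt asking an LLM to return
    synthesis parameters as JSON, without any fine-tuning."""
    schema = ('{"metal_precursor": str, "linker": str, "solvent": str, '
              '"temperature_C": float, "time_h": float}')
    return (
        "You are a chemistry assistant. Extract MOF synthesis conditions "
        "from the paragraph below and reply ONLY with JSON matching this "
        f"schema: {schema}\n"
        "If a field is not reported, use null.\n\n"
        f"Paragraph: {paragraph}"
    )

prompt = build_extraction_prompt(
    "Zn(NO3)2·6H2O and terephthalic acid were dissolved in DMF and "
    "heated at 120 °C for 24 h."
)
print(prompt)
```

Because the model's parameters are untouched, the same prompt template can be reused across papers, and behavior is steered entirely through the instruction and output-schema text.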
Complementing prompt-based control are advanced reasoning strategies designed to make the model's internal logic more transparent. Chain-of-Thought (CoT) prompting encourages models to articulate intermediate reasoning steps,27 while Chain-of-Verification (CoV) instructs them to reevaluate their own conclusions by generating and answering verification questions to correct potential errors.36 However, a fundamental limitation remains: these internal strategies fail when intermediate reasoning requires specific domain knowledge not reliably captured during pre-training. In chemistry, this often leads to logical hallucinations where the model generates plausible-sounding but scientifically flawed argumentation for niche reaction mechanisms or complex structural relationships. Building on these methods, the ReAct framework integrates explicit reasoning with action-taking by allowing the model to alternate between thinking and acting, which not only improves task performance but also grounds intermediate reasoning steps in observable outcomes. Whereas CoT emphasizes internal reasoning transparency, ReAct introduces an explicit decision layer that links reasoning to tool use or external interaction. In parallel, the MRKL (Modular Reasoning, Knowledge, and Language) system addresses which module performs the task by routing queries to specialized expert modules or external tools, enabling more structured and reliable reasoning pipelines. The practical utility of these reasoning frameworks is illustrated in recent porous materials research. For instance, ReAct-style strategies enable iterative refinement of experimental workflows through interaction with simulated or real-world feedback, whereas MRKL-inspired architectures facilitate modular routing of queries to specialized computational or database tools.
Rather than relying solely on internal parametric reasoning, such tool-integrated approaches allow intermediate steps to be supported by external calculations and structured operations. Importantly, the suitability of each framework remains inherently task-dependent. ReAct-style approaches are particularly useful when iterative interaction with tools or feedback is required to guide intermediate decisions, whereas MRKL-style architectures are advantageous when problem solving benefits from explicit decomposition into specialized computational or database modules. Detailed implementations of these strategies in porous materials systems are discussed in later sections. More broadly, these strategies are especially valuable in scientific research, where interpreting certain phenomena or structure–property relationships requires multi-step argumentation.37–39
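The Thought → Action → Observation cycle at the heart of ReAct can be sketched in a few lines. In this toy example everything is mocked: `mock_llm` scripts two model turns and `TOOLS` holds a single fake database lookup, so the tool name, the MOF, and the returned value are illustrative assumptions, not real data.

```python
# Minimal ReAct-style loop with a mock "LLM" and a mock tool registry.
# The scripted replies and the pore-volume value are hypothetical.

def mock_llm(history: str) -> str:
    # A real system would call a language model; here we script two turns.
    if "Observation:" not in history:
        return "Thought: I need the pore volume.\nAction: lookup_pore_volume[HKUST-1]"
    return "Thought: I have the value.\nFinal Answer: 0.75 cm3/g"

TOOLS = {"lookup_pore_volume": lambda mof: "0.75 cm3/g"}  # mock database tool

def react_loop(question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}"
    for _ in range(max_steps):
        reply = mock_llm(history)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[1].strip()
        # Parse "Action: tool[arg]" and execute the named tool.
        action = reply.split("Action:")[1].strip()
        name, arg = action.split("[", 1)
        observation = TOOLS[name.strip()](arg.rstrip("]"))
        history += f"\n{reply}\nObservation: {observation}"
    return "no answer"

print(react_loop("What is the pore volume of HKUST-1?"))  # -> 0.75 cm3/g
```

The key design point is that each tool result is appended to the growing transcript as an `Observation:`, so the next model call reasons over externally grounded facts rather than its parametric memory alone.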
In the context of LLM-enabled scientific workflows, the terms verification and validation are often used interchangeably but represent conceptually distinct processes. In this review, verification refers to internal consistency checks within computational workflows, including schema enforcement, reasoning self-checks (e.g., chain-of-verification), and cross-database grounding. Validation, in contrast, denotes the assessment of scientific correctness and real-world applicability, often requiring expert review or experimental confirmation. While earlier sections may have used these terms more loosely, we here adopt a clear distinction between internal verification procedures and external scientific validation, particularly as workflows progress toward autonomous systems.
Despite their capabilities, LLMs are prone to hallucinations, a phenomenon in which the model confidently generates statements that appear plausible but are factually incorrect or unsupported by data. This challenge becomes especially serious in scientific contexts, where experimental conditions or structural descriptors may appear only in isolated publications or supplementary data files.40 Retrieval augmented generation (RAG) was developed in direct response to these limitations.28 By grounding its answers in verified sources, RAG significantly reduces hallucination and improves factual consistency in chemical reasoning. RAG's performance is strictly governed by boundary conditions, such as the quality of the knowledge base and the retriever's precision with domain-specific terminology. Common failure modes in scientific workflows, which often compromise this precision, include retrieval errors, where irrelevant documents are fetched, and integration errors, where the model defaults to internal priors despite the provided context. Addressing these retrieval and integration failures is therefore essential for maintaining faithfulness and robustness in knowledge-intensive chemical discovery.
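The retrieve-then-ground pattern can be reduced to a toy sketch. Here retrieval is approximated by a bag-of-words overlap score over a three-snippet corpus; the corpus sentences and numbers are hypothetical placeholders, and a production system would use dense embeddings, a vector index, and an actual LLM call on the grounded prompt.

```python
# Toy RAG sketch: score snippets by word overlap, then prepend the best
# match to the prompt. Corpus contents are illustrative, not real data.
from collections import Counter
import math

CORPUS = [
    "MOF-5 was synthesized in DEF at 100 C for 24 h.",
    "HKUST-1 shows a BET surface area near 1500 m2/g.",
    "ZIF-8 is stable in water at room temperature.",
]

def score(query: str, doc: str) -> float:
    # Cosine-like overlap between bag-of-words counts.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum((q & d).values())
    return overlap / math.sqrt(sum(q.values()) * sum(d.values()))

def retrieve(query: str, k: int = 1) -> list:
    return sorted(CORPUS, key=lambda doc: score(query, doc), reverse=True)[:k]

def grounded_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using ONLY the context below.\nContext: {context}\nQuestion: {query}"

print(retrieve("What is the BET surface area of HKUST-1?"))
```

Both failure modes named above are visible even in this sketch: a weak scorer can fetch the wrong snippet (retrieval error), and nothing in the prompt forces the model to actually obey the "ONLY the context" instruction (integration error), which is why faithfulness checks remain necessary downstream.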
Together, these techniques form a coherent ecosystem that governs how LLMs acquire domain knowledge, interpret scientific information, and maintain reliability. Given the complex, multiscale datasets, specialized terminology, and highly fragmented reporting across the porous materials community, such reasoning-oriented and retrieval-grounded approaches are increasingly essential.
The rapid evolution of LLMs has produced a diverse landscape of powerful models, including GPT-4, Claude, Llama, Nemotron, Mistral, ChatGLM, Falcon, and DeepSeek.41–49 These systems differ significantly in accessibility: GPT-4 and Claude are proprietary commercial models available through paid APIs, whereas Llama (Meta), Mistral, ChatGLM2-6B, Falcon-7B, Nemotron (NVIDIA), and DeepSeek offer open-source or openly licensed weights that support local deployment and customization. In addition, models like OpenAI's GPT-4V and Qwen-VL introduce multimodal capabilities, enabling the model to interpret and reason over both images and text, thereby broadening its utility across scientific and technical applications.50 Their increased multimodal capacity and stronger reasoning abilities have made them particularly suitable for scientific discovery, where text, numerical descriptors, and experimental data must be interpreted together. For instance, these models can perform visual reasoning on SEM or TEM micrographs to identify morphological defects or pore-filling patterns that are difficult to encode into traditional descriptors. By interpreting these visual cues, LLMs can provide closed-loop feedback for synthesis planning, autonomously suggesting adjustments to reaction conditions based on the visual confirmation of a product's structural integrity. This transition from passive image recognition to active visual reasoning represents a functional capability that traditional ML predictors lack, positioning multimodal LLMs as essential decision-makers in autonomous laboratory environments. 
As these advanced architectures have matured, they have accelerated a broader methodological shift from traditional text mining pipelines to multimodal, retrieval-grounded, and reasoning-aware computational workflows.33–35 This transition reflects a fundamental change: LLMs are no longer tools for passive information extraction but active engines that support hypothesis generation, decision-making, and integrated scientific analysis.
In the following section, we explore how language models have been applied in materials science, followed by an in-depth examination of their applications in porous materials. An overview of representative studies is summarized in Table 1, which includes information such as publication year, material system, model, method, key findings, and limitations. To ensure transparency, the studies in Table 1 were selected via Google Scholar using keywords including ‘LLM,’ ‘text-mining,’ ‘porous material,’ and ‘self-driving lab.’ This selection represents key integration cases, with individual references provided for further detail. In our effort to provide a structured overview in Table 1, we have categorized the limitations of existing studies into standardized themes such as data dependency, limited generalization, prompt sensitivity, reliability and hallucination risks, human-in-the-loop dependence, limited automation and scalability, and scope limitations. However, it is important to note that establishing a perfectly uniform categorization remains challenging at this stage. This difficulty arises primarily from the high degree of heterogeneity in current research, where the lack of standardized reporting for prompt configurations and the frequent use of non-public, proprietary datasets hinder direct cross-study comparisons.
| System | Year | Material | LLM model | Prompt engineering or fine-tuning | Key findings | Limitation |
|---|---|---|---|---|---|---|
| CCA51 | 2023 | MOF | GPT-3.5-Turbo, GPT-4 | Prompt engineering | Prompt-only LLM-based extraction of MOF synthesis data with high accuracy | Data dependency: only for well-formatted text |
| Paragraph2MOFInfo33 | 2024 | MOF | GPT-3.5-Turbo, GPT-4, Mistral-7B, Llama-2, Llama-3, T5, BART | Prompt engineering, finetuning | Fine-tuned LLM-based extraction of MOF synthesis and property information | Hallucination & reliability risk: limited coverage of complex extraction targets |
| GPT-4V Image Mining52 | 2024 | MOF | GPT-4V | Prompt engineering | Multimodal LLM-based mining of MOF characterization data from figures | Prompt sensitivity: strong sensitivity to prompt design and category definition |
| L2M3 (ref. 53) | 2025 | MOF | GPT-3.5-Turbo, GPT-4 | Prompt engineering, finetuning | Data-driven MOF synthesis condition recommendation using LLM-extracted databases | Hallucination & reliability risk: extraction accuracy constrained by LLM performance |
| Porous Carbon Mining54 | 2025 | Porous carbon | ChatGPT 4.0 API | Prompt engineering | Integrated LLM-AutoML framework for inverse design of porous carbons | Hallucination & reliability risk: requirement for post-extraction data curation and verification |
| SYN-COF55 | 2025 | COF | Deepseek-R1 | Prompt engineering | LLM-based COF synthesis extraction and ML prediction with experimental validation | Limited generalization: limited to literature-rich dual-monomer solvothermal COFs |
| NERRE Extractor56 | 2024 | MOF | GPT-3, Llama-2 | Fine-tuning | Fine-tuned LLM extraction of hierarchical and relational scientific information | Hallucination & reliability risk: formatting inconsistency |
| Eunomia57 | 2024 | MOF | GPT-4 | Prompt engineering | ReAct agent-based LLM system for materials data extraction | Prompt sensitivity: reliance on prompts and auxiliary tools |
| MOF-LLM Performance Benchmark58 | 2024 | MOF | Llama2-7B, ChatGLM2-6B, Vicuna-7B, Falcon-7B, Mistral-7B, Marcoroni-7B, Llama2-13B, Vicuna-13B | Prompt engineering | Systematic benchmarking of open-source LLMs for diverse MOF research tasks | Hallucination & reliability risk: insufficient MOF-specific domain knowledge without fine-tuning |
| RetChemQA59 | 2024 | MOF | GPT-4-Turbo | Prompt engineering | Large-scale single- and multi-hop QA benchmark for reticular chemistry | Scope limitation: restriction to literature-grounded question answering rather than material generation |
| LLM-Based Hydrophobicity Predictor60 | 2025 | MOF | Gemini-1.5 Flash | Fine-tuning | Text-based prediction of MOF hydrophobicity using fine-tuned LLMs | Limited generalization: weak for unseen solvent- or ion-containing MOFs |
| MOF Linker Mutation Model61 | 2023 | MOF | GPT-3.5-Turbo | Fine-tuning | Fine-tuned LLM generation of chemically valid MOF linker mutations | Human-in-the-loop dependency: need for human validation of chemical plausibility and synthetic feasibility |
| GPT-4 Reticular Chemist62 | 2023 | MOF | GPT-4 | Prompt engineering | LLM-human collaboration enabling the design of four new isoreticular MOFs | Hallucination & reliability risk: limited capability of GPT-4 in property assessment of MOFs requiring human expertise |
| MOFsyn agent63 | 2025 | MOF | GPT-4o, Deepseek-V3, GLM-4-Flash-250414, Qwen2.5-MAX | Prompt engineering | Stepwise reduction strategy proposed for catalyst performance optimization using LLM | Human-in-the-loop dependency: reliance on manual experimentation |
| OSDA Design64 | 2025 | Zeolite | GPT-4o, GPT-3.5-turbo, Llama 3.1, Llama 3.2, Nemotron-4 | Prompt engineering | OSDA distribution sampling with proposal of new high-affinity candidates | Hallucination & reliability risk: limited LLM capability in synthesizability assessment and synthesis pathway estimation for complex OSDAs |
| SciToolAgent65 | 2025 | MOF | GPT-4o, OpenAI-o1, Qwen2.5-72B | GPT-4: prompt engineering/Qwen: fine-tuning with LoRA | State-of-the-art performance in scientific tool evaluation with large-scale tool orchestration and integrated safety checks | Limited automation & scalability: manual knowledge graph construction and reliance on GPT-4o |
| dZiner66 | 2024 | MOF | GPT-4o, Claude 3.5 Sonnet | Prompt engineering | LLM agent-driven inverse design from properties to structures | Data dependency: oversimplification of complex MOF |
| ChatMOF23 | 2024 | MOF | GPT-4, GPT-3.5-turbo, GPT-3.5-turbo-16k | Prompt engineering | Autonomous MOF search, property prediction, and inverse design enabled by ReAct/MRKL architecture | Prompt sensitivity: token-length limitations, hallucination & reliability risk: occasional reasoning failures, and reduced generative diversity during MOF generation |
| ChatGPT Research Group24 | 2023 | MOF, COF | GPT-4 | Prompt engineering | Integration of 7 AI agents with bayesian optimization for efficient MOF and COF crystallinity optimization | Limited automation & scalability: requiring more advanced robotic platforms |
| MOFGen67 | 2025 | MOF | GPT-4, Llama | Prompt engineering | Modular multi-agent framework combining LLMs, diffusion models, and QM agents with experimental realization of five AI-dreamt MOFs | Human-in-the-loop dependency: for exploring synthetic possibilities |
| Zn-HKUST-1 Green Synthesis68 | 2025 | MOF | ChatGPT | Prompt engineering | Sustainable MOF synthesis optimization via LLM-based planning and high-throughput pipetting robots | Limited automation & scalability, human-in-the-loop dependency: human intervention required in experimental workflows |
At a fundamental level, NLP-based text mining pipelines in materials science can be decomposed into three core processes: text preprocessing, text representation, and information extraction, each of which incorporates a distinct set of computational techniques and models.
Text preprocessing focuses on converting raw, unstructured text into linguistically analyzable units. This step typically includes sentence segmentation, tokenization, part-of-speech tagging, and syntactic or dependency parsing, which together provide a structural foundation for downstream analysis. In materials-oriented workflows, preprocessing further involves the identification of chemically relevant text segments, such as synthesis-related paragraphs, often using lightweight classification models including logistic regression. Auxiliary NLP tools such as ChemicalTagger71 and dependency parsers such as Stanza72 are also employed at this stage to recognize experimental actions and syntactic relationships between entities described in scientific text.
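The preprocessing steps above can be sketched in pure Python. Real pipelines rely on tools such as ChemicalTagger or Stanza for robust parsing; the regexes, keyword list, and example sentences below are deliberately simplified illustrations of segmentation, tokenization, and synthesis-paragraph flagging.

```python
# Simplified preprocessing sketch: sentence segmentation, tokenization,
# and a keyword heuristic for synthesis-related text. The keyword set and
# regexes are illustrative, not from any published pipeline.
import re

SYNTHESIS_KEYWORDS = {"dissolved", "heated", "stirred", "washed", "dried"}

def split_sentences(text: str) -> list:
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence: str) -> list:
    # Split on spaces and strip trailing punctuation only, so chemical
    # tokens like "Zn(NO3)2" survive intact.
    return [tok.rstrip(".,;") for tok in sentence.split()]

def is_synthesis_sentence(sentence: str) -> bool:
    return any(w in SYNTHESIS_KEYWORDS for w in map(str.lower, tokenize(sentence)))

text = "Zn(NO3)2 was dissolved in DMF. The mixture was heated at 120 C for 24 h."
sents = split_sentences(text)
print([is_synthesis_sentence(s) for s in sents])  # -> [True, True]
```

In practice this keyword heuristic would be replaced by a trained classifier (e.g. logistic regression over sentence embeddings, as described above), but the input/output contract of the stage is the same: raw text in, labeled analyzable units out.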
Text representation transforms preprocessed linguistic units into numerical embeddings that can be consumed by machine-learning models. Early materials text mining studies relied on static word embeddings such as Word2Vec,73 whereas more recent approaches increasingly adopt contextual language models, most notably bidirectional encoder representations from transformers (BERT).74 Domain-adapted variants, including PubMedBERT75 further pretrained on materials science corpora, are often used to capture the specialized semantics of chemical terminology and experimental descriptions, providing high-quality representations for downstream extraction tasks.
Information extraction constitutes the central component of NLP-driven text mining, where structured knowledge is derived from text. Named entity recognition (NER) is widely used to identify materials entities such as chemical compounds, precursors, solvents, synthesis conditions, and properties, commonly implemented using sequence-labeling architectures including BiLSTM–CRF76 networks or transformer-based encoders. To complement data-driven models, rule-based techniques such as regular expressions, keyword matching, domain-specific lexicons, and ontology-guided filters are extensively integrated to extract numerical values and synthesis descriptors that exhibit high linguistic variability. Beyond free-text content, table parsing algorithms are employed to extract property values and synthesis parameters reported in tabular formats. Chemistry-aware NLP frameworks such as ChemDataExtractor77 integrate many of these extraction strategies with document structure analysis and chemical entity recognition, enabling scalable and automated information extraction across large and heterogeneous literature collections.
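A rule-based extractor of the kind described above can be illustrated with a few regular expressions and a small solvent lexicon. The patterns and lexicon here are simplified assumptions for demonstration, not those of ChemDataExtractor or any published system.

```python
# Rule-based sketch for pulling numerical synthesis descriptors from a
# sentence. Patterns and the solvent lexicon are illustrative examples.
import re

SOLVENT_LEXICON = {"DMF", "DEF", "ethanol", "methanol", "water"}

def extract_conditions(sentence: str) -> dict:
    temp = re.search(r"(\d+(?:\.\d+)?)\s*(?:°\s*C|C)\b", sentence)
    time = re.search(r"(\d+(?:\.\d+)?)\s*(h|hours?|min|minutes?)\b", sentence)
    solvents = [s for s in SOLVENT_LEXICON if re.search(rf"\b{s}\b", sentence)]
    return {
        "temperature_C": float(temp.group(1)) if temp else None,
        "time": f"{time.group(1)} {time.group(2)}" if time else None,
        "solvents": solvents,
    }

out = extract_conditions("The mixture in DMF was heated at 120 C for 24 h.")
print(out)  # {'temperature_C': 120.0, 'time': '24 h', 'solvents': ['DMF']}
```

Such hand-written rules are brittle against the linguistic variability noted above ("overnight", "reflux", temperature ranges), which is exactly why production pipelines layer them on top of learned NER models rather than relying on regexes alone.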
Collectively, these three processes form a unified pipeline in which statistical learning, deep neural networks, and rule-based heuristics are combined to transform unstructured materials literature into structured, machine-readable datasets. These components support a wide range of NLP applications across materials domains. As an illustrative example, Shetty et al. presented an NLP pipeline for polymer property extraction by developing MaterialsBERT, obtained by further pre-training PubMedBERT on 2.4 million materials science abstracts.78 Applied to polymer literature, the system produced over 300,000 property records from 130,000 documents, demonstrating the scalability of NLP-driven data extraction in materials science. Kononova et al. developed one of the first large-scale NLP pipelines for inorganic synthesis extraction using a BiLSTM–CRF model, dependency parsing with neural networks, keyword-based matching, and a specialized material parser.79 The system produced 19,488 solid-state reactions involving 13,009 targets and 1845 precursors, establishing the first automated large-scale synthesis-pathway dataset. These examples highlight how NLP has been applied in materials science, and in the following section, we take a closer look at its use in porous materials through several representative studies.
Tayfuroglu et al. employed a large-scale text and data mining (TDM) workflow to investigate H2 uptake in MOFs.80 A rule-based NLP pipeline incorporating tokenization and keyword-based extraction was used to collect surface area (SA) and pore volume (PV) information from 58,700 publications, resulting in SA values for 5975 MOFs and PV values for 7481 MOFs. The NLP approach achieved accuracies of 78% for SA and 82% for PV. In parallel, theoretical SA and PV values were computed for 72,000 structures in the Cambridge Structural Database (CSD)81 using Zeo++.82 These theoretical descriptors were integrated with experimentally extracted TDM values to estimate H2 uptake, and the resulting predictions showed good agreement with grand canonical Monte Carlo (GCMC) simulations. Collectively, this study illustrates that rule-based NLP, when combined with structural modeling, offers a scalable and effective framework for data-driven evaluation of hydrogen storage performance in MOFs.
To expand beyond the extraction of numerical descriptors, a subsequent study incorporated chemistry-aware NLP toolkits and additional structural metadata to capture a broader range of synthesis-related information. Glasby et al. developed DigiMOF, an automatically generated database of MOF synthesis information obtained through large-scale text mining (Fig. 2a).83 Using the chemistry-aware NLP toolkit ChemDataExtractor,77 the workflow extracted key synthesis descriptors including synthesis methods, solvents, linkers, and metal precursors from 43,281 MOF-related publications, resulting in 15,501 unique MOFs and 52,680 synthesis-related property records. To enrich the dataset, additional metadata such as topological and geometric features were integrated using CrystalNets84 and Zeo++,82 enabling systematic connections between synthesis conditions and structural characteristics. The pipeline achieved a precision of approximately 77%, offering a robust, large-scale dataset suitable for data-driven studies of MOF synthesis. Overall, this work established a comprehensive digital infrastructure for the extraction and analysis of synthesis data in porous materials research.
Fig. 2 NLP-enabled text mining pipelines for automated data extraction and ML-driven synthesis prediction in MOFs. (a) Large-scale literature mining to extract structured MOF synthesis information. Adapted from ref. 82 and licensed under CC-BY 4.0. (b) Literature-derived databases supporting statistical analysis and crystallization outcome prediction. Adapted from ref. 20. Copyright 2022 American Chemical Society. (c) ML models trained on NLP-extracted data to predict synthesis conditions directly from MOF structures. Adapted from ref. 85 and licensed under CC-BY 4.0.
Building on developments in MOF-focused data extraction, NLP-driven approaches have also been extended to other porous material families. Pan et al. developed ZeoSyn, a comprehensive zeolite synthesis dataset aimed at systematically mapping the large chemical space of zeolites.85 To construct the dataset, the authors implemented an NLP-driven pipeline that integrates table parsing, named entity recognition, regular expressions, and domain-specific keyword matching to extract synthesis parameters such as gel compositions, reaction conditions, inorganic precursors, organic structure-directing agents (OSDAs), and product frameworks from 3096 journal articles. Following extensive manual verification, the resulting dataset comprised 23,961 synthesis routes covering 233 zeolite topologies and 921 unique OSDAs. To illustrate the analytical utility of the dataset, the authors applied SHapley Additive exPlanations (SHAP) to assess how specific synthesis parameters influence the likelihood of forming particular zeolite frameworks. Altogether, this work established a rigorously curated and interpretable resource that supports data-driven investigation and informed design of zeolite synthesis.
Building on parameter-level synthesis datasets such as ZeoSyn, recent work has moved toward extracting complete experimental procedures from text. He et al. developed ZeoReader, an end-to-end information extraction framework for reconstructing structured zeolite synthesis steps directly from the literature.86 Rather than extracting isolated synthesis parameters, ZeoReader models synthesis as event-level sequences composed of modular actions such as add, stir, and crystallize together with associated properties including temperature, duration, pressure, and materials. The framework consists of PDF parsing, a MatSciBERT-based paragraph classifier for identifying synthesis-relevant passages, and a two-stage event extraction model. Action detection is formulated as trigger classification, while property extraction is implemented using a BART encoder–decoder model in which predefined action-specific templates, such as “add material to container at temperature”, are completed by filling the material, container, and temperature slots with text spans from the original sentence. To improve robustness in sentences containing multiple densely packed properties or bracketed quantities, the authors introduced contrastive learning by constructing correctly populated templates as positive samples and partially or incorrectly filled templates as negative samples. An Information Noise-Contrastive Estimation (InfoNCE) loss is used to bring representations of correct templates closer to the original instance and push incorrect ones apart, reducing omission and boundary errors during property extraction. The system achieved 94.06% accuracy for paragraph identification and an F1 score of 74.99% for property extraction. The authors note limitations including the inability to process tables or multimodal content, dependence on a predefined synthesis schema, and difficulty handling unseen action types or cross-sentence dependencies.
Overall, ZeoReader extends NLP-driven extraction in porous materials from parameter aggregation toward structured, step-level synthesis knowledge.
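The InfoNCE objective behind ZeoReader's template filling can be written out in a few lines. The sketch below uses toy 3-dimensional vectors in place of learned BART representations; the embeddings and temperature value are illustrative assumptions chosen only to show the shape of the loss.

```python
# Sketch of an InfoNCE-style contrastive loss: the anchor (source sentence)
# is pulled toward the correctly filled template and pushed away from
# incorrect fillings. Vectors are toy stand-ins for learned embeddings.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, temperature=0.1):
    # L = -log( exp(sim(a,p)/t) / sum_j exp(sim(a,x_j)/t) ),
    # where x_j ranges over the positive and all negatives.
    logits = [dot(anchor, positive) / temperature] + [
        dot(anchor, n) / temperature for n in negatives
    ]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

anchor = [1.0, 0.0, 0.0]          # source-sentence embedding
positive = [0.9, 0.1, 0.0]        # correctly filled template
negatives = [[0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]  # incorrect fillings
loss = info_nce(anchor, positive, negatives)
print(round(loss, 4))
```

When the positive template's embedding is close to the anchor the loss is near zero, and swapping the positive with a negative drives it up sharply, which is precisely the gradient signal that discourages omission and boundary errors during slot filling.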
Park et al. developed a large-scale literature-mined database of MOF synthesis information to enable data-driven prediction of crystallization outcomes (Fig. 2b).20 Their NLP pipeline combined logistic regression for classifying synthesis-relevant paragraphs, a BiLSTM–CRF76 model for named entity recognition of MOF names, precursors, and solvents, and rule-based regular expressions for extracting reaction conditions, including temperature and time. Applying this workflow to 28,565 MOF-related publications yielded 46,701 synthesis records. Because unsuccessful synthesis attempts are rarely reported in the literature, the authors trained a positive-unlabeled learning model to predict MOF crystallinity, termed the crystal score. The model achieved a recall of 83% and successfully distinguished reported amorphous MOFs from their crystalline counterparts in multiple case studies. Overall, this work established a methodological basis for automated, data-driven analysis of MOF synthesis and demonstrated the broader potential of NLP-extracted literature data for predictive materials discovery.
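The positive-unlabeled setting arises because only successful (crystalline) syntheses are reliably reported. The exact PU formulation used by Park et al. is not detailed here; the sketch below shows the classic Elkan–Noto calibration as one standard way such a "crystal score" can be derived, with purely illustrative numbers.

```python
def elkan_noto_correction(scores_labeled_pos, score_unlabeled):
    """Classic Elkan-Noto PU calibration (one standard approach; the
    paper's exact model may differ). A classifier g(x) is first trained
    to separate labeled (reported, crystalline) records from unlabeled
    ones; c = E[g(x) | x labeled] then rescales g(x) into an estimate
    of P(crystalline | x) for an unlabeled record."""
    c = sum(scores_labeled_pos) / len(scores_labeled_pos)
    return min(1.0, score_unlabeled / c)

# Classifier scores g(x) on held-out labeled positives (illustrative):
c_scores = [0.8, 0.9, 0.85]
# An unlabeled record scoring 0.51 maps to a higher "crystal score":
crystal_score = elkan_noto_correction(c_scores, 0.51)
```

The intuition: because unlabeled data hide some true positives, raw classifier scores underestimate P(crystalline | x) and must be inflated by 1/c.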
Whereas this study focused on predicting crystallization outcomes from literature-derived data, further research has examined how NLP-extracted information can support the inference of synthesis parameters from structural inputs. Luo et al. developed a machine-learning framework that utilizes NLP-extracted synthesis data to predict MOF synthesis conditions directly from crystal structures, aiming to move beyond empirical, trial-and-error approaches toward data-driven synthesis design (Fig. 2c).87 Using ChemicalTagger71 for entity recognition and rule-based filtering, the authors extracted synthesis information such as metal source, linker, solvent, additive, temperature, and reaction time from publications linked to Computation-Ready Experimental MOF (CoRE MOF)88 and CSD entries, constructing the SynMOF dataset containing 983 MOFs. Machine learning models were trained using molecular fingerprints of organic linkers together with metal identity and oxidation state to predict synthesis parameters. The resulting models achieved positive predictive performance, with meaningful R2 values for temperature and reaction time prediction and top-three solvent selection accuracy exceeding 90% for single-solvent systems. These results suggest that NLP-derived synthesis data can be systematically integrated with structural information to support data-driven and predictive planning of MOF synthesis.
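The reported "top-three solvent selection accuracy exceeding 90%" corresponds to a standard top-k metric: a prediction counts as correct if the reported solvent appears among the model's k highest-ranked candidates. A minimal sketch, with illustrative solvent data rather than SynMOF entries:

```python
def top_k_accuracy(ranked_predictions, true_labels, k=3):
    """Fraction of samples whose true label appears among the model's
    k highest-ranked candidates (the usual way a top-three solvent
    accuracy is computed; the data below are illustrative)."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)

# Ranked solvent candidates per MOF vs. the solvent reported in the paper:
ranked = [["DMF", "DEF", "water"],
          ["water", "DMF", "MeOH"],
          ["DMF", "EtOH", "DEF"]]
truth = ["DEF", "water", "MeOH"]
acc = top_k_accuracy(ranked, truth, k=3)
```

Top-k evaluation is well suited here because several solvents are often chemically viable for the same MOF, so exact top-1 matching would understate practical usefulness.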
In addition to synthesis prediction, NLP-derived datasets have also been used to examine other material characteristics relevant to practical implementation, including stability. Nandy et al. developed MOFSimplify, a data-driven platform that integrates NLP-extracted experimental stability data with machine learning models to predict the robustness of MOFs.89 Using ChemDataExtractor77 for sentence tokenization and Stanza72 for dependency parsing in combination with regular expressions, the authors mined over 5000 publications associated with the CoRE MOF,88 extracting 2179 solvent removal stability labels and 3132 thermal decomposition temperatures. These data were combined with revised autocorrelation (RAC) descriptors,90 which capture coordination chemistry, and Zeo++82 geometric features representing pore topology. The resulting dataset was used to train artificial neural network models, achieving 76% accuracy for solvent removal stability classification and a mean absolute error of 47 °C for thermal stability prediction. By systematically linking literature-mined stability data to experimentally resolved MOF structures, the study demonstrates that NLP-based workflows can generate large scale, chemically meaningful stability datasets. The MOFSimplify web platform provides open access to these data and models, representing a significant step toward data-driven, automated prediction and design of stable MOFs.
Beyond stability analysis, NLP-extracted data have also been integrated with generative machine-learning models to explore inverse design tasks in porous materials. Jensen et al. demonstrated a data-driven framework for inverse design of OSDAs by applying NLP to the zeolite synthesis literature.91 From more than 5000 reported synthesis routes from 1384 publications, the authors identified relationships among 758 OSDAs and 205 zeolite frameworks. The OSDAs were further characterized using weighted holistic invariant molecular (WHIM) descriptors92,93 to capture shape-matching effects with zeolite cavities. Using this literature-derived dataset, the authors trained a generative recurrent neural network (RNN) conditioned on zeolite topology and gel chemistry to propose new OSDA candidates. The model successfully regenerated known OSDAs and proposed new candidates for frameworks such as CHA and SFW. This study highlights how NLP-extracted synthesis knowledge, when integrated with molecular descriptors and generative machine learning, can enable inverse design in zeolite synthesis.
Although NLP-based text mining has enabled large-scale extraction of synthesis conditions, stability metrics, and structure–property relationships for porous materials, its performance remains constrained by incomplete literature access, heterogeneous reporting styles, and limited ability to capture complex experimental variables. As highlighted in recent quantitative analyses, variations in table structures, variable-based expressions, and inconsistent unit or identifier usage can substantially affect extraction performance.94 These factors introduce noise, reduce generalizability, and limit the predictive accuracy of downstream machine learning models. Moreover, commonly reported exact-match accuracy metrics are sensitive to formatting differences, normalization procedures, and multiple valid chemical representations, and therefore should be interpreted in the context of their original evaluation protocols. The level of acceptable accuracy may also vary depending on the intended downstream application. Despite these challenges, existing studies demonstrate that rule-based and classical NLP workflows can compile chemically meaningful datasets at scale. Building on these foundations, recent advances in LLMs offer the potential to overcome many of these limitations by improving entity recognition, contextual understanding, and extraction of experimental information.
In inorganic synthesis, both Schrier et al.97 and Kim et al.98 fine-tuned GPT-based models for synthesis prediction tasks. Schrier et al. demonstrated that fine-tuned GPT-3.5/4 models can predict the synthesizability and suitable precursors of inorganic compounds, achieving performance comparable to specialized graph-based machine learning models. Similarly, Kim et al. employed a lightweight fine-tuning of GPT-4o mini to predict the synthesizability of inorganic crystal polymorphs, further introducing an explainable framework where LLM-generated natural-language reasoning revealed key compositional and structural determinants of synthetic feasibility.
In addition to text-only reasoning, recent studies have begun to benchmark the multimodal capabilities of LLMs. MaCBench benchmarked the ability of vision–language models (VLMs) to process images, tables, and spectra in chemistry and materials contexts.99 While models like Claude 3.5 and GPT-4V showed strong performance in equipment recognition and simple data extraction, they still exhibited limitations in spatial reasoning and multi-step inference, which are essential for tasks such as spectral interpretation and crystal structure analysis.
Together, these studies demonstrate that LLMs are rapidly becoming integral components of chemical and materials research pipelines, supporting tasks ranging from literature-driven data extraction to interpretable molecular discovery, synthesis planning, and multimodal scientific reasoning.
As an initial demonstration, Zheng et al.51 developed the ChatGPT Chemistry Assistant (CCA) for mining MOF synthesis information from unstructured text. By employing domain-specific prompt templates (ChemPrompt Engineering), ChatGPT autonomously performed text filtering, paragraph classification, and synthesis-parameter summarization, which were previously implemented through manually coded NLP pipelines. The overall literature-to-database workflow implemented by the CCA, including human preselection, paragraph classification, and prompt-guided synthesis-condition summarization, is schematically illustrated in Fig. 3a. Across 228 representative MOF publications, the CCA extracted over 26,000 synthesis parameters (metal sources, organic linkers, solvents, temperatures), achieving 90–99% precision and recall. The extracted data further enabled a supervised machine-learning model to predict crystallization outcomes with accuracy exceeding 87%. This work demonstrated the feasibility of using general-purpose conversational LLMs to perform chemical information extraction at near-human reliability, providing a reproducible workflow for literature-to-database conversion. The authors also examined hallucination behavior, identifying two primary error modes: fabrication of plausible synthesis conditions for non-existent MOFs through name-based pattern extrapolation, and incorrect factual associations such as misidentification of a MOF's metal center. To address these issues, the prompting framework was designed to reduce overconfident responses by allowing explicit abstention when information was uncertain and by constraining answers to a curated synthesis dataset derived from the literature. However, the extraction performance strongly relies on clearly structured or semi-structured experimental descriptions, and the method remains limited in handling loosely written narrative text or heterogeneous reporting styles without explicit human preselection.
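The abstention-based prompting strategy can be sketched as a template builder. The wording below is hypothetical, not the authors' actual ChemPrompt template; it only illustrates the two safeguards described above: grounding answers in the supplied text and permitting an explicit "N/A".

```python
def build_extraction_prompt(paragraph):
    """Assemble an extraction prompt in the spirit of ChemPrompt
    Engineering (wording is hypothetical, not the paper's template):
    constrain the answer to the given text and allow explicit abstention
    to curb hallucinated synthesis conditions."""
    return (
        "You are extracting MOF synthesis conditions.\n"
        "Rules:\n"
        "1. Use ONLY information stated in the paragraph below.\n"
        "2. If a field is not stated, answer 'N/A' - do not guess.\n"
        "3. Return one row per compound: "
        "compound | metal source | linker | solvent | temperature | time\n\n"
        f"Paragraph:\n{paragraph}"
    )

prompt = build_extraction_prompt("ZIF-8 was synthesized from Zn(NO3)2 ...")
```

Templates of this kind are cheap to iterate on, which is why prompt engineering was sufficient here without any fine-tuning.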
Fig. 3 Representative examples of LLM-driven literature mining systems for porous materials. (a) A prompt-guided workflow that performs synthesis-condition extraction and tabular summarization from unstructured scientific text. Adapted from ref. 51. Copyright 2023 American Chemical Society. (b) A fine-tuned Named Entity Recognition and Relation Extraction (NERRE) model that produces structured, relational scientific records aligned with hierarchical schemas. Adapted from ref. 55 and licensed under CC-BY 4.0.
Beyond prompt engineering, Zhang et al.33 fine-tuned GPT-3.5 and open-source models (LLaMA-3, Mistral) for chemistry-specific text mining tasks, including compound and reagent recognition, reaction-role classification, and MOF-specific synthesis information extraction (Paragraph2MOFInfo). Notably, strong performance was achieved using only a few hundred manually annotated paragraphs, with exact-match accuracy exceeding 80% for the MOF-specific extraction task. The authors acknowledge that LLMs may hallucinate by generating outputs inconsistent with established chemical knowledge, and show that supervised fine-tuning substantially reduces such unintended generations compared to prompt-only approaches. These results demonstrate that task-specific domain adaptation through fine-tuning markedly enhances consistency, reduces hallucination, and minimizes the need for extensive prompt engineering. Together, these advances establish the foundation for large-scale, autonomous chemical text mining and data-driven innovation grounded in experimental literature. Nevertheless, the scope of extractable information remains constrained by task-specific schemas and annotated training data, limiting direct transferability to unseen synthesis variables or broader classes of porous materials.
While textual information conveys much of the experimental context, crucial experimental data in porous-materials research often reside in figures such as adsorption isotherms, PXRD diffractograms, TGA curves, and microscopy images. Zheng et al.52 demonstrated a vision-enabled large language model (GPT-4V) for multimodal data mining in reticular chemistry. Using natural-language prompts, GPT-4V analyzed 6240 images from 346 MOF papers and successfully classified and, to a substantial extent, interpreted key figure types including nitrogen isotherms, powder X-ray diffraction (PXRD) patterns, thermogravimetric analysis (TGA) curves, and structural diagrams. The model achieved classification accuracies above 94% (F1 ≈ 93–95%) for major categories and was able to infer contextual attributes such as gas type, temperature, and thermal stability trends. However, the reported errors were not entirely random. Misclassification occurred recurrently in visually ambiguous or broadly defined categories. In particular, IR and NMR spectra were sometimes incorrectly identified as gas sorption isotherms within the “other isotherm” category, reflecting structural similarity between line-plot formats. On pages containing multiple plot types, partial omissions were also observed, where one coexisting figure type (e.g., TGA alongside PXRD) was missed. In addition, tasks requiring visual inference such as identifying hysteresis behavior or estimating saturation plateaus from adsorption curves showed lower accuracy than extraction of explicitly annotated textual values. Although the study did not systematically quantify hallucination instances, the authors explicitly incorporated prompt-level safeguards to minimize unsupported generation, instructing the model to rely strictly on information present in the page image and to return “N/A” when relevant data were absent or ambiguous.
This design aimed to reduce hallucination-type errors, particularly fabrication of non-existent numerical or contextual information. This integration of visual and textual modalities represents a critical step toward fully digitalized experimental knowledge extraction, enabling autonomous retrieval of structure–property information from both narrative and graphical sources. Despite its strong performance, the accuracy of GPT-4V remains sensitive to prompt formulation, and misclassification was more frequently observed for visually ambiguous or broadly defined figure categories, while precise quantitative value extraction from plots remains challenging.
In some cases, literature-mined data have been further exploited to enable data-driven studies, rather than remaining as static databases. Kang et al.53 developed L2M3 (Large Language Model MOF Miner), a large-scale autonomous pipeline designed to extract and standardize textual and tabular information from MOF literature. Processing over 40,000 publications, L2M3 integrates specialized agents for table parsing, synthesis-condition recognition, and property extraction under a central controller, yielding a structured database of 32 properties and 21 synthesis categories linked to experimental entries in the CSD. The pipeline achieved F1 scores above 0.9 for extraction. The authors explicitly acknowledge the potential for hallucination and inconsistent extraction in LLM-based workflows and implement mitigation strategies including multi-stage agent chaining, structured JSON-constrained outputs, temperature control, and metadata cross-checking with the CSD to minimize fabricated or misassigned information. Leveraging this curated synthesis-condition dataset, the authors further developed a synthesis condition recommender system that suggests plausible reaction conditions based on given synthesis conditions, demonstrating how literature-mined data can be transformed into an active, data-driven tool for guiding MOF synthesis. The fine-tuned recommender achieved a median recommendation score of ∼0.83, significantly outperforming prompt-based and rule-based baselines. However, because extraction quality ultimately depends on the underlying LLMs, residual errors and inconsistencies may propagate across the multi-agent pipeline, particularly at large scale, necessitating continued verification and post-processing for high-confidence applications.
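Two of the mitigation strategies mentioned above, JSON-constrained outputs and metadata cross-checking, amount to validating each model response before it enters the database. A minimal sketch of such a guard follows; the field names and checks are hypothetical, not L2M3's actual schema.

```python
import json

REQUIRED_FIELDS = {"mof_name", "metal_source", "solvent", "temperature_C"}

def validate_record(raw_llm_output, known_csd_names):
    """Reject malformed or unverifiable extractions before database
    insertion (a sketch of the kind of guard an L2M3-style pipeline
    applies; field names are hypothetical). Returns (record, errors)."""
    try:
        record = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return None, ["output is not valid JSON"]
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS - record.keys()]
    # Metadata cross-check: the extracted MOF must match a CSD entry.
    if record.get("mof_name") not in known_csd_names:
        errors.append("mof_name not found in CSD metadata")
    return (record if not errors else None), errors

good = ('{"mof_name": "HKUST-1", "metal_source": "Cu(NO3)2", '
        '"solvent": "DMF", "temperature_C": 85}')
rec, errs = validate_record(good, {"HKUST-1", "MOF-5"})
```

Chaining such checks between agents is what keeps fabricated or misassigned values from silently propagating through a multi-agent pipeline.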
In addition, Hu et al.54 demonstrated that LLM-derived knowledge can be converted into machine-readable datasets suitable for downstream analysis. In their study, the ChatGPT-4 API was used to extract and standardize synthesis parameters, pore characteristics, elemental composition and CO2 uptake values for porous carbon materials from unstructured text, thereby establishing a structured experimental database of porous carbon adsorption data composed of over 10,000 individual entries. The resulting structured dataset was later used in an AutoML framework to explore synthesis–performance relationships. Although the optimization and design steps were conducted by conventional machine-learning models rather than the LLM itself, the study demonstrates how LLM-based extraction can act as the data-standardization layer enabling automated modeling, trend identification and hypothesis generation, representing a transitional stage toward fully autonomous research pipelines. At the same time, the authors note that LLM-extracted datasets may contain redundancy or missing entries, requiring prior evaluation and manual verification, and that experimental validation remains essential to confirm trends inferred from the automatically generated data.
Recently, LLM-assisted data mining has been extended to covalent organic frameworks (COFs) for synthesis-condition prediction. Zhao et al. constructed a COF synthesis database (SYN–COF) by using the large language model DeepSeek-R1 to extract monomer identities, reaction temperatures, times, and solvent systems from 609 literature sources, yielding 587 curated solvothermal entries.55 Using prompt engineering, DeepSeek-R1 achieved 97.19% extraction accuracy on manually evaluated samples, outperforming a BERT-CRF model by approximately 14% while requiring no annotated training data and demonstrating over 20-fold higher efficiency than manual extraction. The extracted data were encoded via SMILES-derived molecular fingerprints and used to train multiple ML models, with XGBoost achieving R2 = 0.88 for temperature and R2 = 0.47 for reaction-time prediction. To account for the multiplicity of viable synthesis regimes, additional classification models were constructed for discretized parameter ranges and common solvent combinations. The predicted conditions were experimentally validated through the successful synthesis of a previously unreported imine-linked COF (BPQD-TPDA) under model-recommended conditions (119 °C, 90 h, o-DCB/n-BuOH), whose crystallinity and microporosity were confirmed by PXRD and nitrogen sorption analyses. However, the framework is largely confined to frequently reported dual-monomer solvothermal systems, reflecting the limited availability and uneven distribution of reported COF synthesis data, particularly for novel linkage chemistries.
In addition to direct value extraction, several studies have explored the extraction of relational and structured knowledge from materials literature. Dagdelen et al.56 developed an end-to-end joint named entity recognition and relation extraction (NERRE) framework by fine-tuning GPT- and LLaMA-based architectures to automatically capture hierarchical relationships among entities in materials literature. The assisted annotation and fine-tuning workflow for joint entity and relation extraction is schematically illustrated in Fig. 3b. One benchmark task focused on MOFs, where models were trained on several hundred abstracts to identify MOF names, chemical formulae, guest species, applications, and descriptive attributes, and to organize these entities into a predefined JSON-based hierarchical schema. This approach effectively generated structured records such as MOF-guest-application triads, providing a relational view of MOF chemistry and function that is difficult to achieve with rule-based or BERT-like pipelines. The fine-tuned LLMs demonstrated strong performance (e.g., F1 ≈ 0.57 for name–application and 0.62 for name–guest relations) and showed the ability to normalize and correct chemical entities automatically. The authors explicitly acknowledge hallucination as a limitation, noting that the model may generate or infer chemical names or formulae not explicitly present in the input text. Although such inferences can be chemically plausible, they are considered inappropriate for strict information extraction, as extracted entities should be directly grounded in the source passage. For porous materials research, such structured and relation-aware representations provide a foundation for building knowledge graphs linking literature-derived insights, thereby supporting large-scale data integration, semantic search, and data-driven hypothesis generation. 
Nonetheless, the method occasionally produces formatting inconsistencies and hallucinated relations not explicitly supported by the source text, indicating the continued need for schema verification and human oversight in high-stakes applications.
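The hierarchical, relation-aware records produced by NERRE-style extraction can be pictured as nested JSON rooted at the MOF entity. The record below is hypothetical and its field names are illustrative, not the paper's exact schema; flattening it yields the MOF–guest–application triads mentioned above.

```python
import json

# A hypothetical record in the spirit of the NERRE hierarchical schema:
# the root entity (the MOF) owns nested lists of related entities
# (field names are illustrative, not the published schema).
record = {
    "mof_name": "MOF-5",
    "formula": "Zn4O(BDC)3",
    "guest_species": ["CO2", "N2"],
    "applications": ["gas storage", "carbon capture"],
    "description": ["microporous", "cubic topology"],
}

def to_triads(rec):
    """Flatten one hierarchical record into (MOF, guest, application)
    triads suitable for knowledge-graph ingestion."""
    return [(rec["mof_name"], g, a)
            for g in rec["guest_species"]
            for a in rec["applications"]]

triads = to_triads(json.loads(json.dumps(record)))  # JSON round-trip
```

The nesting is the key difference from flat NER tagging: relations are encoded by containment, so no separate relation classifier is needed at decoding time.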
Rather than focusing solely on relation extraction, LLMs have been applied to more complex reasoning tasks that require evidence aggregation and verification. Ansari and Moosavi57 proposed Eunomia, an autonomous chemistry agent that advances LLM-based extraction from static text parsing to dynamic reasoning and verification. Eunomia uses an LLM in a ReAct-style agent setup, allowing the model to think step by step and decide when to search documents or verify its own answers. This design makes it possible to handle multi-step information extraction and to reason over entire papers without fine-tuning. Importantly, the authors explicitly discuss hallucination as a key limitation of LLM-based systems, defining it as the generation of unsupported or fabricated information, and incorporate a Chain-of-Verification (CoV) module to re-examine extracted evidence before producing final outputs, thereby reducing ungrounded or incorrect content. When benchmarked on three information-extraction tasks of increasing complexity (solid-state doping relations, MOF chemical-formula and guest-species identification, and MOF water-stability classification), Eunomia achieved zero-shot performance comparable to or exceeding fine-tuned LLMs. For MOF formula extraction, Eunomia increased the F1 score from 0.424 (fine-tuned baseline) to 0.606, while showing high recall (0.923) in guest-species identification. In the most challenging task of MOF water-stability classification at the full-paper level, Eunomia achieved a ternary accuracy of 0.91 with an information recovery yield of 86.2%. This work highlights a broader shift from fine-tuned or prompt-engineered models toward tool-augmented, self-verifying agents for materials information extraction, demonstrating the potential for more accurate and scalable database generation from the literature.
Despite these advantages, Eunomia's performance remains highly dependent on clear task decomposition and prompt design, and the added system complexity introduced by multi-step reasoning and tool usage may affect robustness in the absence of carefully engineered guidance.
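The ReAct-plus-verification control flow can be sketched as a small loop: the model alternates Thought → Action → Observation, and a final verification pass re-examines the answer before it is returned. The stub `llm` and `tools` below are caller-supplied stand-ins, not Eunomia's implementation.

```python
def react_agent(question, llm, tools, max_steps=5):
    """Minimal ReAct-style loop with a final verification pass, in the
    spirit of Eunomia (the llm/tools are stubs, not the real system).
    The model alternates Thought -> Action -> Observation until it
    emits an answer, which is then re-checked before being returned."""
    scratchpad = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(scratchpad)                  # thought + chosen action
        if step["action"] == "answer":
            # Verification pass: re-examine the evidence for the answer.
            verdict = llm(scratchpad + f"\nVerify: {step['value']}")
            return step["value"] if verdict["action"] == "confirm" else None
        obs = tools[step["action"]](step["value"])  # e.g. document search
        scratchpad += f"\nAction: {step['action']}\nObservation: {obs}"
    return None

# A scripted stub LLM that searches once, answers, then confirms:
script = iter([
    {"action": "search", "value": "MOF-5 water stability"},
    {"action": "answer", "value": "unstable"},
    {"action": "confirm"},
])
answer = react_agent("Is MOF-5 water stable?",
                     lambda _ctx: next(script),
                     {"search": lambda q: "MOF-5 degrades in moisture."})
```

Because every tool call appends its observation to the scratchpad, the verification step sees the full evidence trail rather than only the final answer, which is what distinguishes this pattern from single-shot prompting.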
To assess this expanded functional scope, several studies have focused on systematically evaluating the capabilities of LLMs across a broad range of MOF-related research tasks. Bai et al.58 systematically evaluated six open-source LLMs, including LLaMA2-7B, ChatGLM2-6B, and Falcon-7B, across a comprehensive suite of MOF-related tasks such as chemistry knowledge, MOF database reading, experiment design, computational script generation, data analysis, and property prediction (Fig. 4a). Their results showed that moderate-sized models (6–7 billion parameters) demonstrated reasonable understanding of domain-specific concepts and could generate usable experimental designs and simulation inputs with performance comparable to GPT-3.5 in several qualitative and semi-structured tasks, including MOF knowledge recall, database querying, and the generation of experimental designs and computational scripts. Among the evaluated models, LLaMA2-7B and ChatGLM2-6B consistently exhibited the most balanced performance across these tasks, combining reliable domain understanding with moderate computational requirements. Despite generally strong performance in knowledge retrieval and research-assistance tasks, the evaluated models exhibited limited MOF-specific depth, showed constrained reliability in property-related reasoning, and often generated experimentally plausible but insufficiently specific suggestions without domain-specific fine-tuning. This study provided a systematic comparison of multiple open-source models and offered practical guidance for improving their fine-tuning and domain adaptation in future porous-materials research.
Fig. 4 Expansion of LLM applications in porous materials. (a) Evaluation of open-source LLM capabilities for chemistry- and MOF-focused tasks. Adapted from ref. 57. Copyright 2024 American Chemical Society. (b) LLM-assisted MOF linker mutation workflow with experimental validation. Adapted from ref. 60. Copyright 2023 American Chemical Society.
To support deeper reasoning and knowledge-grounded workflows, recent efforts have focused on constructing large-scale question–answer corpora that encapsulate reticular-chemistry knowledge in a machine-interpretable form. Rampal et al.59 introduced RetChemQA, a large-scale benchmark dataset designed to evaluate the reasoning and comprehension abilities of LLMs in reticular chemistry. RetChemQA comprises around 90,000 single- and multi-hop question–answer pairs generated from approximately 2500 MOF-related publications using GPT-4-Turbo. The dataset spans factual, reasoning, and true/false question types, enabling fine-grained assessment of model understanding across scientific tasks. In evaluating model reliability, the authors explicitly address hallucination by defining it as the generation of out-of-context Q&A pairs and introduce quantitative metrics, including hallucination rate and hallucination capture rate, to systematically evaluate and analyze such behavior. Moreover, it provides a foundation for developing automated prompt optimization frameworks such as DSPy,100 facilitating iterative improvement of LLM performance without manual intervention. This work establishes a shared benchmark for evaluating complex reasoning in reticular chemistry. Although RetChemQA provides a large-scale, reasoning-oriented QA corpus, the generated question–answer pairs and downstream model outputs still require human validation, particularly for synthesis feasibility and structural correctness.
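One plausible formalization of these two metrics is sketched below; the paper's exact definitions may differ, so treat the formulas and the boolean judgments as illustrative assumptions: the hallucination rate as the fraction of generated Q&A pairs judged out of context, and the capture rate as the fraction of those that an automatic detector flags.

```python
def hallucination_metrics(qa_pairs):
    """A plausible formalization of RetChemQA-style metrics (the
    published definitions may differ). Each QA pair carries two boolean
    judgments: `out_of_context` (reference label) and `flagged`
    (automatic detector output).
    - hallucination rate: fraction of pairs that are out of context
    - capture rate: fraction of out-of-context pairs the detector caught
    """
    halluc = [p for p in qa_pairs if p["out_of_context"]]
    rate = len(halluc) / len(qa_pairs)
    capture = (sum(p["flagged"] for p in halluc) / len(halluc)) if halluc else 1.0
    return rate, capture

pairs = [
    {"out_of_context": False, "flagged": False},
    {"out_of_context": True,  "flagged": True},
    {"out_of_context": True,  "flagged": False},
    {"out_of_context": False, "flagged": False},
]
rate, capture = hallucination_metrics(pairs)
```

Separating the two quantities matters: a low hallucination rate says the generator is clean, while a high capture rate says that whatever slips through can still be filtered automatically.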
Beyond enabling knowledge representation and reasoning workflows, an important question is whether such language-based learning can be extended to structure–dependent property prediction in porous materials. Wu and Jiang60 presented one of the first demonstrations of applying a fine-tuned general-purpose LLM (Gemini-1.5) to predict the hydrophobicity of MOFs. In their framework, MOF structures were represented as chemical strings (SMILES and SELFIES, Self-Referencing Embedded Strings) and used to fine-tune Gemini-1.5 as a supervised end-to-end classifier, enabling the model to learn latent chemical language patterns of structural motifs. Importantly, the fine-tuned Gemini directly outputs hydrophobicity class labels from symbolic MOF representations, without relying on external feature engineering or downstream machine-learning predictors. The fine-tuned Gemini achieved a weighted accuracy of up to 0.78 for binary classification and 0.73 for quaternary classification, outperforming descriptor-based SVM models built on pore and RAC features, which reported weighted accuracies on the order of 0.75 (binary) and 0.70 (quaternary). The model further retained robust performance even under moiety-masking (partial-input) conditions. This study highlights that with minimal domain-specific retraining, LLMs can infer physicochemical properties directly from symbolic chemical representations, bridging the gap between text-based learning and quantitative materials prediction. However, the model exhibited reduced predictive performance when applied to solvent- or ion-containing MOFs outside the training distribution, highlighting challenges in out-of-distribution generalization.
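Supervised fine-tuning of this kind reduces to assembling (input string, class label) pairs in the provider's expected format. The layout below is illustrative; the exact prompt format used with Gemini-1.5, and the SMILES/label values, are assumptions for the sketch.

```python
import json

def make_finetune_examples(labeled_mofs):
    """Turn (SMILES, label) pairs into prompt/completion records of the
    kind used to fine-tune a general-purpose LLM as an end-to-end
    hydrophobicity classifier (format illustrative, not the exact
    scheme used with Gemini-1.5)."""
    return [{
        "prompt": f"MOF linker SMILES: {smiles}\nHydrophobicity class:",
        "completion": " " + label,
    } for smiles, label in labeled_mofs]

# Illustrative linker SMILES with hypothetical class labels:
data = [("O=C(O)c1ccc(C(=O)O)cc1", "hydrophilic"),
        ("Cc1cc(C)c(C(=O)O)c(C)c1", "hydrophobic")]
jsonl = "\n".join(json.dumps(e) for e in make_finetune_examples(data))
```

Because the label is emitted as ordinary text, the same pipeline extends to multi-class settings (e.g., the quaternary classification above) simply by widening the label vocabulary.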
In addition to predicting material properties, LLMs can also actively contribute to the design and synthesis of entirely new materials. Zheng et al.61 introduced an LLM-based generative design framework for MOF linker mutation, coupling data curation, model fine-tuning, and experimental synthesis for water-harvesting (Fig. 4b). Using a curated dataset of 3943 linker-editing examples covering four mutation categories, including substitution, insertion, replacement, and positioning, the fine-tuned model achieved significantly higher accuracy (84.8%) and recall (93.9%) in generating valid chemical structures compared with the base GPT-3.5 and GPT-4 models. The authors further note that base models sometimes produced hallucinated SMILES strings that were syntactically plausible but chemically invalid or inconsistent with the specified mutation instructions, highlighting the need for task-specific fine-tuning. The model proposed new linker variants predicted to enhance water-harvesting performance, which were subsequently synthesized into the LAMOF series (LAMOF-1 to LAMOF-10). These MOFs feature heteroatom-substituted linkers and demonstrate record water uptake (up to 0.64 g g−1) with tunable humidity response (13–53% RH). This study provided a practical demonstration of LLM-driven reticular chemistry, where fine-tuned language models can act as AI co-designers that accelerate the generation and synthesis of functionally enhanced, synthetically feasible MOFs. However, the model can generate new structures only within the space defined by combinations of linker-editing rules represented in the training data.
These LLM-driven automation strategies are being actively adopted across diverse materials and chemistry domains, suggesting broad applicability and the potential for transformative impact. In the field of quantum chemistry, significant progress has been made in democratizing access to sophisticated computational tools. Gadde et al. introduced AutoSolvateWeb,105 a chatbot-assisted platform that employs Google Dialogflow CX106 to guide non-expert users through multistep quantum mechanical/molecular mechanical (QM/MM) simulations of explicitly solvated molecules, while operating on cloud infrastructure to completely eliminate hardware configuration barriers. Zou et al. developed El Agente Q, an LLM-based system with a hierarchical multi-agent architecture in which specialized agents collaboratively handle dynamic task decomposition, adaptive tool selection, and post-analysis in quantum chemistry.107 El Agente Q reported an 87% task success rate in university-level quantum chemistry benchmarks, thereby enabling users to execute complex workflows from natural language prompts without external intervention.
In experimental synthesis, the integration of LLMs with robotic laboratory systems has demonstrated increasing levels of autonomous operation. Song et al. introduced ChemAgents, a hierarchical multiagent-driven robotic AI chemist.108 It is powered by an LLM (Llama-3.1-70B) that coordinates four specialized agents: the Literature Reader, Experiment Designer, Computation Performer, and Robot Operator. A key achievement was the discovery and optimization of high-performance metal–organic high-entropy catalysts (MO-HECs) for the oxygen evolution reaction (OER). Similarly, Huang et al. reported a natural-language-interfaced robotic platform for inorganic materials that translates synthetic procedures directly into executable operations.109 This platform autonomously synthesized 13 compounds across four material classes: coordination complexes, MOFs, nanoparticles, and polyoxometalates. Furthermore, through AI copilot-assisted exploration, the system discovered four previously unreported Mn–W polyoxometalate clusters (specifically Mn4W18, Mn4W8, Mn8W26, and Mn57W42). These advances highlight the potential of human–AI collaboration in accelerating materials discovery. Together, these studies demonstrate how LLM-driven automation bridges reasoning and execution, advancing chemistry toward fully autonomous discovery.
Building upon these developments, we now consider how similar approaches are emerging in porous materials research. In the following section, we examine representative examples of human-in-the-loop reasoning, closed-loop automation, and LLM–robotics integration.
One prominent example is the GPT-4 Reticular Chemist, a pioneering framework designed to guide the discovery of new MOFs through seamless, conversational collaboration between a chemist and an LLM.62 Operating entirely through natural language, it eliminates the need for specialized coding skills, making advanced LLM-based reasoning accessible to any researcher in the field. The system's workflow is structured into three distinct but interconnected phases: (1) reticular ChemScope, which generates a high-level conceptual research plan based on user input; (2) reticular ChemNavigator, which evaluates experimental outcomes and proposes the next set of actions with supporting rationale; and (3) reticular ChemExecutor, which produces detailed step-by-step laboratory protocols for execution. A feedback mechanism links these stages. After each experiment, the researcher provides a natural-language summary of the outcome, which the LLM incorporates as contextual memory to refine subsequent decisions. To mitigate potential relation-type hallucination, where the LLM may infer incorrect relationships or provide flawed interpretations of experimental data such as NMR or TGA, the authors adopted a human-led analysis strategy supplemented by conventional ML algorithms to ensure analytical reliability. The workflow was applied to the exploration of the MOF-521 isoreticular series, guiding progression from linker synthesis and reaction-condition refinement to final characterization, indicating how LLM-assisted planning can complement human experimental execution. The authors also report good agreement between simulated and experimental PXRD patterns, with experimentally measured surface areas consistent with computational predictions. Furthermore, the authors discuss reproducibility at the level of the discovery process rather than in terms of identical textual outputs. 
In their study, the total number of prompt iterations required to reach optimized synthesis and characterization stages was similar across different MOF-521 derivatives, despite variations in intermediate decision paths. This observation suggests that, in conversational LLM-driven workflows, reproducibility can be framed in terms of achieving similar overall iteration counts and workload, even when intermediate decision paths differ.
Beyond the GPT-4 Reticular Chemist, another representative example of human-in-the-loop (HITL) reasoning is the MOFsyn agent, an advanced framework that enhances the synthesis optimization of MOF catalysts through a retrieval-augmented generation (RAG) framework (Fig. 5a).63 MOFsyn integrates three synergistic components: (1) the Data Automatic Analyzer, which autonomously performs ML-based analysis of catalytic data without coding expertise; (2) the Material Mechanism Analyzer, which combines RAG-based literature querying with real-time online searches to provide mechanistic insights and synthesis recommendations; and (3) the Experimental Protocol Navigator, which guides iterative human–AI collaboration through adaptive experimental design. Underlying the interaction among these components, MOFsyn employs chain-of-thought (CoT) reasoning to decompose complex synthesis optimization tasks into explicit, sequential reasoning steps.
Fig. 5 Representative examples of large language model (LLM)-enabled workflows discussed in Section 2.3. (a) MOFsyn agent illustrating an LLM-enabled human-in-the-loop framework for synthesis guidance. Adapted from ref. 63. Copyright 2025 American Chemical Society. (b) Autonomous MOF design framework (ChatMOF) combining LLM reasoning, property prediction, and inverse design. Adapted from ref. 23 and licensed under CC-BY 4.0. (c) An example of robotics-assisted experimentation: the ChatGPT Research Group multi-agent workflow. Adapted from ref. 24 and licensed under CC-BY 4.0.
The RAG framework in MOFsyn grounds synthesis reasoning in domain-specific knowledge curated from the literature and supplemented by real-time web retrieval, addressing the limitations of static LLM training. The local corpus was constructed from 508 records retrieved from the Web of Science Core Collection on June 17, 2024, using targeted search queries focusing on Ce-based MOFs for hydrogenation, Ni nanoparticle-catalyzed hydrogenation, structure–activity relationships of Ce-MOF-supported Ni catalysts, and DCPD hydrogenation mechanisms. Retrieval over the curated corpus employs an ensemble strategy that combines Facebook AI similarity search (FAISS, for dense embedding retrieval)110 and Best Matching 25 (BM25, for sparse lexical retrieval)111 with equal weighting (0.5/0.5). This design leverages the complementary strengths of sparse retrieval (precise matching of specific chemical formulas and identifiers) and dense retrieval (capturing broader semantic contexts and thematic analogies), effectively balancing the trade-off between keyword-level accuracy and contextual relevance. The ensemble retriever112 first identifies the top-10 candidate documents from the local corpus. These are then combined with web-retrieved candidates to form a preliminary pool. Retrieval precision is enhanced through a two-stage reranking pipeline: cosine similarity filtering (threshold = 0.8) followed by cross-encoder semantic reranking using Cohere's rerank-english-v2.0 model. The multi-stage design mitigates the inclusion of weakly relevant documents from initial retrieval, improving contextual precision before prompt injection. After reranking, the five most relevant documents are selected and injected into the final prompt as contextual grounding.
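The ensemble scoring and threshold-filtering logic described above can be illustrated with a minimal, dependency-free sketch. Here cosine similarity over toy vectors stands in for FAISS dense retrieval, a simple term-overlap score stands in for BM25, and the cross-encoder reranking stage is omitted; all function and variable names are our own, not the authors' implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def ensemble_retrieve(query_vec, query_terms, docs,
                      w_dense=0.5, w_sparse=0.5, top_k=10, sim_threshold=0.8):
    """Score each document by an equal-weight blend of a dense (cosine) and a
    sparse (term-overlap) similarity, take the top_k candidates, then keep only
    those whose dense similarity clears the cosine threshold -- mirroring the
    0.5/0.5 FAISS/BM25 ensemble plus cosine filtering described for MOFsyn."""
    scored = []
    for doc in docs:
        dense = cosine(query_vec, doc["vec"])
        sparse = len(query_terms & doc["terms"]) / max(len(query_terms), 1)
        scored.append((w_dense * dense + w_sparse * sparse, dense, doc["id"]))
    scored.sort(reverse=True)  # highest blended score first
    return [doc_id for _, dense, doc_id in scored[:top_k] if dense >= sim_threshold]
```

In the real pipeline the surviving candidates would then be reranked by a cross-encoder before the top five are injected into the prompt.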
To optimize LLM prompting for RAG reasoning, MOFsyn employs a specialized prompt-engineering framework called R.O.S.E.S (role definition, objective statement, scenario description, expected solution, and structured output). In both the Retrieval-Augmented Generation Assessment (RAGAS)113 and a 100-question MOF materials test, GPT-4o achieved the highest overall RAG performance, outperforming DeepSeek-V3, GLM-4-Flash-250414, and Qwen2.5-MAX under the evaluated configuration (Fig. 6a). Retrieval quality was quantified using standard RAGAS metrics (faithfulness, answer relevance, context recall, and answer similarity) alongside performance on the domain-specific test set. For GPT-4o, ensemble retrieval improved performance from 89% (no retriever) to 95%. In practical deployment, online retrieval typically responds within 1–2 s. However, multi-stage retrieval and reranking introduce additional latency and token overhead, particularly as corpus size increases, which should be explicitly considered in reproducibility and scalability evaluations. Regarding reproducibility, the authors explicitly controlled generative variability by setting the inference temperature to 0.1, with the stated goal of reducing stochastic variability and obtaining accurate and consistent outputs during synthesis reasoning.
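A R.O.S.E.S-style prompt can be assembled programmatically as in the following sketch. The five section labels follow the acronym expansion given above, but the exact wording and layout of the authors' template are not reproduced here; this is an illustrative structure only.

```python
def roses_prompt(role, objective, scenario, expected_solution, output_schema):
    """Compose a prompt with the five R.O.S.E.S sections (role definition,
    objective statement, scenario description, expected solution, structured
    output). Retrieved context would typically be appended after these
    sections before the prompt is sent to the model."""
    return "\n\n".join([
        f"Role: {role}",
        f"Objective: {objective}",
        f"Scenario: {scenario}",
        f"Expected solution: {expected_solution}",
        f"Structured output: respond as JSON matching {output_schema}",
    ])
```

Keeping every request in this fixed shape makes model outputs easier to parse and compare across the iterative synthesis-optimization loop.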
Fig. 6 Benchmarking the performance of different large language models (LLMs) used in emerging LLM-enabled frameworks for porous materials research. (a) Impact of retrieval augmentation in MOFsyn, where ensemble retrieval provides the highest accuracy overall (top), with model- and category-dependent variation observed across the MOF materials exam (bottom). Adapted from ref. 63. Copyright 2025 American Chemical Society. (b) Foundation model benchmarking in SciToolAgent shows OpenAI-o1 as the best-performing model, with GPT-4o providing the most efficient accuracy–cost balance. Performance metrics represent mean values over repeated benchmark evaluations (n = 5), with GPT-4o showing strong overall performance across task categories. Adapted from ref. 65. Copyright 2025 Springer Nature America.
In practical application, the system optimized the synthesis of a Ni@UiO-66(Ce) catalyst. MOFsyn identified limitations in conventional one-pot reduction routes and recommended a two-step low-temperature reduction pathway, which later yielded improved selectivity and conversion. These improvements were further contextualized by confirming the structural integrity of UiO-66(Ce) via XRD and validating the Ni(0) active species, with the effects of excess NaBH4 discussed in accordance with literature-reported mechanisms. Taken together, this case illustrates how retrieval-grounded reasoning can support iterative decision-making within a human–AI synthesis workflow.
A representative example of LLM-integrated computational exploration is found in LLM-guided design of organic structure-directing agents (OSDAs) for zeolite synthesis.64 Although not directly targeting porous frameworks themselves, OSDAs play a pivotal role in the synthesis of zeolites, which represent one of the most important classes of porous materials.114 OSDAs are quaternary ammonium cations that guide zeolite crystallization and stabilize specific framework topologies.115 In this work, Ito et al.64 developed an iterative closed-loop workflow that couples GPT-4 with atomistic simulations to progressively refine OSDA candidates. The LLM generates new OSDA molecules starting from the simple prototype tetramethylammonium (TMA). Empirical domain filters, such as a carbon-to-nitrogen ratio between 4 and 20 and fewer than five rotatable bonds, are applied, and unsuitable molecules are rejected with natural language feedback returned to the LLM. Screened molecules undergo stabilization-energy calculations to obtain zeolite affinity scores, and qualified candidates are stored in a growing database. The authors further examined how output variability depends on the model temperature by systematically analyzing GPT-4-generated OSDA molecules under fixed input prompts. At a temperature of 0.3, the model produced relatively deterministic outputs, but a noticeable fraction of the generated molecules were duplicated or closely resembled the inputs, leading to limited diversity. At temperatures below 0.3, the fraction of unique molecules further decreased, indicating inefficient exploration of chemical space. By contrast, increasing the temperature to intermediate values (0.7–1.1) resulted in more diverse molecular proposals while maintaining a high fraction of syntactically valid SMILES.
At higher temperatures exceeding 1.3, the fraction of parsable SMILES decreased markedly due to fabrication-type hallucination, where the model produced nonsensical or near-random character sequences leading to syntactically invalid SMILES strings. Based on these observations, the authors adopted a stochastic sampling strategy by randomly selecting temperatures from 0.7 to 1.1 during iterative design, explicitly balancing reproducibility, molecular diversity, and exploration efficiency within the closed-loop workflow.
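One iteration of this stochastic temperature strategy can be sketched as follows. This is a minimal illustration under stated assumptions: `propose` stands in for the GPT-4 generation call and `is_valid_smiles` for an RDKit-style SMILES parser; the authors' actual implementation, prompts, and scoring stages are not reproduced.

```python
import random

def sample_temperature(rng, low=0.7, high=1.1):
    """Draw a generation temperature uniformly from the 0.7-1.1 window the
    authors used to balance diversity against SMILES validity."""
    return rng.uniform(low, high)

def closed_loop_step(rng, propose, is_valid_smiles, seen):
    """One simplified design iteration: sample a temperature, request
    candidates at that temperature, then keep only novel, parsable SMILES.
    Accepted candidates would next go to stabilization-energy scoring."""
    t = sample_temperature(rng)
    accepted = []
    for smi in propose(t):
        if is_valid_smiles(smi) and smi not in seen:
            seen.add(smi)       # track novelty across iterations
            accepted.append(smi)
    return t, accepted
```

Duplicates and unparsable strings are silently dropped; in the published workflow rejected molecules additionally trigger natural-language feedback to the LLM.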
Using this temperature-controlled design strategy, the workflow was tested on three cage-type zeolites: CHA,116 AEI,117 and ITE.118 The resulting candidates showed stabilization energies comparable to or higher than those reported for experimentally validated OSDAs. However, synthesizability was not considered in the current implementation, and the model occasionally failed to infer realistic synthetic pathways for structurally complex candidates. These observations indicate that further integration of synthesis-oriented considerations would be beneficial for translating computationally proposed OSDAs into experimentally realizable systems.
Beyond tightly coupled simulation feedback loops, related studies emphasize LLM-driven orchestration across complex computational workflows. In this direction, Ding et al. proposed SciToolAgent, an LLM-based framework designed to autonomously orchestrate hundreds of specialized scientific tools across biology, chemistry, and materials science through the scientific tool knowledge graph (SciToolKG).65 The core of SciToolAgent lies in leveraging SciToolKG to mediate communication between the LLM and integrated scientific tools, enabling intelligent tool selection and execution through graph-based RAG. The system consists of three main modules: the planner, executor, and summarizer. When the summarizer identifies suboptimal outcomes, it prompts the planner to refine the tool chain, reducing trial-and-error costs and iteratively optimizing until satisfactory results are achieved.
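The planner–executor–summarizer refinement loop just described can be sketched schematically. In this minimal, dependency-free illustration the three callables stand in for SciToolAgent's LLM-backed modules, and the SciToolKG-based tool selection is not modeled; all names are our own.

```python
def plan_execute_summarize(plan, execute, summarize, query, max_refinements=3):
    """Schematic SciToolAgent-style loop: the planner proposes a tool chain,
    the executor runs it, and the summarizer either accepts the result or
    returns a critique that prompts the planner to refine the chain."""
    feedback = None
    for _ in range(max_refinements):
        chain = plan(query, feedback)       # planner (re)plans the tool chain
        result = execute(chain)             # executor runs the selected tools
        verdict = summarize(result)         # summarizer judges the outcome
        if verdict["ok"]:
            return verdict["answer"]
        feedback = verdict["critique"]      # feed the critique back to the planner
    return None  # give up after the refinement budget is exhausted
```

The refinement budget bounds trial-and-error cost, which is the behavior the authors highlight when the summarizer flags suboptimal outcomes.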
SciToolAgent reported 94% accuracy on benchmark datasets, outperforming previous models by approximately 10%. Among various foundation models tested, OpenAI o1 showed the highest absolute accuracy, whereas GPT-4o offered the most favorable balance between performance and computational cost and was therefore selected as the default model (Fig. 6b). Consistent performance trends observed across repeated benchmark evaluations (n = 5) and multiple foundation models provide an indirect indication of system-level reliability. The framework demonstrated versatility across four case studies: protein design, chemical reactivity prediction, synthesis planning, and MOF screening. In the MOF screening task, SciToolAgent autonomously identified a MOF with high thermal stability (above 400 °C), high CO2 adsorption capacity (above 100 mg g−1), and low synthesis cost (below ¥100). It then constructed and executed a sequential workflow integrating a machine-learning-based predictor (MOFSimplify89) and a molecular simulation tool (RASPA2 (ref. 119)) to select the optimal candidate. However, despite its promising performance, the system still faces scalability limitations due to the manual construction of SciToolKG and a reliance on proprietary models such as GPT-4o.
Recent approaches have pushed autonomous systems toward knowledge-grounded inverse design, where LLMs actively reason about how to modify material structures to achieve target properties. In this context, Ansari et al. introduced the dZiner framework, which exemplifies inverse design automation that explicitly incorporates domain knowledge into the generative loop.66 It performs rational inverse design through a RAG loop that iteratively retrieves domain knowledge from literature, proposes chemical modifications, and assesses feasibility through LLM-based reasoning. Underlying this process, dZiner employs chain-of-thought (CoT) reasoning to translate retrieved design principles into explicit, stepwise modification strategies, preserving interpretable reasoning traces across successive iterations.
The workflow consists of three core capabilities: extracting design rules from text, generating new candidates in natural language, and evaluating them using surrogate models such as MOFormer120 for CO2 adsorption prediction. Powered by Claude 3.5 Sonnet and GPT-4o, dZiner supports both fully automated reasoning cycles and an optional human-in-the-loop mode, enabling collaboration when expert oversight is desired. When tasked with designing a linker for a high-uptake MOF, the model autonomously suggested nitrogen-rich and fluorinated heterocycles that enhanced the predicted CO2 adsorption capacity, indicating that textual chemical knowledge can be coupled with surrogate model feedback in an automated loop. Here, the agent was implemented using a fixed inference temperature (0.3) and a ReAct-based chain-of-thought architecture. To mitigate fabrication-type hallucination, where the LLM produced syntactically invalid SMILES corresponding to non-existent or chemically infeasible structures, the authors introduced an additional RDKit-based verification step to filter out invalid candidates before downstream evaluation.
While dZiner represents a major step toward automatic inverse design, its SMILES-based representation inherently oversimplifies complex frameworks such as MOFs, and its lack of multimodal understanding (e.g., interpretation of images, schemes, or structural diagrams) remains a key limitation.
At a higher level of semantic abstraction, LLM-mediated autonomous systems aim to translate directly between natural language descriptions of porous materials and structured representations, enabling language-driven reasoning and modular tool orchestration beyond manually implemented optimization pipelines. Building on recent advances in large-language-model-based materials research, Kang et al. developed ChatMOF, an autonomous AI system for predicting and generating MOFs (Fig. 5b).23 ChatMOF couples LLMs (GPT-4 and GPT-3.5) with external materials databases, including CoRE MOF,88 QMOF,121 and DigiMOF,83 as well as machine-learning-based property predictors such as MOFTransformer,30 within a LangChain-based tool-orchestration framework inspired by the ReAct122 and modular reasoning, knowledge and language (MRKL)123 architectures.
Within this architecture, the ReAct paradigm enables ChatMOF to explicitly alternate between reasoning steps, where the LLM interprets user intent and plans subsequent actions, and action steps, in which specialized tools such as databases, predictors, or generative algorithms are invoked. Complementarily, the MRKL design allows the LLM to route queries to appropriate expert modules rather than attempting to solve all tasks internally, thereby leveraging domain-specific tools while maintaining a unified language-driven interface. During the reported evaluations, ChatMOF was operated without model fine-tuning, with a low temperature setting (0.1) used during inference.
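The reason–act alternation described above can be reduced to a short schematic loop. This is a sketch of the ReAct pattern only, not ChatMOF's LangChain implementation: `llm_step` stands in for the model call, and `tools` for the database, predictor, and generator modules it routes to.

```python
def react_loop(llm_step, tools, query, max_steps=5):
    """Minimal ReAct-style loop: the model alternates a reasoning step
    (choosing a tool and its input) with an action step (invoking that
    tool), feeding each observation back into the running context until
    it decides to finish."""
    context = [("question", query)]
    for _ in range(max_steps):
        decision = llm_step(context)              # reasoning step
        if decision["action"] == "finish":
            return decision["answer"]
        observation = tools[decision["action"]](decision["input"])  # action step
        context.append(("observation", observation))
    return None  # step budget exhausted without a final answer
```

Real agent frameworks add prompt templates, output parsing, and error handling around this skeleton, but the control flow is the same.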
Through this modular reasoning–acting pipeline, ChatMOF performs a range of natural-language-driven tasks, including data retrieval, property prediction, and inverse structure generation using genetic algorithms. Under GPT-4 evaluation, the system achieved up to 96.9% accuracy in search tasks, 95.7% in property prediction, and 87.5% in generative design benchmarks, excluding cases involving token-length overflow. These benchmark tasks primarily evaluate the correctness of tool selection and multi-step orchestration on curated query sets rather than end-to-end experimental research workflows. In this context, generative validity refers to producing structurally consistent MOF representations suitable for downstream computation and does not in itself imply synthetic accessibility or experimental performance improvement. By embedding LLMs as autonomous agents capable of coordinating databases, predictive models, and generative routines, ChatMOF exemplifies an emerging class of LLM-orchestrated discovery systems that bridge human dialogue, materials knowledge bases, and computational evaluation tools. Nonetheless, its performance remains constrained by token-length limitations, occasional reasoning failures, and limited generative diversity during MOF generation.
Together, these studies illustrate how LLMs function as research co-pilots across a range of computationally structured discovery workflows, from interactive synthesis planning to closed-loop exploration and workflow orchestration.
In recent years, robotic platforms have been combined with data-driven algorithms such as Bayesian optimization and genetic algorithms. These approaches have been used to efficiently search synthesis spaces and identify optimal conditions. For instance, Bayesian optimization accelerated the synthesis of ZIF-67,124 and genetic algorithms guided the growth of surface-anchored MOF (SURMOF) HKUST-1 thin films with high crystallinity and uniform orientation.125
More recently, a growing research direction has explored the integration of LLMs into robotic workflows, particularly at the level of experimental planning and strategy formulation. LLMs can interpret scientific literature, propose synthesis strategies, and assist in automating experiment design. Although such systems do not yet operate in real time or without human input, recent examples such as ChatGPT Research Group,24 MOFGen67 and the green synthesis of Zn-HKUST-1 (ref. 68) illustrate how LLM-guided workflows can bridge conceptual reasoning and physical experimentation. These integrated pipelines mark the early stages of SDL systems, where AI agents increasingly support experiment design, execution, and evaluation as part of a unified workflow.
A representative example of this new paradigm is the ChatGPT Research Group24 (Fig. 5c). Composed of seven AI agents integrated with Bayesian Optimization (BO), the system was designed to automate experimental planning and optimization for microwave synthesis. Each agent was assigned a specific scientific role through prompt engineering, including strategy planning, literature search, coding, robotic operation, labware design, safety inspection, and data analysis. In particular, the safety-related responsibilities were implemented through a dedicated chemistry consultant agent that provided guidance on laboratory safety precautions, such as handling microwave irradiation, pressure buildup, and chemical hazards during synthesis. Outputs were passed sequentially between agents, forming a multi-step reasoning pipeline rather than a single monolithic model.
The framework was applied to optimize synthesis conditions for MOF-321, MOF-322, and COF-323. For MOF-321, it identified optimal conditions after 120 robotic experiments conducted over 4.5 days. To evaluate synthesis outcomes, the authors introduced a crystallinity index (CI) defined as the ratio between the height of the primary diffraction peak and its full width at half maximum, with higher values corresponding to sharper and more crystalline products. The index increased steadily across iterations (1 to 36), indicating progressive convergence rather than random parameter exploration. Similar optimization trends were observed for MOF-322 and COF-323, suggesting that the workflow may be extensible to other porous materials systems. Despite its success, the framework remains semi-automated. The authors note that future integration with more advanced robotic platforms could further enhance its autonomy and experimental throughput. The authors also emphasize that the crystallinity index serves as a proxy optimization metric and does not necessarily guarantee improved porosity or water uptake.
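The crystallinity index defined above (primary peak height divided by its full width at half maximum) can be computed from a discretized diffraction pattern as in the following sketch. This assumes a baseline-corrected, single-dominant-peak pattern and estimates the FWHM by linear interpolation; the exact peak-extraction protocol used in the study may differ.

```python
def crystallinity_index(two_theta, intensity):
    """Crystallinity index CI = (primary peak height) / FWHM, with the FWHM
    found by walking out from the tallest point to the half-maximum crossings
    and linearly interpolating between samples."""
    i_max = max(range(len(intensity)), key=intensity.__getitem__)
    half = intensity[i_max] / 2.0

    left = i_max
    while left > 0 and intensity[left] > half:      # walk to the left crossing
        left -= 1
    right = i_max
    while right < len(intensity) - 1 and intensity[right] > half:  # right crossing
        right += 1

    def crossing(i_lo, i_hi):
        # Interpolate the 2-theta value where intensity crosses the half-maximum.
        y0, y1 = intensity[i_lo], intensity[i_hi]
        if y1 == y0:
            return two_theta[i_lo]
        frac = (half - y0) / (y1 - y0)
        return two_theta[i_lo] + frac * (two_theta[i_hi] - two_theta[i_lo])

    fwhm = crossing(right - 1, right) - crossing(left, left + 1)
    return intensity[i_max] / fwhm
```

A sharper, taller peak gives a larger CI, which is why the index served as a convenient scalar objective for the Bayesian optimization loop, subject to the proxy-metric caveat noted above.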
MOFGen is a multi-agent AI system developed to accelerate the discovery of MOFs while ensuring their synthetic feasibility.67 The framework integrates LLMs, diffusion-based generative models, machine-learning force fields, quantum mechanical computations, synthesis accessibility predictors, and robotic experimentation. Within the system, an LLM agent referred to as MOFMaster defines design constraints and coordinates the overall workflow. LinkerGen proposes novel linker molecules based on these constraints, and CrystalGen subsequently generates three-dimensional crystal structures using a diffusion model trained on MOF data. Candidate structures are then optimized using QForge, which applies quantum mechanical screening and filters out unstable or non-porous configurations. The authors also report that the generated linker chemistry exhibits features consistent with experimental trends in Zn-based MOFs, including a high prevalence of dicarboxylate and aromatic motifs and comparatively rare pyridine-containing linkers. They further observe that structures generated with Zn SBUs tended to display greater stability relative to other metal SBUs, aligning with the prevalence of Zn-based frameworks in experimental databases. The synthetic accessibility of remaining candidates is assessed by SynthABLE using a set of machine-learned rules derived from experimental data. QHarden then performs energy refinement by sequentially applying different levels of density functional theory. Finally, SynthGen conducts experimental validation through high-throughput robotic synthesis using a programmable liquid-handling platform that is guided by LLM-generated instructions and followed by X-ray characterization.
Using a combination of crossover mutation, structure reimagining, and de novo generation, MOFGen produced five experimentally realized MOFs, including one incorporating a linker not previously used in MOFs, 2,3-dimethyl-2-butenedioic acid. Although human intervention was still necessary during model reliability assessments and structure revision, the system represents an early example of integrating LLM-based reasoning with autonomous experimentation. In its current form, MOFGen may be regarded as a transitional stage toward self-driving laboratories, where computational design and robotic execution are connected within a unified workflow.
As a complementary example, Vu et al. demonstrated the integration of LLMs with robotic synthesis to accelerate the discovery of greener synthetic routes for porous materials.68 They used an LLM to extract and structure nitrate-based Zn-HKUST-1 synthesis conditions and then infer plausible concentration ranges for substituting Zn(NO3)2 with ZnCl2, which the authors described as an environmentally preferable alternative. Based on the data extracted by the LLM, candidate experimental conditions were identified, streamlining the traditionally manual process of literature review.
These conditions were then executed using an OT-2 pipetting robot, which rapidly screened 22 concentration variations in five minutes, significantly reducing experimental setup time compared to manual handling. After synthesis, optical microscopy images were automatically evaluated using a CLIP-based classifier capable of distinguishing crystalline versus non-crystalline products.
Through this workflow, Zn-HKUST-1 crystals were successfully synthesized at a ZnCl2 concentration of 0.6 M, and scanning electron microscopy (SEM) confirmed the expected cubic morphology. The obtained products were further verified against International Centre for Diffraction Data (ICDD) reference patterns by matching XRD peak positions and lattice constants, which were found to be close to reported literature values. While the combination of LLM reasoning, automation, and AI-based evaluation considerably reduced experiment iteration time, human oversight remained necessary, and the pipeline is not yet fully autonomous. This caution reflects the risk of relation-type hallucination, in which the AI may incorrectly predict promising synthesis conditions that are experimentally unfeasible or suboptimal despite appearing chemically plausible. Nevertheless, this work demonstrates a meaningful progression toward self-driving laboratory frameworks by integrating multiple stages of the experimental cycle under AI guidance.
Taken together, these developments reflect a gradual shift from LLM-assisted reasoning toward increasingly automated experimentation in porous materials research. While current systems remain dependent on human oversight and are limited by model reliability and hardware constraints, they establish the groundwork for more integrated workflows. As multimodal understanding, tool interoperability, and experiment-aware feedback continue to advance, LLM-driven platforms are expected to accelerate discovery and bring the field closer to practical self-driving laboratories. At the same time, scaling these systems toward higher levels of autonomy will require careful consideration of failure modes, safety governance, and structured risk assessment frameworks, particularly when operating under potentially hazardous experimental conditions.
A second limitation is the potential for bias in LLM-generated outputs within porous materials research.130 This issue arises in part because scientific literature is disproportionately skewed toward reporting successful experiments, while failed or inconclusive outcomes are rarely documented. Taniike and Takahashi note that most publications emphasize high-performing catalysts or successful reaction outcomes, and data-driven models may learn to regard only reported catalyst compositions or reaction conditions as correct solutions.131 As a result, model outputs may reproduce established strategies rather than propose unconventional hypotheses or underexplored directions. More specifically, literature imbalance may affect different LLM workflows in different ways.132 In schema-based extraction of reported synthesis conditions, performance may depend more strongly on reporting style and prompt design than on literature frequency alone. In contrast, for generative tasks such as suggesting synthesis pathways or prioritizing candidates, models may favor well-represented families and conventional routes that are more frequently observed in training data and the literature.133 Systematic, class-stratified evaluations across porous material families remain limited and would benefit from more dedicated benchmarking efforts. In addition, unsuccessful attempts are rarely quantified in reported case studies, making it difficult to rigorously assess the true discovery efficiency of LLM-assisted workflows. Systematic reporting of failure rates and attempted conditions would provide a more realistic evaluation of model-guided synthesis and help mitigate publication bias.
Finally, achieving fully autonomous self-driving laboratories remains difficult at the current stage. For example, certain steps in porous material synthesis, such as thermal–solvation processes, still require human intervention and cannot yet be executed reliably by automated platforms.68 In addition, while LLMs excel at natural language reasoning, they do not inherently understand causal relationships, physical constraints, or experimental feasibility. For instance, Mandal et al. report that although Claude-3.5-Sonnet performs well on materials science benchmarks, its performance drops notably when deployed in an autonomous atomic force microscopy (AFM) framework.127 They further observe that LLM-controlled experimental behavior can be highly sensitive to minor variations in prompt phrasing, introducing instability in execution. Moreover, Kitchin notes that dynamically generated experimental procedures may pose reproducibility challenges and raise safety or security concerns when executed without adequate oversight.134 Taken together, these observations indicate that, despite rapid progress, LLMs continue to face substantial obstacles in enabling fully automated experimental platforms.
Prompt engineering relies on zero-shot135 or few-shot136 learning, in which the model performs a task either (i) without examples (zero-shot) or (ii) with a small number of in-context examples (few-shot), shaping the intended behavior without modifying model parameters. Zero-shot learning allows the model to generalize purely from prior pretraining, whereas few-shot learning leverages minimal contextual demonstrations to steer task behavior. Because no additional training is required, this approach incurs relatively low financial and computational cost and can be implemented using limited domain-specific data. Moreover, prompt-based control allows LLMs to flexibly handle a wide range of task variations, such as data extraction from literature, hypothesis generation, and iterative decision-making, by adjusting instructions rather than model weights. For example, CCA51 and GPT-4V Image Mining52 employed prompt-based strategies for literature data extraction and multimodal figure analysis without parameter updates, demonstrating that structured information retrieval tasks can be performed effectively through well-designed prompts alone. Similarly, autonomous and agent-based systems such as the GPT-4 Reticular Chemist59 and dZiner66 relied primarily on prompt-driven reasoning frameworks to support materials design and synthesis planning, highlighting the practical viability of prompt engineering in complex, multi-step workflows. This flexibility makes prompt engineering particularly attractive for autonomous or semi-autonomous research systems, where models must respond adaptively to new experimental outcomes, shifting objectives, or unforeseen edge cases.
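The zero-shot/few-shot distinction amounts to whether worked input–output demonstrations are interleaved between the instruction and the query. A minimal prompt builder makes this concrete; the labels and layout here are illustrative (real extraction prompts usually also specify an output schema):

```python
def build_prompt(instruction, query, examples=None):
    """Compose a zero-shot prompt (instruction + query) or a few-shot prompt
    (instruction + worked examples + query). Passing `examples` switches the
    same task from zero-shot to few-shot without any parameter update."""
    parts = [instruction]
    for example_input, example_output in (examples or []):
        # Each in-context demonstration shows the desired input/output mapping.
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {query}\nOutput:")  # the model completes this line
    return "\n\n".join(parts)
```

Because the task definition lives entirely in the prompt, swapping instructions or demonstrations retargets the model to a new extraction or reasoning task instantly, which is the flexibility discussed above.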
Fine-tuning, in contrast, involves explicitly updating model parameters using curated domain-specific datasets to create specialized LLMs optimized for a narrow class of tasks. In porous materials research, such fine-tuning substantially improves accuracy in tasks that require precise interpretation of scientific information. For example, prior systems such as L2M3,53 Paragraph2MOFInfo,33 and NERRE Extractor56 demonstrate markedly higher reliability in data extraction, categorization, and entity recognition. Moreover, fine-tuned LLMs benefit from domain-specific knowledge that can directly support design-oriented applications, such as the LLM-based hydrophobicity predictor or the MOF linker mutation model. However, this specialization comes at a cost. Fine-tuned models may lose some of the generality and creative reasoning capacity of their base counterparts, making them less suitable for open-ended or exploratory tasks. In addition, fine-tuning requires substantial investment in data collection, annotation, and quality control, as well as significant computational resources. For proprietary LLMs, retraining or customized fine-tuning can be restricted or expensive, further limiting practical deployment. Even when domain-specific datasets are relatively limited in size, fine-tuning can still enhance task-specific performance; to mitigate overfitting under such data-sparse conditions, Parameter-Efficient Fine-Tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) are particularly effective.137 Nevertheless, additional validation remains essential to ensure robustness beyond the training distribution.
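The parameter savings behind LoRA can be illustrated with its core update rule. This is a toy numerical sketch of the rank-decomposed weight update from the LoRA paper, not an actual fine-tuning run; the matrix values, rank, and scaling are placeholders, and a real workflow would apply this through a PEFT library to the model's attention or projection layers.

```python
# LoRA freezes the pretrained weight matrix W and trains a low-rank pair
# (A, B) instead, using the effective weight
#     W_eff = W + (alpha / r) * B @ A
# For a d x d layer this trains 2*d*r parameters instead of d*d, which is
# the source of the efficiency on sparse materials data.

def matmul(X, Y):
    """Plain-Python matrix product for the toy example."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

W = [[1.0, 0.0],
     [0.0, 1.0]]            # frozen pretrained weights (d = 2)
B = [[0.5], [0.0]]          # trainable, shape (d, r) with r = 1
A = [[0.0, 2.0]]            # trainable, shape (r, d)
alpha, r = 2.0, 1           # scaling hyperparameters (illustrative values)

delta = matmul(B, A)        # rank-1 update, shape (2, 2)
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(2)]
         for i in range(2)]
print(W_eff)                # -> [[1.0, 2.0], [0.0, 1.0]]
```

Because only A and B receive gradients, the same frozen base model can host several small task-specific adapters, which suits the narrow, high-precision extraction tasks described above.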
Within fine-tuning approaches, parameter-efficient variants such as LoRA-based adaptation have also been explored. For example, SciToolAgent65 fine-tuned Qwen2.5-7B using a LoRA configuration, and L2M3 (ref. 53) applied LoRA to Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct models for comparative evaluation. However, these implementations were primarily used to benchmark or contrast model performance rather than as widely deployed adaptation strategies, and reported performance remained below that of larger proprietary GPT-based systems in the respective evaluation settings. In porous materials research, parameter-efficient tuning therefore appears less frequently as a dominant operational paradigm. As reflected in Table 1, prompt engineering and full fine-tuning remain the prevailing strategies in current practice.
Taken together, prompt engineering and fine-tuning should be viewed not as competing remedies but as complementary tools within the LLM adaptation landscape. Across the reviewed studies, the choice between these strategies has been largely task-dependent. Prompt engineering favors rapid prototyping, low-cost deployment, and adaptability to dynamic research workflows, and has been predominantly adopted in settings requiring adaptive multi-step reasoning and tool interaction. In contrast, fine-tuning emphasizes precision, reproducibility, and task-specific robustness, and has been more commonly used for narrowly defined, high-precision extraction or classification tasks with standardized outputs. In porous materials research, where both exploratory creativity and reliable information extraction are essential, the optimal strategy often lies in combining these approaches.
A lower temperature sharpens the probability distribution, making the model increasingly deterministic, whereas a higher temperature flattens the distribution and increases diversity. Most LLMs default to a temperature of around 1.0. In the OpenAI API documentation, lower values (e.g., 0.2) are described as producing more focused and deterministic outputs, whereas higher values (e.g., 0.8) lead to greater randomness.138 In the porous materials studies reviewed here, temperatures in the range of 0.0–0.3 are typically used when accuracy and stability are critical, while values above 1.0 are employed for creative ideation tasks. Thus, selecting an appropriate temperature is essential for aligning model behavior with the objective of a given task.
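The sharpening and flattening effect is easy to see in the temperature-scaled softmax itself. The sketch below uses three hypothetical next-token logits; the specific values are arbitrary and chosen only to make the contrast visible.

```python
import math

def softmax_with_temperature(logits, T):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    scaled = [x / T for x in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]                     # hypothetical next-token scores
low = softmax_with_temperature(logits, 0.2)  # sharp: top token dominates
high = softmax_with_temperature(logits, 1.5) # flat: probability mass spreads
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At T = 0.2 the top token captures nearly all probability mass (near-deterministic decoding), while at T = 1.5 the distribution spreads toward the alternatives, which is the mechanism behind the task-dependent settings discussed next.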
In porous materials research, temperature settings are often task-dependent. Deterministic outputs are desirable for data extraction, entity recognition, and classification, whereas proposing novel synthesis routes or generating molecular candidates may require more diverse and exploratory outputs. This distinction is reflected in the case studies reviewed in Section 2. For instance, L2M3 and NERRE (Section 2.2) both employed a temperature of 0, ensuring consistent extraction of chemical entities and synthesis conditions from the literature.
By contrast, studies involving autonomous or generative workflows (Section 2.3) adopted a broader range of temperatures. In zeolite synthesis tasks, temperature values between 0.7 and 1.1 were used to generate diverse OSDA molecules.64 Meanwhile, in systems like dZiner and MOFSyn, temperature values were kept low (0.3 and 0.1, respectively) to minimize hallucinations and maintain reliability during generation.
Taken together, current applications suggest that LLMs in porous materials research are generally operated at relatively deterministic settings, especially when chemical accuracy, reproducibility, and safe autonomous execution are required. This trend reflects the current priority of reliability over creativity in most real-world workflows, even though higher-temperature sampling remains valuable for creative generative tasks.
To mitigate these risks and operationalize epistemic humility, several concrete strategies can be employed. These include self-consistency or multi-sampling to estimate output variance, retrieval-grounded generation with traceable evidence, and tool-augmented cross-checking such as ReAct.141 Rather than issuing unconditional recommendations, such mechanisms allow models to communicate confidence levels and flag high-uncertainty decisions for human oversight, which is essential for trustworthy autonomous experimentation.
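One of these mechanisms, self-consistency via multi-sampling, can be sketched in a few lines. The sampled answers and the review threshold below are hypothetical placeholders; in practice the list would come from repeated calls to an LLM at nonzero temperature on the same question.

```python
from collections import Counter

def self_consistency(samples, review_threshold=0.6):
    """Majority-vote over repeated LLM samples.

    The agreement ratio serves as a crude confidence proxy: low agreement
    flags the decision for human oversight instead of issuing an
    unconditional recommendation. Threshold value is illustrative.
    """
    counts = Counter(samples)
    answer, n = counts.most_common(1)[0]
    agreement = n / len(samples)
    return {
        "answer": answer,
        "agreement": agreement,
        "needs_human_review": agreement < review_threshold,
    }

# Five hypothetical sampled answers to the same extraction question.
result = self_consistency(["DMF", "DMF", "DMF", "ethanol", "DMF"])
print(result)   # high agreement -> no escalation to a human
```

The same pattern generalizes beyond categorical answers, e.g. by clustering numerically similar outputs before voting, but the core idea is unchanged: variance across samples, not a single confident-sounding answer, drives the escalation decision.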
The importance of structured safety validation is underscored by benchmarking efforts such as LabSafety Bench, which evaluates LLMs across scenario-based tasks including hazard identification, consequence assessment, and safe protocol recommendation.139 In these evaluations, many advanced models achieved average scores in the 60–70% range, with notable variability across task categories and frequent failures in complex multi-factor safety scenarios. The study reports recurrent cognitive errors, including misalignment of safety priorities in which models emphasized obvious risks such as fire while overlooking more severe toxic gas release scenarios, hallucinated chemical interactions lacking mechanistic basis, and incomplete recognition of compound hazard interactions within experimental protocols. These findings demonstrate that coherent natural-language reasoning does not guarantee accurate hazard ranking or consequence prediction, underscoring the need for domain-specific safety benchmarking and validation prior to experimental deployment.
In porous-material case studies reviewed in Section 2.3, safety considerations are present but not yet formalized as analytical pillars. In the ChatGPT Research Group example, safety guidance was implemented through a chemistry consultant agent that advised on microwave irradiation, pressure buildup, and chemical hazards.24 However, structured failure-mode classification or quantitative safety auditing was not central to the study design. MOFGen emphasized synthetic accessibility and structural validation, while the Zn-HKUST-1 robotic workflow focused on experimental acceleration and qualitative environmental reasoning.67,68 These examples indicate that safety components are present but not yet formalized as standalone safety architectures within porous-material SDL implementations.
To enhance the safety of LLM-integrated laboratory systems, recent studies argue that safeguards must intervene across reasoning, execution, and governance rather than relying solely on physical containment. The notion of cognitive safety proposes that experimental plans generated by a primary LLM be automatically screened prior to execution for hazards such as excessive pressure buildup, incompatible reagents, runaway exothermic conditions, or violation of predefined safety constraints, thereby reducing execution-stage hazards arising from flawed reasoning. Execution-level protection incorporates sensor-aware robotics, including vision systems that detect transparent glassware or human intrusion into shared workspaces and thermal imaging to flag abnormal heat signatures, alongside constrained motion planning that limits transfer velocities, pouring angles, and collision trajectories to reduce spills and mechanical impact. Because serious laboratory accidents are rare, digital twin environments are proposed to simulate unsafe mixing conditions, collision events, or equipment failures prior to deployment, enabling stress testing under limited empirical accident data. Governance mechanisms complement these safeguards through structured risk assessment based on likelihood, severity, system complexity, and autonomy level, traceable logging of AI-generated decisions for accountability, near-miss reporting practices, and alignment with frameworks such as the EU AI Act, ISO 42001, and transparency standards including PRISMA-AI. For porous materials research, where solvothermal synthesis in sealed autoclaves and high-pressure or microwave-assisted reactions are common, integrating such layered safety architectures will be increasingly important as LLM-driven experimentation advances toward greater autonomy. Future work in LLM-driven porous materials research should integrate safety evaluation and autonomy-level risk auditing alongside performance metrics.
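A minimal form of the pre-execution screening described above can be expressed as a rule-based check over an LLM-generated plan. The constraint values, reagent pair, and plan fields in this sketch are invented placeholders for illustration only, not validated safety data; a deployed screen would draw its rules from curated hazard databases and institutional safety limits.

```python
# Hedged sketch of a rule-based "cognitive safety" screen applied to an
# LLM-generated synthesis plan before execution. All thresholds and the
# incompatibility table are illustrative placeholders.

MAX_AUTOCLAVE_TEMP_C = 220                              # placeholder limit
INCOMPATIBLE = {frozenset({"nitric acid", "ethanol"})}  # example pair only

def screen_plan(plan: dict) -> list[str]:
    """Return a list of violations; an empty list means the plan passes."""
    violations = []
    if plan.get("temperature_C", 0) > MAX_AUTOCLAVE_TEMP_C:
        violations.append("temperature exceeds autoclave limit")
    reagents = set(plan.get("reagents", []))
    for pair in INCOMPATIBLE:
        if pair <= reagents:
            violations.append(f"incompatible reagents: {sorted(pair)}")
    if plan.get("sealed_vessel") and plan.get("gas_evolving"):
        violations.append("gas-evolving reaction in a sealed vessel")
    return violations

plan = {"reagents": ["Cu(NO3)2", "H3BTC", "ethanol"],
        "temperature_C": 250, "sealed_vessel": True, "gas_evolving": False}
print(screen_plan(plan))   # -> ['temperature exceeds autoclave limit']
```

Such deterministic screens complement rather than replace the governance and execution-level safeguards above: they catch plans that violate known constraints, while novel or compound hazards still require human review and the benchmarking practices discussed earlier.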
Despite these significant advances, our discussion highlighted several remaining limitations, including high model costs, data imbalance in the scientific literature, the limited robustness of current autonomous laboratory frameworks, and the need for structured safety validation. We also discussed the trade-offs between prompt engineering and fine-tuning strategies, and the influence of temperature settings on the determinism, stability, and creativity of model outputs.
Looking ahead, several promising directions are emerging. Beyond model-level innovation, advancing LLM-driven porous materials research will require shared community infrastructure, including standardized benchmarks, curated evaluation datasets, reproducible safety guidelines, and evaluation frameworks capable of assessing genuine scientific novelty rather than performance on narrow task-specific metrics. Multimodal LLMs capable of jointly reasoning over experimental text, molecular structures, and image data such as X-ray diffraction (XRD) patterns are expected to expand the scope of tasks that can be robustly supported through automation. In parallel, integrating LLMs with structured knowledge sources such as knowledge graphs (KGs) may provide pathways toward more interpretable and constraint-aware reasoning. By retrieving relevant subgraphs rather than noisy text chunks, KG integration enables the model to perform complex, multi-hop queries, such as filtering materials by precursors, application, and stability simultaneously, while promoting responses that are more transparently grounded in traceable, literature-derived evidence. This integration can also help mitigate naming ambiguities by resolving coreferences to the correct crystal structures, providing factually accurate answers as verified by expert evaluations. LLMs are also expected to progress from assisting as experiment planners to operating as higher-level supervisors within self-driving laboratory ecosystems. Ensuring chemical validity, minimizing hallucinations, and maintaining operational and ethical safety will be critical for this transition, highlighting the need for structured failure-mode analysis, domain-specific safety benchmarking, supervisory monitoring architectures, and accountable governance mechanisms in LLM-driven workflows.
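The multi-hop filtering pattern mentioned above can be sketched over a toy knowledge graph. The triples below are invented placeholders (material names, predicates, and values are not curated literature data); the point is only that conjunctive constraints over graph edges replace fuzzy matching over raw text chunks.

```python
# Toy knowledge graph as (subject, predicate, object) triples. All entries
# are illustrative placeholders, not extracted literature facts.
TRIPLES = [
    ("MOF-A", "has_precursor", "ZrCl4"),
    ("MOF-A", "used_for", "CO2 capture"),
    ("MOF-A", "water_stable", "yes"),
    ("MOF-B", "has_precursor", "ZrCl4"),
    ("MOF-B", "used_for", "CO2 capture"),
    ("MOF-B", "water_stable", "no"),
    ("MOF-C", "has_precursor", "Cu(NO3)2"),
    ("MOF-C", "used_for", "CO2 capture"),
    ("MOF-C", "water_stable", "yes"),
]

def query(triples, **conditions):
    """Return subjects satisfying every (predicate, object) condition.

    Each keyword argument is one hop; intersecting the per-hop matches
    implements the conjunctive multi-hop filter.
    """
    hits = None
    for pred, obj in conditions.items():
        matched = {s for s, p, o in triples if p == pred and o == obj}
        hits = matched if hits is None else hits & matched
    return sorted(hits)

print(query(TRIPLES, has_precursor="ZrCl4",
            used_for="CO2 capture", water_stable="yes"))   # -> ['MOF-A']
```

In a retrieval-augmented setting, the matched subgraph (rather than the raw answer) would be handed back to the LLM as traceable evidence, which is what grounds its response in literature-derived facts.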
Real-world deployment of such systems will further require interoperable laboratory infrastructure, including hardware standardization, integration with laboratory information management systems (LIMS), implementation of physical safety interlocks, and alignment with emerging laboratory automation standards such as Synthetic Procedure Language (SPL), alongside appropriate regulatory and liability frameworks.
With continued advances across these areas, LLMs have the potential to evolve from supportive computational assistants into enabling technologies that contribute to increasingly autonomous and scientifically reliable self-driving laboratory ecosystems for porous materials research.
This journal is © The Royal Society of Chemistry 2026