Open Access Article
Fengxu Yang,a Weitong Chenb and Jack D. Evans*a
aSchool of Physics, Chemistry and Earth Sciences, Adelaide University, Adelaide 5005, Australia. E-mail: j.evans@adelaide.edu.au
bSchool of Computer and Mathematical Sciences, Adelaide University, Adelaide 5005, Australia
First published on 13th April 2026
Large language models (LLMs) are rapidly transforming materials science. This review examines recent LLM applications across the materials discovery pipeline, focusing on three key areas: mining scientific literature, predictive modelling, and multi-agent experimental systems. We highlight how LLMs extract valuable information, such as synthesis conditions from text, learn structure–property relationships, and can coordinate agentic systems integrating computational tools and laboratory automation. While progress has been largely dependent on closed-source commercial models, our benchmark results demonstrate that open-source alternatives can match performance while offering greater transparency, reproducibility, cost-effectiveness, and data privacy. As open-source models continue to improve, we advocate their broader adoption to build accessible, flexible, and community-driven AI platforms for scientific discovery.
The discovery of materials, such as metal–organic frameworks (MOFs),7–9 has become pivotal to breakthroughs in areas like energy storage, catalysis, and chemical separation.10,11 However, the traditional labour-intensive, trial-and-error research paradigm in materials research progresses slowly. Decades of research have produced a vast yet fragmented archive of knowledge scattered across millions of scientific publications, among other data repositories.12 Therefore, systematically extracting structured information from unstructured sources with minimal manual effort is crucial for data collection and analysis. Traditional techniques such as regular expressions (RegEx) and part-of-speech tagging rely heavily on rule-based patterns. Although transformer-based extraction methods represent a significant shift toward statistical learning, they are still often constrained by the scope of carefully curated training datasets. As a result, these approaches may struggle to capture the full diversity and variability of natural language expressions.13 In contrast, by leveraging their extensive pre-trained knowledge, LLMs offer an advanced opportunity to understand the intricate language of chemistry and its textual context, enabling them to process unstructured data with greater flexibility and without the need for domain-specific training or fine-tuning.14,15
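The brittleness of rule-based extraction can be illustrated with a minimal sketch. The pattern and phrasings below are purely hypothetical examples, not drawn from any of the cited works: a hand-written RegEx recovers a synthesis temperature from one phrasing but silently misses a semantically equivalent one.

```python
import re

# Illustrative only: a rigid rule-based pattern for extracting a
# synthesis temperature, of the kind used before LLM-based extraction.
PATTERN = re.compile(r"heated (?:at|to) (\d+)\s*°C")

def extract_temperature(sentence: str):
    """Return the temperature in °C if the rigid pattern matches, else None."""
    match = PATTERN.search(sentence)
    return int(match.group(1)) if match else None

# The pattern matches one phrasing...
print(extract_temperature("The mixture was heated at 120 °C for 24 h."))        # 120
# ...but fails on equivalent wording, showing the brittleness of fixed rules.
print(extract_temperature("Solvothermal treatment was carried out at 120 °C."))  # None
```

An LLM-based extractor, by contrast, is asked for the quantity in natural language and is largely insensitive to such surface variation.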
Building upon clean, actionable data as a foundation unlocks new possibilities in materials science research. One of the ambitious goals in materials science research is to establish the structure–property relationships that govern material performance. By training on vast and comprehensive datasets of chemical information, LLMs can potentially learn these intricate connections and provide valuable insights into the fundamental principles.16,17 Additionally, LLMs are transitioning from passive assistants to active participants in the research process. The most advanced applications now integrate LLMs as a central “brain” into research workflows, where these agentic systems can plan multi-step procedures, interface with computational simulation tools, and even operate robotic platforms.18–20
LLMs are poised to reshape the materials science landscape. This potential, however, has largely been explored using closed-source, commercial models such as the GPT series from OpenAI. These models often benefit from diverse and extensive training data that capture a broader spectrum of knowledge, along with proprietary reinforcement learning from human feedback (RLHF) pipelines, which are key to stronger reasoning capabilities and better semantic alignment essential for handling complex tasks.21 While these models are industry-leading, their closed-source nature presents many drawbacks: high costs for large-scale or high-throughput tasks, data privacy concerns, reduced reproducibility, and limited flexibility for model customisation.22 In parallel, the open-source LLM ecosystem has expanded significantly. While “open-source” in the context of LLMs typically refers to accessible weights rather than fully transparent training data or pre-training code, the ability to adapt these models through fine-tuning or reinforcement learning is generally sufficient for most research applications. The release of the Llama 3 family23 by Meta in early 2024 marked the first time open-source models achieved true commercial-grade competitiveness with their closed-source counterparts. This milestone established a strong foundation for both research and industry applications. Subsequently, the Qwen24 and GLM25 series have made substantial progress toward matching and even surpassing proprietary models.
In this review, we outline the use of LLMs in materials science applications, with particular focus on MOFs. We also examine the capabilities of rapidly emerging open-source models across these diverse tasks.
In the work by Ghosh et al., an LLM-driven workflow was developed to autonomously extract key thermoelectric properties (e.g., Seebeck coefficient, thermal conductivity) and associated structural properties (crystal class, space group, and doping strategy) from approximately 10 000 materials science articles.27 They also benchmarked different Gemini and GPT models and found that GPT-4.1 mini offered the best cost-performance balance. This effort resulted in the creation of the largest LLM-curated thermoelectric dataset, which contains 27 822 temperature-resolved property records for a diverse class of materials. A key strength of this work lies in its explicit focus on tables and their associated captions as distinct, high-value data sources, rather than relying solely on unstructured text. Similarly, Li et al. developed an extraction workflow called “ReactionSeek”, which is capable of directly interpreting reaction scheme images using a multimodal LLM (GLM-4V) and achieved an accuracy of 91.5% when tested on a set of 42 diverse images.28 These multimodal expansions effectively broaden the scope of accessible data and enable a more comprehensive understanding of scientific information. It is worth noting that the models employed here are open-source and demonstrate impressive performance, highlighting the growing potential of community-driven models in specialised scientific applications.
Towards the development of MOFs, Pruyn et al. developed “MOF-ChemUnity” (Fig. 1), which not only extracts key information such as material properties and synthesis procedures, but also links the various names used for these materials to their corresponding co-reference names and crystal structures.6 This linkage bridges textual synthesis and property knowledge with atomic-level structural insight. Finally, the mined datasets form a knowledge graph that serves as a structured, scalable, and queryable foundation for materials discovery. However, while this approach captures important details such as the overall synthesis duration, it only extracts static attributes and ignores the sequential order of, and relationships among, synthesis actions. The work by Zhao et al. directly targets this gap by presenting a “sequence-aware” extraction that captures the step-by-step experimental workflow as a directed graph, where each node represents an action (e.g., “mix”, “heat”, “filter”) and edges define the experimental sequence.29 This workflow achieved high F1-scores for both entity (0.96) and relation (0.94) extraction. All of these studies highlight a significant shift from simple data extraction toward creating dynamic, AI-ready knowledge bases that enable sophisticated, data-driven discovery.
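The directed-graph representation of a synthesis sequence can be sketched in a few lines. This is a deliberately simplified, hypothetical schema (node = action label, edge = temporal order); the actual representation used by Zhao et al. is considerably richer.

```python
from dataclasses import dataclass, field

# A minimal sketch of a sequence-aware synthesis graph, assuming a toy
# schema: each node is an action label and each edge encodes "happens
# before". The real extraction of Zhao et al. attaches entities
# (reagents, conditions) to nodes as well.
@dataclass
class SynthesisGraph:
    actions: list = field(default_factory=list)   # node labels, e.g. "mix"
    edges: list = field(default_factory=list)     # (from_index, to_index)

    def add_action(self, label: str) -> int:
        """Append an action node, chaining it after the previous step."""
        self.actions.append(label)
        idx = len(self.actions) - 1
        if idx > 0:
            self.edges.append((idx - 1, idx))
        return idx

g = SynthesisGraph()
for step in ["mix", "heat", "cool", "filter", "dry"]:
    g.add_action(step)

print(g.actions)  # ['mix', 'heat', 'cool', 'filter', 'dry']
print(g.edges)    # [(0, 1), (1, 2), (2, 3), (3, 4)]
```

Unlike a bag of static attributes, this structure preserves the ordering information that determines whether, for example, heating occurs before or after filtration.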
Fig. 1 The MOF-ChemUnity workflow. LLMs are used to link publications and CSD entries by extracting experimental properties and matching compound names across literature and structure files. This structured data populates the knowledge graph, which combines synthesis, applications and more. Reproduced from ref. 6, licensed under CC BY-NC 4.0.
To demonstrate the performance of open-source models on these data extraction tasks, we reproduced the benchmark for six synthesis conditions provided by the MOF-ChemUnity code repository.30 The models tested included the Qwen3 and GLM-4.5 series, featuring both dense and Mixture-of-Experts architectures with sizes ranging from 14B to 355B parameters. As shown in Fig. 2, most models achieved accuracies exceeding 90%, with the largest model reaching 100%. This highlights the strong potential of open-source models for data-mining tasks. Notably, small models such as Qwen3-32B yielded an accuracy of 94.7%, suggesting that compact models can also handle the task effectively. This is significant as these smaller models require far fewer computational resources; Qwen3-32B, for instance, can be readily deployed on a standard Mac Studio with an M2 Ultra or M3 Max chip. The original study divided full-text literature into smaller chunks during pre-processing to identify relevant experimental paragraphs, which were then fed to GPT-4o for extraction. While this approach helps narrow the search space, it inevitably results in some loss of contextual detail, potentially omitting valuable features. In contrast, we processed entire papers using open-source models, which enabled the capture of additional information dispersed throughout the document that chunk-based approaches may miss.
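To make the extraction setup concrete, the sketch below shows how a locally served open-source model behind an OpenAI-compatible endpoint (e.g. vLLM) could be prompted to return synthesis conditions as JSON. The field names, prompt wording, and endpoint details are all illustrative assumptions, not the actual prompts or schema from the MOF-ChemUnity repository.

```python
import json

# Hypothetical condition fields; the benchmark's real six fields may differ.
FIELDS = ["solvent", "temperature", "time", "metal_salt", "linker", "modulator"]

def build_extraction_prompt(paper_text: str) -> str:
    """Assemble a JSON-schema-constrained extraction prompt for the full paper."""
    schema = json.dumps({f: "string or null" for f in FIELDS}, indent=2)
    return (
        "Extract the following synthesis conditions from the paper below.\n"
        f"Respond only with JSON matching this schema:\n{schema}\n\n"
        f"Paper:\n{paper_text}"
    )

prompt = build_extraction_prompt("The MOF was synthesised in DMF at 120 °C for 24 h.")

# The prompt would then be sent to the locally served model, e.g.
# (illustrative, assumes a vLLM server at this address):
# client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
# reply = client.chat.completions.create(
#     model="Qwen3-32B", messages=[{"role": "user", "content": prompt}])
print("solvent" in prompt)  # True
```

Passing the whole paper rather than retrieved chunks is what allows information scattered across sections to be captured, at the cost of a longer context.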
Further, the work by Liu et al. demonstrated predictive performance for MOF properties by providing compositions as well as high-level structural features, such as node connectivity and topology, through rich natural-language descriptions.32 The fine-tuned model achieved 94.8% accuracy in predicting hydrogen storage performance, a substantial 46.7% improvement over models using only the precursor names.
To better encode comprehensive atomic-level details for LLMs, Song and colleagues proposed a new material representation format called “Material String” (see Fig. 3), which is designed to be significantly shorter and more information-dense than standard crystal structure files like CIF or POSCAR.33 This atomic-level representation encodes essential structural details such as space group, lattice parameters, and Wyckoff positions, allowing the complete mathematical reconstruction of a material's primitive cell in 3D. The fine-tuned model showed remarkable accuracy on the synthesisability test (98.6%). More importantly, it exhibited excellent generalisation, maintaining an average accuracy of 97.8% even when tested on complex experimental structures with up to 275 atoms, far beyond the 40-atom limit of its training data. The model also achieved impressive performance on prediction of synthesis routes (91.0%). All of these results collectively underscore the ability of LLMs to learn and capture complex structural patterns or property features. They also highlight the importance of incorporating high-quality structural representations as input features to enable more reliable and physically meaningful predictions.
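To give a flavour of why such an encoding is token-efficient, the sketch below packs a space group, lattice parameters, and Wyckoff-position occupancies into one short line. This format is our own illustrative invention, not the actual Material String syntax of ref. 33, which differs in its details.

```python
# Illustrative compact encoding in the spirit of a "material string":
# space group number, six lattice parameters, and element@Wyckoff pairs.
# A full CIF for the same crystal would run to dozens of lines.
def encode_material(spacegroup: int, lattice, wyckoff_sites) -> str:
    a, b, c, alpha, beta, gamma = lattice
    lat = f"{a:g},{b:g},{c:g},{alpha:g},{beta:g},{gamma:g}"
    sites = ";".join(f"{el}@{w}" for el, w in wyckoff_sites)
    return f"SG{spacegroup}|{lat}|{sites}"

# Rock-salt NaCl: space group 225, cubic cell, Na on 4a and Cl on 4b.
s = encode_material(225, (5.64, 5.64, 5.64, 90, 90, 90),
                    [("Na", "4a"), ("Cl", "4b")])
print(s)  # SG225|5.64,5.64,5.64,90,90,90|Na@4a;Cl@4b
```

Because space group plus Wyckoff positions determine all symmetry-equivalent atoms, a string of this kind still permits full reconstruction of the primitive cell while consuming far fewer tokens than a verbose structure file.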
Fig. 3 Framework for predicting material synthesisability and synthesis routes (a). t-Distributed Stochastic Neighbor Embedding (t-SNE) visualisation of the material structures used in the dataset, which combined both experimental structures and non-synthesisable structures (b). A material string encoding structural data is used to train a “Synthesizability LLM” (c). Reproduced from ref. 33, licensed under CC BY-NC-ND 4.0.
To evaluate the capability of open-source models on prediction tasks, we fine-tuned three models of varying sizes and architectures on the training dataset provided by L2M3.34 As the official test set was unavailable, we further split the training dataset into 85% (4990 samples) and 15% (1039 samples) for training and evaluation, respectively. We employed Low-Rank Adaptation (LoRA) with a rank of 32 for efficient fine-tuning, which enabled the largest model tested, GLM-4.5-Air, to fit within four AMD Instinct MI250X accelerators. When combined with 4-bit quantisation, the fine-tuning could be accommodated on only two MI250X with a minor loss in accuracy.35 All models achieved a median score identical to that reported for GPT-4o (Fig. 4).
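The memory savings from LoRA follow directly from its low-rank update rule: the frozen weight W is augmented by a trainable product, W_eff = W + (α/r)·BA, so only the small factors B and A are trained. A minimal numpy sketch (toy layer sizes, not the GLM-4.5-Air dimensions):

```python
import numpy as np

# Minimal sketch of the LoRA update W_eff = W + (alpha / r) * B @ A.
# W is frozen; only the rank-r factors A and B receive gradients.
rng = np.random.default_rng(0)
d, k, r, alpha = 1024, 1024, 32, 64   # toy layer; rank 32 as in our runs

W = rng.standard_normal((d, k))          # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable down/up-projection factor
B = np.zeros((d, r))                     # zero-initialised: W_eff == W at start

W_eff = W + (alpha / r) * (B @ A)
assert np.allclose(W_eff, W)             # no behavioural change before training

# Trainable-parameter fraction for this layer at rank 32:
full, lora = d * k, r * (d + k)
print(f"trainable fraction: {lora / full:.4f}")  # trainable fraction: 0.0625
```

Initialising B to zero guarantees the adapted model starts exactly at the pre-trained behaviour, and the tiny trainable fraction is what lets a 100B+ parameter model be fine-tuned on a handful of accelerators.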
Although these models achieved similar performance, closer inspection revealed that the dataset is highly imbalanced (see SI). This imbalance can bias models toward the majority class, producing deceptively high accuracy while failing to learn meaningful patterns in the minority class. We also observed that one component of the similarity metric simply repeats the input precursors, yielding nearly 100% accuracy and further inflating the median score. These findings suggest that the models may memorise the most frequently occurring entries. Therefore, it is difficult to determine whether they genuinely learn the correlations between material formula strings and properties, or merely exploit statistical frequency patterns in the dataset. This analysis also underscores the critical need for transparent reporting and data sharing. We observed that most studies in this domain lack sufficient detail for experimental reproduction, making it challenging to validate results or build upon existing work.
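The inflation effect is easy to demonstrate with synthetic labels (the class names and 90/10 split below are invented for illustration, not taken from the L2M3 data): a degenerate model that always predicts the majority class scores well on accuracy while never recovering the minority class.

```python
from collections import Counter

# Synthetic, deliberately imbalanced label set: 90% of one class.
labels = ["solvothermal"] * 90 + ["mechanochemical"] * 10

# Degenerate "model": always predict the most frequent class.
majority = Counter(labels).most_common(1)[0][0]
predictions = [majority] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_recall = sum(
    p == y for p, y in zip(predictions, labels) if y == "mechanochemical"
) / 10

print(accuracy)         # 0.9 (deceptively strong)
print(minority_recall)  # 0.0 (the minority class is never predicted)
```

This is why per-class metrics, rather than a single median score, are needed to judge whether a fine-tuned model has learned anything beyond label frequencies.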
One of the most impactful roles of LLMs is supporting researchers in the exploration and refinement of their research ideas. As demonstrated by the SciAgents framework, LLMs can navigate a vast knowledge base to uncover previously unseen connections between disparate scientific concepts, which leads to the generation of novel hypotheses.37 In addition, this framework can iteratively refine and elaborate on initial concepts. Different AI agents within the SciAgents framework can expand upon a hypothesis by adding quantitative details, suggesting specific modelling or simulation priorities, and providing comprehensive critiques that identify strengths, weaknesses, and areas for improvement. This feedback loop effectively mimics and accelerates the traditional scientific process of discussion and peer review, ensuring that the resulting ideas are not only innovative but also scientifically rigorous.
LLMs also hold strong potential to function as central coordinators or even cognitive engines, bridging researchers with complex computational tools in the pre-experimental stage. By leveraging their reasoning capability, LLM-based agents can understand and process queries in natural language, eliminating the need for rigid, formal syntax or other technical knowledge for using a computational tool. The ChatMOF system exemplifies this by orchestrating a sophisticated pipeline built on three core components: an agent, a toolkit, and an evaluator.38 The toolkit combines recognised databases such as QMOF and CoREMOF with a machine learning model that predicts material properties (e.g., hydrogen diffusivity) and a genetic algorithm tool for generating new combinations of materials. When a user submits a query, the agent (powered by an LLM like GPT-4) functions as the “brain”. It analyses the request, formulates a multi-step plan to solve the problem, and selects the appropriate instrument from its toolkit. The evaluator then assesses the output from the tool and synthesises it into a final, coherent answer for the user.
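The agent–toolkit–evaluator pattern can be reduced to a short control loop. Everything in this sketch is a toy stand-in: the tool functions are stubs, and the keyword-based routing replaces the LLM reasoning that ChatMOF actually performs.

```python
# Toy stand-ins for ChatMOF-style tools (real system: QMOF/CoREMOF
# lookups, an ML property predictor, and a genetic algorithm).
def search_database(query: str) -> str:
    return f"database hit for '{query}'"

def predict_property(query: str) -> str:
    return f"predicted property for '{query}'"

TOOLKIT = {"search": search_database, "predict": predict_property}

def agent(query: str) -> str:
    """Toy 'brain': pick a tool by keyword instead of LLM planning."""
    tool = "predict" if "predict" in query.lower() else "search"
    return TOOLKIT[tool](query)

def evaluator(result: str) -> str:
    """Assess the tool output and wrap it into a final answer."""
    return f"Answer: {result}"

answer = evaluator(agent("Predict hydrogen diffusivity of MOF-5"))
print(answer)
```

The design point is the separation of concerns: the agent plans and routes, each tool does one thing, and the evaluator turns raw tool output into a user-facing answer, so tools can be swapped without touching the planning logic.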
A related system, QUASAR, extends this coordinator paradigm into first-principles simulation workflows.39 Rather than focusing on database querying and ML-driven property prediction, QUASAR is designed to deal directly with quantum and atomistic simulation. Upon receiving a task description such as calculating a band gap, the Strategist agent decomposes the request into a structured execution plan. The Operator agent then interprets each sub-task, performing detailed technical reasoning to construct validated input files for engines like Quantum ESPRESSO and LAMMPS. Crucially, the system functions not merely as an executor but as an execution-aware controller: it monitors runtime behaviour, diagnoses configuration issues (e.g., inappropriate energy cutoffs or convergence thresholds), applies corrective adjustments, and post-processes raw outputs into scientifically interpretable results.
Continuing the development of multi-agent systems, their scope has been further refined to focus on material discovery and optimisation. Zheng et al. developed a ChatGPT research group where seven distinct LLM-based assistants collaborate with a single human researcher, who only needs to specify the research objective through prompts.40 As a result, this system successfully accelerated the identification of optimal synthesis conditions for MOFs and covalent-organic frameworks (COFs) by coupling AI agents with Bayesian optimisation, balancing the exploration and exploitation of a vast parameter space and reducing millions of potential conditions to a manageable number. In a related effort focused on de novo discovery, MOFGen was presented as a system of agentic AI dedicated to discovering novel, synthesisable MOFs.41 This system employs a pipeline of specialised agents: LinkerGen, an LLM that proposes novel compositions; a diffusion model that generates 3D crystal structures; and other agents that perform quantum mechanical filtering and synthesisability analysis. This generative approach led to the successful synthesis of five “AI-dreamt” MOFs. These “vertical” applications demonstrate how such agentic structures can be specialised to solve deeper problems within a single domain. Moreover, all of these systems highlight a core design principle: their strength lies not in the LLM alone, but in its ability to plan, delegate, and distribute sub-tasks across the entire research environment.
Moreover, coupling LLM agents with laboratory automation and robotic synthesis platforms can close the loop between computation and experiment. In the work by Boiko et al., an AI system named Coscientist was developed to autonomously design, plan, and perform complex experiments.42 Driven by GPT-4, the system coordinates a suite of tools for internet and documentation search, code execution, and experimental automation (Fig. 5). Crucially, Coscientist extended beyond in silico planning by directly controlling robotic liquid handlers for precise reagent transfers and managing heater-shaker modules to regulate reaction temperatures and mixing speeds. Its capabilities were demonstrated through the autonomous synthesis of organic compounds such as biphenyl and tolane. Similarly, Song et al. introduced ChemAgents, a multi-agent system powered by Llama-3.1-70B.43 This system also features a central Task Manager that coordinates four highly specialised agents: Literature Reader, Experiment Designer, Computation Performer, and Robot Operator. Each agent is explicitly linked to a foundational resource, such as a literature database or an automated lab, making it a highly flexible framework able to execute tasks ranging from literature review to robotic operation. Demonstrated tasks include FTIR characterisation of azobenzene molecules and the synthesis and PXRD characterisation of six metal oxides (including ZrO2 and ZnO). Notably, ChemAgents physically executes these experiments through Python-based Robot APIs, enabling coordinated control of a fully mobile robot and a benchtop arm across 20 automated stations for operations such as solid weighing, liquid transfer, and photocatalytic performance evaluation. Both Coscientist and ChemAgents illustrate the vision of leveraging LLM agents as general-purpose, “horizontal” platforms that can interface directly with experimental hardware through programmatic control.
Fig. 5 A comprehensive system featuring a central LLM-based “Planner” that orchestrates and manages the entire research workflow. Reproduced from ref. 42, licensed under CC BY 4.0.
The emergence of agentic systems marks a significant step toward the future of autonomous research. However, some crucial aspects must be considered to ensure robustness and reliability. At the computational level, the primary hurdle is the non-deterministic nature of LLMs. Unlike traditional hard-coded simulation workflows, an agentic system may generate slightly different reasoning paths or code structures for the same scientific query across multiple runs. In this sense, achieving a research objective is similar to solving a puzzle where the solution is not immediately apparent and multiple plausible paths must be considered. Therefore, these systems must possess the ability to self-correct by identifying when they have reached a dead end and recovering gracefully to maintain progress toward a valid solution. For example, QUASAR's architecture includes an Evaluator agent that assesses the scientific rigour of completed tasks and fixes the problems it identifies, forming a feedback loop. Similarly, ChemAgents employs a hierarchical reflection and correction mechanism. Instead of relying on a single output, the system uses “critic” and “proofreader” agents to iteratively review, critique, and improve experimental procedures and robot code against predefined expert rules.
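The generate–critique–revise loop underlying such self-correction can be sketched schematically. Here both the draft generator and the critic are deterministic stand-ins for LLM agents, with a known flaw planted in the early drafts so the loop has something to catch.

```python
# Schematic sketch of a self-correcting agent loop: draft a procedure,
# have a critic list problems, and retry until the critique is empty.
def draft_procedure(attempt: int) -> dict:
    # Stand-in generator whose first two attempts contain a planted flaw.
    return {"temperature_C": 120, "time_h": 24 if attempt >= 2 else 0}

def critic(procedure: dict) -> list:
    """Return a list of problems; an empty list means the draft passes."""
    issues = []
    if procedure["time_h"] <= 0:
        issues.append("reaction time must be positive")
    return issues

def self_correcting_run(max_attempts: int = 5):
    for attempt in range(max_attempts):
        proc = draft_procedure(attempt)
        if not critic(proc):
            return proc, attempt
    raise RuntimeError("no valid procedure found within the attempt budget")

proc, attempts = self_correcting_run()
print(attempts)  # 2 (two rejected drafts before the critic passes one)
print(proc)      # {'temperature_C': 120, 'time_h': 24}
```

The bounded attempt budget is the key safeguard: it is what lets the system "fail gracefully" rather than loop indefinitely when no valid solution is reachable.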
At the physical level, the stakes of non-determinism are significantly higher. In systems like Coscientist, a minor variation in an LLM's interpreted instruction could result in an irreversible physical error, such as an incorrect reagent transfer or a mechanical collision. Because physical lab environments lack an “undo” function, orchestration must incorporate multimodal feedback loops. This means the Robot Operator agent cannot rely solely on the success of a Python API call; it must utilise computer vision and sensor data to verify that a vial is correctly seated or that a liquid transfer actually occurred. Moreover, the system must be execution-aware, capable of responding to asynchronous interruptions such as depleted reagents or hardware malfunctions without causing the entire multi-agent workflow to fail.
Finally, it is worth noting that the LLM used in ChemAgents, Llama-3.1-70B, is an open-source model. This represents a meaningful design choice, as most existing LLM-based agents in this domain continue to rely on closed-source systems, which involve certain trade-offs. Commercial models can incur significant financial costs, and accessing them through APIs can expose systems to instability and security risks. APIs are prone to outages, disruptions, and unexpected updates that may compromise reliability, and their dependence on internet connectivity makes it difficult to ensure data privacy and compliance with confidentiality protocols, which is a major concern for sensitive experiments or proprietary research data. In contrast, the rapid advancement of open-source LLMs is beginning to change this landscape. Models such as GLM-4.5 and Qwen3-235B-Thinking-2507 already demonstrate agentic and reasoning performance comparable to that of leading closed-source counterparts.25 Therefore, selecting an appropriate system should strike a balance between performance, resource availability, and operational independence, and open-source development may ultimately pave the way for sustainable, locally deployable autonomous research systems.
Despite the rapid advancement of these agentic frameworks, coupling autonomous AI with physical laboratory hardware presents significant safety and operational risks. At the core of these challenges is the lack of deterministic safety bounds; unlike traditional industrial robots that operate on fixed, pre-verified paths, LLM-based agents generate “reasoning-driven” instructions that can be unpredictable or non-reproducible. This stochastic execution introduces the risk of catastrophic hardware collisions or hazardous chemical spills if the agent misinterprets a sensor reading or fails to account for the physical constraints of a benchtop arm. In addition, current robotic platforms remain fundamentally limited in their ability to handle unstructured anomalies, such as a cracked vial or a slightly misaligned microplate. These seemingly minor deviations can propagate uncertainty throughout the workflow, often forcing execution into a single-direction sequence with limited capacity for adaptive recovery.
Additionally, we would like to highlight some downsides of open-source models. Unlike pay-as-you-go proprietary models, local deployment requires substantial upfront investment in high-performance hardware to meet the VRAM demands of larger models. For instance, the largest model tested in this review, GLM-4.5 (355B total parameters), requires approximately 823 GB of VRAM for inference. Beyond hardware costs, the operational overhead, including the specialised expertise needed for model inference and fine-tuning, can pose a substantial barrier. These total cost of ownership considerations may outweigh the savings from avoiding API fees, particularly for researchers who do not require high-throughput processing or the stringent data privacy guarantees that local hosting provides.
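The VRAM figure quoted above can be sanity-checked with back-of-the-envelope arithmetic: 16-bit weights alone for a 355B-parameter model occupy roughly 710 GB, so the ~823 GB total additionally accounts for the KV cache and activations. (The sketch uses decimal GB, 1 GB = 10^9 bytes.)

```python
# Back-of-the-envelope weight-memory estimate for LLM inference.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

bf16 = weight_memory_gb(355e9, 2)    # BF16/FP16: 2 bytes per parameter
int4 = weight_memory_gb(355e9, 0.5)  # 4-bit quantised: 0.5 bytes per parameter

print(round(bf16))  # 710
print(round(int4))  # 178
```

The same arithmetic explains why 4-bit quantisation halves the accelerator count in our fine-tuning experiments: weight memory drops by a factor of four before runtime overheads are added back.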
However, evaluating the reliability of highly autonomous systems remains a challenge. While recent studies have attempted to benchmark the agentic abilities of LLMs, such as tool calling,44 there is not yet a comprehensive framework for assessing performance beyond one-step reasoning accuracy. Complex agentic systems require the integration of planning, execution, and adaptive decision-making, which current benchmarks do not capture. Moreover, existing metrics predominantly measure procedural correctness rather than the holistic resilience of reasoning when confronted with uncertainty or failure. Developing new benchmarks is therefore crucial to clarify the true competency of models acting as “researchers”. Such frameworks would not only guide the need for domain-specific fine-tuning, but also be essential for building trustworthy autonomous systems that can be confidently adopted in real-world scientific research.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5dd00499c.
This journal is © The Royal Society of Chemistry 2026