Open Access Article
Shuai Lv,^a Lei Peng,^a Shizhe Jiao,^b Yufan Yao,^b Wentiao Wu^a and Wei Hu*^a
^a School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, Anhui 230026, China. E-mail: whuustc@ustc.edu.cn
^b Hefei National Research Center for Physical Sciences at the Microscale, University of Science and Technology of China, Hefei, Anhui 230026, China
First published on 23rd April 2026
Large language models (LLMs) have emerged as powerful tools for general-purpose tasks, but their performance in domain-specific applications, particularly in material property prediction, is still constrained by limited access to specialized knowledge. To overcome this limitation, we introduce ChatMat, an artificial intelligence (AI) chemist powered by a multi-agent system, capable of performing complex material property predictions with minimal human intervention. By leveraging LLMs such as GPT-4o or local foundation models, ChatMat autonomously interprets unstructured textual prompts, plans scientific procedures, and executes complex materials workflows from data retrieval to simulation and modeling. The system is orchestrated by a Manager agent, which interfaces with human researchers and coordinates four role-specific agents: Property Depositor, Computing Designer, density functional theory (DFT) Operator, and Machine Learning-driven Potential Energy Surface (ML-PES) Performer. This modular, multi-agent architecture enables the seamless integration of data-driven and physics-based techniques, establishing a robust, autonomous pipeline for material prediction and novel material exploration. We demonstrate the versatility and efficacy of ChatMat through four experimental tasks of increasing complexity: structure generation, charge density distribution acquisition, database operation, and ML-PES construction. Furthermore, we design a set of quantitative evaluation metrics to benchmark its performance, illustrating ChatMat's reliability and adaptability across diverse materials domains. Our work bridges the gap between autonomous experimental research and computational science, showcasing the potential of domain-specific autonomous research to accelerate material prediction and exploration.
Despite their remarkable success in open-domain applications, LLMs encounter challenges in specialized, domain-specific tasks, such as material-property prediction.13,14 Unlike their rapid adoption in fields like medicine,15–18 LLMs have had a limited impact on property prediction.19 Four key challenges contribute to this gap: (i) material systems lack representations that convey their structural complexity in text-compatible forms, which restricts the expressiveness of LLMs;20,21 (ii) the scarcity of domain-specific datasets prevents LLMs from capturing the intricate physico-chemical relationships that define materials;7,22 (iii) laboratory workflows are still largely manual, and automation in materials research lags behind other scientific domains;23 and (iv) LLMs and machine learning-based models inherently struggle to incorporate true physical and chemical laws, limiting their ability to fully replicate real-world material behavior beyond surface-level correlations.24–26 These challenges underscore the need for an LLM-based agent capable of effectively predicting material properties while simultaneously exploring new materials and scientific hypotheses with minimal human intervention.
A promising solution lies in the emergence of autonomous agents that augment LLMs with domain-specific tools and plugins.27 Early attempts to adapt large language models to scientific problems focused on prompt design enriched with domain-specific knowledge.8,28–33 However, this approach remains constrained to static textual priors.34 Furthermore, such prompt-based approaches are often unable to adequately execute the computational or experimental procedures necessary for real-world discovery. For instance, early tool-integration strategies33 exhibit an over-reliance on pre-existing databases, resulting in a lack of flexibility to dynamically route tasks and reduce computational overhead.
Moving beyond static prompting, several studies have proposed hierarchical agent architectures that integrate an LLM “brain” with specialized tools.35–39 Recent developments in autonomous agent-based systems have underscored the need for specialized tools that integrate LLMs with expert-driven workflows. At the same time, despite the advancements in LLM adoption, property prediction in computational materials science remains largely dependent on two computational paradigms: density functional theory (DFT) simulations and machine learning-driven potential energy surfaces (ML-PESs).40,41 While DFT simulations, which solve the electronic Schrödinger equation, are widely applicable, they are computationally expensive.42 ML-PESs offer considerable speed advantages, but their generalization is limited by their dependence on training data, and they rely heavily on DFT baselines for accuracy.43,44 Therefore, a balance between these methods is essential for accurate materials property prediction.
To address these gaps, we introduce ChatMat, an autonomous multi-agent AI chemist that integrates DFT simulations and ML-PESs through a unified LLM interface. By seamlessly alternating between DFT and ML evaluations, ChatMat provides accurate and efficient property predictions without the need for expert intervention. The system consists of four specialized sub-agents (Property Depositor, Computing Designer, DFT Operator, and ML-PES Performer), which collaborate under a central Manager (Fig. 1) to plan, execute, and validate results. This division of labor covers the property-prediction pipeline end to end while maintaining flexibility for future expansions, allowing new tools to be integrated with minimal engineering effort. In doing so, ChatMat reduces the barrier to accurate materials modeling, making it accessible to both specialists and non-experts alike, and markedly accelerates material discovery through both prediction and exploration.
By leveraging the capabilities of the LLM and LangChain (https://github.com/langchain-ai/langchain), the Manager agent interacts with human researchers using natural language. It interprets the demand description, decomposes tasks, and instructs the role-specific agents to perform the specified operations autonomously, thereby facilitating an automated, decentralized experimental workflow based on the input description. As shown in Fig. 2(a), this multi-agent collaborative process requires the Manager's LLM to reason about the task's current state, assess its relevance to the final goal, and plan the next steps accordingly, thereby demonstrating its level of understanding. After reasoning in the “Thought” step, the Manager selects a specific subordinate agent (preceded by the keyword “Action”) and generates the corresponding instruction for this agent (preceded by the keyword “Action Input”). The Manager's text generation then pauses, and the system delegates the execution to the requested agent using the provided input. The result is then returned to the Manager's LLM, preceded by the keyword “Evaluate”, and the Manager proceeds to the “Thought” step again. This iterative multi-agent coordination continues until the final answer is reached. This reasoning loop operates synchronously in that the Manager halts its generation process while the worker agent performs the task and awaits the return signal. This design facilitates robust dynamic error handling. If a subordinate agent's execution fails, the error log is captured and returned as the “Evaluate” signal. The Manager then detects the failure and autonomously attempts to devise a corrective plan by either re-instructing the current agent or routing the task to an alternative agent. A maximum retry limit (set to three in our experiments) is enforced to prevent infinite loops, ensuring that the system either self-corrects or halts to report persistent issues.
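The Thought / Action / Action Input / Evaluate cycle described above can be illustrated with a minimal Python sketch. It is not the ChatMat implementation: the `Step` record, the `run_manager` function, the fixed `plan` (standing in for the plan an LLM would generate step by step), the `fallback` table, and the two toy agents are all hypothetical names introduced here. Only the keywords, the synchronous hand-off, and the retry limit of three come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

MAX_RETRIES = 3  # retry limit stated in the text


@dataclass
class Step:
    thought: str       # the Manager's reasoning for this step
    action: str        # name of the sub-agent to call
    action_input: str  # instruction passed to that sub-agent


def run_manager(plan: List[Step],
                agents: Dict[str, Callable[[str], str]],
                fallback: Dict[str, str]) -> Tuple[str, List[str]]:
    """Run each step synchronously; on failure, retry or reroute (hypothetical policy)."""
    trace: List[str] = []
    result = ""
    for step in plan:
        agent_name = step.action
        for _attempt in range(MAX_RETRIES):
            trace.append(f"Thought: {step.thought}")
            trace.append(f"Action: {agent_name}")
            trace.append(f"Action Input: {step.action_input}")
            try:
                # The Manager blocks here until the sub-agent returns.
                result = agents[agent_name](step.action_input)
                trace.append(f"Evaluate: {result}")
                break
            except Exception as err:
                # The error log is fed back as the "Evaluate" signal.
                trace.append(f"Evaluate: ERROR {err}")
                # Reroute if an alternative agent is known, else retry the same one.
                agent_name = fallback.get(agent_name, agent_name)
        else:
            return "halted: persistent failure", trace
    return result, trace


# Toy agents, assumed for illustration only.
def failing_ml(instruction: str) -> str:
    raise RuntimeError("simulated failure")


def dft(instruction: str) -> str:
    return f"DFT finished: {instruction}"


plan = [Step("Try the ML model first.", "ML Performer", "total energy of (H2O)64")]
answer, trace = run_manager(plan,
                            {"ML Performer": failing_ml, "DFT Operator": dft},
                            fallback={"ML Performer": "DFT Operator"})
print(answer)
```

In this toy run the first attempt fails, the error is recorded under "Evaluate", and the second attempt is routed to the alternative agent. If no agent succeeds within three attempts, the loop halts and reports the persistent failure.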
Each role-specific agent operates with its own LLM instance(s) to function autonomously within its designated domain (Fig. 1). The Property Depositor functions as an autonomous coding and retrieval agent to write Python code and mine the Material Property Database. The Computing Designer receives instructions from the Manager regarding intended experiments, plans computational tasks, and generates step-by-step procedures. The DFT Operator acts as an independent execution agent that analyzes the structural file, converts it into the required format for implementation, and generates the DFT software startup code, which it then initiates. The ML Performer searches for suitable pre-trained machine learning models in the Model Library, utilizing them as provided or autonomously enhancing them by expanding the model and incorporating additional training data.
ChatMat leverages the multi-agent architecture depicted in Fig. 1 to address this challenge. When material property queries are submitted by users in textual form, ChatMat responds by coordinating these specialized agents to provide a detailed, scientifically accurate description related to the material in question.
Fig. 3 Detailed architectures of the four role-specific agents. (a) Property Depositor. (b) ML Performer. (c) DFT Operator. (d) Computing Designer. Insets depict the specific underlying computational tools, APIs, and model architectures integrated into each agent's workflow. ML: machine learning; DFT: density functional theory; API: application programming interface; ML-PES: machine learning-driven potential energy surface (panel (c) is adapted and extended from our prior framework33 to reflect the newly introduced multi-agent orchestration logic).
For material property queries where precalculated look-up table data are unavailable, computational simulations offer a valuable alternative. However, these simulations are time-consuming and resource-intensive. In previous single-agent systems, the LLM was tasked with manually selecting calculation tools, which frequently led to context dilution under complex tasks. The optimal resolution in ChatMat's decentralized architecture is to allow the Computing Designer to algorithmically judge and route the task to the appropriate specialized agent, as shown in Fig. 3(d). In this work, the Computing Designer determines whether to employ the DFT Operator or the ML Performer based on the following methodology.47
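Given a configuration $\mathcal{R}_t$, with t representing a continuous or discrete series of operations, we define the error indicator εt as the maximal standard deviation of the atomic force predicted by the model ensemble,
| $\varepsilon_t = \max_i \sqrt{\left\langle \lVert F_{w,i}(\mathcal{R}_t) - \langle F_{w,i}(\mathcal{R}_t) \rangle \rVert^2 \right\rangle}$, | (1) |
where $F_{w,i}(\mathcal{R}_t) = -\nabla_i E_w(\mathcal{R}_t)$ denotes the force on the atom with index i predicted by the model Ew and ∇i denotes the derivative with respect to the coordinate of the i-th atom. Both of the notations 〈…〉 in eqn (1) denote the expectation with respect to the ensemble of models and are estimated using the average of model predictions. For example, $\langle F_{w,i}(\mathcal{R}_t) \rangle$ is estimated using
| $\langle F_{w,i}(\mathcal{R}_t) \rangle \approx \frac{1}{N_m} \sum_{\alpha=1}^{N_m} F_{w_\alpha,i}(\mathcal{R}_t)$, | (2) |
where $N_m$ is the number of models in the ensemble.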
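The error indicator described above, the largest per-atom standard deviation of the force across the model ensemble, can be computed as in the following minimal sketch. The function name `force_deviation` and the nested-list layout `forces[m][i]` (model m, atom i, three Cartesian components) are assumptions made for illustration; a production workflow would more likely take these arrays directly from the ML-PES package.

```python
import math
from typing import List, Sequence

Vec3 = Sequence[float]


def force_deviation(forces: List[List[Vec3]]) -> float:
    """Return max over atoms of the ensemble standard deviation of the force.

    forces[m][i] is the force vector on atom i predicted by ensemble model m.
    Expectations are estimated by plain averages over the models.
    """
    n_models = len(forces)
    n_atoms = len(forces[0])
    max_dev = 0.0
    for i in range(n_atoms):
        # Ensemble-mean force on atom i, component by component.
        mean = [sum(forces[m][i][k] for m in range(n_models)) / n_models
                for k in range(3)]
        # Mean squared distance of each model's force from the ensemble mean.
        var = sum(
            sum((forces[m][i][k] - mean[k]) ** 2 for k in range(3))
            for m in range(n_models)
        ) / n_models
        max_dev = max(max_dev, math.sqrt(var))
    return max_dev


# Two models, two atoms: the models disagree on atom 0 and agree on atom 1.
example = [
    [(1.0, 0.0, 0.0), (0.2, 0.1, 0.0)],
    [(-1.0, 0.0, 0.0), (0.2, 0.1, 0.0)],
]
eps_t = force_deviation(example)
```

Because the maximum is taken over atoms, a single poorly described atom is enough to flag the whole configuration as uncertain.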
The ensemble model deviation serves as a robust indicator of the true prediction error, effectively distinguishing between the reliable interpolation regime and the unreliable extrapolation regime.48 In our multi-agent approach, the selection of the appropriate execution agent for predicting material properties is governed by a dynamic threshold σlo. This threshold is carefully chosen rather than arbitrarily determined.47 Specifically, σlo is set slightly above the ML Performer's intrinsic fitting error, defined as 1.1 times the training root-mean-square error (RMSE) of the forces. This margin accounts for the inherent noise in the training data while ensuring that the ML model is trusted only within its verified accuracy regime. This guarantees that the system is not overly confident in its predictions for configurations that are less reliable, while still leveraging the high-throughput capabilities of the ML Performer where appropriate.
To facilitate this autonomous routing, we design Algorithm 1 to classify molecular configurations into two sets: Rml for the ML Performer and Rdft for the DFT Operator. Configurations in Rml are those for which the predicted atomic forces fall within an acceptable error range, εt < σlo, ensuring that the pre-trained ML models can be confidently utilized by the ML Performer. Conversely, configurations in Rdft exhibit larger errors (εt ≥ σlo), indicating the need for the Computing Designer to escalate the task to the DFT Operator for more accurate first-principles calculations.
Therefore, for a new structure, the Computing Designer algorithmically determines whether it belongs to Rml or Rdft based on the computed error indicator εt. This decentralized decision-making process ensures that the most appropriate and computationally efficient sub-agent is chosen for each material property prediction, retaining system robustness without encumbering the Manager agent's primary reasoning loop.
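The routing rule can be sketched in a few lines of Python. This is an illustration of the stated criterion only, not the authors' Algorithm 1: the function names `sigma_lo` and `route`, the configuration labels, and the numerical values are hypothetical. The factor of 1.1 on the training force RMSE and the comparison εt < σlo are taken from the text.

```python
from typing import Dict, List, Tuple


def sigma_lo(train_force_rmse: float, margin: float = 1.1) -> float:
    """Threshold set slightly above the model's fitting error (1.1 x RMSE in the text)."""
    return margin * train_force_rmse


def route(deviations: Dict[str, float],
          threshold: float) -> Tuple[List[str], List[str]]:
    """Split configurations into R_ml (eps < threshold) and R_dft (eps >= threshold)."""
    r_ml: List[str] = []
    r_dft: List[str] = []
    for name, eps in deviations.items():
        if eps < threshold:
            r_ml.append(name)   # trusted regime: use the ML Performer
        else:
            r_dft.append(name)  # uncertain regime: escalate to the DFT Operator
    return r_ml, r_dft


# Hypothetical numbers: a training force RMSE of 0.05 (force units)
# and three configurations with precomputed error indicators.
threshold = sigma_lo(0.05)
r_ml, r_dft = route({"cfg_a": 0.02, "cfg_b": 0.06, "cfg_c": 0.30}, threshold)
print(r_ml, r_dft)
```

A configuration whose indicator equals the threshold exactly is sent to the DFT Operator, matching the εt ≥ σlo condition for Rdft.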
| ĤΨ = EΨ, | (3) |
![]() | (4) |
The Kohn–Sham equations, central to DFT, are given by
![]() | (5) |
![]() | (6) |
Additionally, ab initio molecular dynamics (AIMD) integrates these calculations into the equations of motion:
![]() | (7) |
![]() | (8) |
As shown in Fig. 3(b), due to its ability to efficiently handle high-dimensional potential energy surfaces and accurately model complex material behaviors through deep learning techniques, the DeePMD framework is employed autonomously by the ML Performer. ML-PES models provide substantial reductions in computational costs while retaining comparable accuracy to first-principles methods. By training on extensive datasets derived from DFT calculations, these machine learning models can learn complex relationships between molecular structures and their properties.43 A common approach involves representing the potential energy EPES(R) as a function learned by the ML model:
| EPES(R) ≈ ML Model(R), | (9) |
Techniques such as neural networks, Gaussian processes, and kernel ridge regression have been employed to develop PES models capable of predicting properties like binding energies, reaction rates, and molecular conformations with impressive speed. For instance, the Neural Network Potential (NNP) can be expressed as
| $E_{\mathrm{NNP}} = \sum_{i=1}^{N} \mathrm{NN}(G_i)$, | (10) |
where NN is the neural network function applied to the symmetry functions Gi, which represent the local environment of the i-th atom, and N is the total number of atoms in the system.
In this work, we utilize DeePMD, which employs deep neural networks to accurately represent the PES. It maps the symmetry functions to an energy contribution
| $E_i = \mathrm{NN}(G_i^{1}, G_i^{2}, \ldots, G_i^{M})$, | (11) |
where each $G_i^{k}$ ($k = 1, \ldots, M$) represents a different symmetry function for the i-th atom.
Then, the DeePMD model expresses the total energy as a sum of atomic contributions as
| $E = \sum_{i=1}^{N} E_i$. | (12) |
Fig. 2(b) shows schematically how tasks flow from the planned experimental procedures to the automated execution of DFT and ML calculations. Upon successful prediction, the ML Performer returns the computed properties to the Manager via the “Evaluate” signal, completing the multi-agent workflow without encumbering the Manager's cognitive capacity.
| Query | GPT-4o response | ChatMat response |
|---|---|---|
| Could you list the details of the atomic forces section for H2? | The atomic forces section for H2 typically refers to the forces acting on each atom in the molecule (incorrect) | The details of the atomic forces section for H2 are as follows: force for centroid (x) [a.u.]: −1.963 × 10⁻⁵, force for centroid (y) [a.u.]: 3.83 × 10⁻⁶ (correct) |
| What is the free energy of Al2O3? | I cannot directly provide the free energy value for Al2O3 (incorrect) | The free energy of Al2O3 is 29.8450654 eV (correct) |
| Could you tell me the force along z for Na2SO4? | I cannot provide a specific value (incorrect) | The force along the z-axis for Na2SO4 is 1.62197009 atomic units (a.u.) (correct) |
| Please construct a machine learning potential energy surface model for Al176Si24 | I cannot provide a specific result for constructing a machine learning potential energy surface model for Al176Si24 (incorrect) | I have completed the construction of the machine learning potential energy surface model for Al176Si24 and saved it in the model library (correct) |
| What does the structure of CaCO3 look like? | I cannot provide a specific result for the structure of CaCO3 (incorrect) | The structural information of CaCO3 has been obtained (correct) |
| What is the charge density distribution acquisition of BaSO4? | I cannot provide a specific value for the charge density distribution of BaSO4 (incorrect) | The charge density distribution acquisition of BaSO4 has been obtained (correct) |
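Subsequently, to systematically assess the automation capabilities, we designed four representative tasks crucial to materials property prediction workflows: Material Structure Acquisition (T1), Charge Density Distribution Acquisition (T2), Material ML-PES Construction (T3), and Material Property Database Operation (T4).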
We analyzed the performance of ChatMat across these tasks, with the results summarized in Fig. 4. For T1, the agent successfully constructed the precise 3D crystal structure of Al176Si24. Regarding T2, ChatMat executed the calculation workflow to derive the charge density for (H2O)64. For the complex T3, the system successfully completed the training of an ML-PES model for the large-scale (HClO4)16 supercell. Finally, for T4, ChatMat demonstrated its capability to autonomously build, operate, and maintain a comprehensive Material Property Database.
Each task was evaluated using a carefully constructed benchmark dataset consisting of 100 natural-language queries per task. These queries were designed to cover a diverse range of materials and properties to mitigate potential selection bias, thereby assessing ChatMat's autonomy, accuracy, and domain-specific competence. Because ChatMat integrates both DFT simulation software and ML-PES models, its hybrid architecture can perform complex materials science workflows that combine data-driven predictions with physics-based simulations.
Task outcomes were classified into three categories: True (responses that fully met the task requirements), Formatting False (responses containing errors in grammar, spelling, or formatting that impaired readability), and Factual False (responses with incorrect or logically inconsistent content). This labeling framework allows for a comprehensive and nuanced evaluation of ChatMat's reliability and effectiveness in a variety of scientific scenarios. In this work, ‘True’ refers to an accurate match within an acceptable approximate range.
To quantitatively assess performance, we adopted four evaluation metrics: success rate, factuality rate, formatting compliance rate, and composite reliability score. These metrics were redefined from standard classification metrics to better reflect the three-way outcome structure (True, Factual False, and Formatting False) of our specific task. They are calculated as follows:
![]() | (13) |
![]() | (14) |
![]() | (15) |
![]() | (16) |
These metrics were selected for their ability to capture complementary aspects of performance. Success rate provides a general measure of overall correctness. The factuality rate reflects the system's ability to avoid factual errors, which is critical in scientific and engineering applications where misinformation can lead to flawed conclusions. The formatting compliance rate captures linguistic and presentation quality by penalizing formatting issues that may hinder human interpretation. Finally, the composite reliability score, calculated as the harmonic mean of the factuality and formatting compliance rates, offers a conservative assessment of performance. Since the harmonic mean is mathematically biased toward the lower value, this metric ensures that the system is heavily penalized if it fails in either content validity or output specifications, thereby enforcing a rigorous standard for scientific reliability.
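To make the relationship between the four metrics concrete, the sketch below computes them from the three outcome counts. Only the composite score, the harmonic mean of the factuality and formatting compliance rates, is fixed by the description above. The other three definitions used here are assumptions chosen by analogy with accuracy, precision, and recall, and may differ from the article's eqn (13)–(15). The names `Counts`, `metrics`, and the example counts are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Counts:
    true: int              # responses that fully met the task requirements
    factual_false: int     # incorrect or logically inconsistent content
    formatting_false: int  # grammar, spelling, or formatting errors


def _ratio(num: float, den: float) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two rates; it is pulled toward the lower value."""
    return _ratio(2 * a * b, a + b)


def metrics(c: Counts) -> Dict[str, float]:
    total = c.true + c.factual_false + c.formatting_false
    # ASSUMED definitions (accuracy-, precision-, and recall-style analogues).
    success = _ratio(c.true, total)
    factuality = _ratio(c.true, c.true + c.factual_false)
    formatting = _ratio(c.true, c.true + c.formatting_false)
    return {
        "success_rate": success,
        "factuality_rate": factuality,
        "formatting_compliance_rate": formatting,
        # Stated in the text: harmonic mean of the two rates above.
        "composite_reliability_score": harmonic_mean(factuality, formatting),
    }


# Hypothetical outcome counts for one 100-query run.
m = metrics(Counts(true=94, factual_false=2, formatting_false=4))
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Whatever the exact definitions of the two component rates, the harmonic mean never exceeds their arithmetic mean and collapses to zero if either rate is zero, which is the conservative behavior the composite score is meant to provide.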
We conducted each task three times to minimize potential random errors. The total statistical results of the four tasks across the three runs are presented in Fig. 5. This figure provides a macroscopic view of the error distribution, categorizing outcomes into True, Factual False, and Formatting False to intuitively illustrate the system's overall failure modes. Complementing this, Fig. 6 offers a microscopic view by quantifying the specific performance metrics for each independent run side-by-side. This detailed breakdown demonstrates the system's stability and consistency across repeated experiments, capturing variations that are obscured in the aggregated distribution.
Fig. 5 Total statistical results of the four tasks after three runs, showing the distribution of errors categorized as True, Formatting False, and Factual False.
Across all tasks, ChatMat consistently demonstrated high performance, with the success rate ranging from 91% to 96%. In particular, tasks such as Material Property Database Operation and Material Structure Acquisition yielded the highest average success rates of 95% and 94%, respectively, highlighting ChatMat's strong capabilities in retrieving material properties and processing structural information. For more computationally demanding tasks like ML-PES Construction and Charge Density Distribution Acquisition, the system retained robust performance with success rates of approximately 92% and 93%.
Further analysis of factuality rate, with all task scores exceeding 95.8%, indicates that ChatMat effectively minimizes factual inaccuracies, underscoring its high scientific reliability. Similarly, the consistently high Formatting Compliance Rate scores reflect ChatMat's strong adherence to linguistic and formatting standards, which is an essential requirement for seamless integration into human-in-the-loop workflows. The resulting Composite Reliability Scores, all above 95.3%, confirm ChatMat's balanced performance in both producing scientifically valid content and presenting it in a clear, interpretable manner. These results collectively validate ChatMat as a reliable and effective intelligent agent for complex, domain-specific computational workflows.
Traditional methods often struggle to translate the complex structural and chemical characteristics of materials into data that can be effectively processed. ChatMat tackles this challenge by leveraging carefully designed prompts to interpret and convert complex material descriptions into usable data formats. By incorporating these specialized prompts, ChatMat ensures that both textual input and computational models are seamlessly integrated, facilitating the accurate generation of material structures and the execution of physics-based simulations. This was demonstrated in Material Structure Acquisition (T1) and Charge Density Distribution Acquisition (T2), where ChatMat autonomously generated accurate 3D molecular structures, effectively bridging the gap between textual input and computational predictions.
ChatMat mitigates the challenge of insufficient high-quality, domain-specific datasets in the chemical field by integrating ML-PES and DFT methods. This hybrid approach enables it to generate accurate predictions even when faced with sparse datasets, specifically in the cold start scenario where initial domain-specific training data are non-existent or minimal. The Material ML-PES Construction task (T3) demonstrates ChatMat's ability to autonomously manage datasets and select optimal hyperparameters for model training, overcoming the scarcity of high-quality data. Moreover, the integration of DFT simulations allows for the continuous generation of high-quality data. ChatMat's execution of the Material Property Database Operation task (T4) highlights its capacity to autonomously build, maintain, and refine material property databases, thereby expanding its knowledge base and improving its predictive accuracy over time.
A longstanding issue in computational materials science has been the heavy reliance on manual operations, resulting in workflows that lag behind the automation seen in other scientific disciplines. Researchers must therefore invest considerable time and effort in repetitive and tedious tasks when dealing with complex calculations and experimental processes, which greatly limits the efficiency of material exploration. As illustrated in Fig. 7, ChatMat autonomously executed complete end-to-end workflows, integrating database retrieval, ML-based property prediction, and DFT-level simulations to compute and compare total energies and centroid forces. The system's ability to autonomously coordinate these tasks exemplifies its robustness and flexibility in handling complex, real-world materials problems. Performing these tasks with minimal human oversight represents a transformative shift toward more efficient, automated workflows that could accelerate the pace of materials exploration.
Another longstanding challenge in computational materials science is the difficulty in incorporating true physical laws into machine learning models, often resulting in predictions that are accurate in some contexts but fail to generalize across complex or highly variable systems. By integrating DFT simulations, which solve the Schrödinger equation and other fundamental physical models, ChatMat ensures that predictions are derived from established physics while retaining the computational efficiency of ML methods. This hybrid approach enables ChatMat to strike a critical balance between computational efficiency and accuracy, addressing the common trade-off between these two factors in current agent-based systems for materials modeling.
An ablation study further demonstrates ChatMat's advantages by comparing it to traditional methods based solely on either ML-PES or DFT simulations using a benchmark set of 100 randomly selected material property queries. As summarized in Table 2, the ML-only agent achieved a limited success rate of 48%, constrained by the quality and representativeness of its training data. The DFT-only agent performed more accurately, achieving an 88% success rate, but suffered from high computational costs and limited scalability. Specifically, in unattended batch processing, 3% of calculations failed due to SCF non-convergence, 7% due to timeouts, and 2% due to I/O errors. In contrast, ChatMat achieved a superior success rate of 95% by integrating both ML and DFT techniques within a unified LLM-driven framework. This integration enables it to strike a critical balance between computational efficiency and quantum-level accuracy, effectively overcoming a central trade-off in current agent-based systems for materials modeling.
| Agent | Success rate | Formatting compliance rate | Factuality rate | Composite reliability score |
|---|---|---|---|---|
| ML-only | 48% | 46% | 49% | 42% |
| DFT-only | 88% | 80% | 83% | 86% |
| ChatMat | 95% | 91% | 93% | 94% |
To quantitatively substantiate these efficiency improvements, we evaluated the computational cost, measured using the number of DFT calls, across the aforementioned benchmark of 100 diverse queries. In the DFT-only agent, every property evaluation necessitates a full first-principles calculation. Accounting for the system's error-recovery mechanism, which allows a maximum of three retries per task, the DFT-only agent required an average of 1.15 DFT calls per query. By contrast, ChatMat leverages the Computing Designer agent to dynamically route tasks to the ML Performer when the predicted error falls within the confidence threshold. Through this routing, ChatMat reduced the average number of DFT calls to 0.35 per query, representing a computational cost saving of approximately 69.6%. Even in the worst-case scenario, such as encountering highly out-of-distribution structures where the ML model completely falls back to the DFT Operator and triggers maximum error retries, ChatMat is capped at exactly 3 DFT calls. This ensures a rigorous lower bound on accuracy without ever exceeding the DFT-only agent's maximum computational cost. This closed-loop quantification indicates that ChatMat enhances the overall success rate while reducing average computational overhead and retaining the established accuracy bounds of quantum mechanical simulations.
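The quoted saving follows directly from the two averages. As a quick arithmetic check, the sketch below reproduces it; the function name `relative_saving` is introduced here for illustration, while the values 1.15, 0.35, and the 100-query benchmark size come from the text.

```python
def relative_saving(baseline_calls: float, routed_calls: float) -> float:
    """Fractional reduction in DFT calls relative to the baseline agent."""
    return 1.0 - routed_calls / baseline_calls


dft_only_avg = 1.15  # average DFT calls per query, DFT-only agent
chatmat_avg = 0.35   # average DFT calls per query, ChatMat
n_queries = 100

saving_percent = round(100 * relative_saving(dft_only_avg, chatmat_avg), 1)
calls_avoided = round((dft_only_avg - chatmat_avg) * n_queries)

print(f"saving: {saving_percent}%")                      # saving: 69.6%
print(f"DFT calls avoided over benchmark: {calls_avoided}")  # 80
```

Over the 100-query benchmark this corresponds to roughly 115 DFT calls for the DFT-only agent against roughly 35 for ChatMat.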
By seamlessly integrating diverse computational paradigms, including machine learning, first-principles simulations, and LLM-driven orchestration, ChatMat delivers high-accuracy predictions while retaining computational efficiency. This hybrid architecture addresses key limitations of conventional methods and positions ChatMat as a powerful, efficient, and accessible tool for materials modeling. By supporting both domain experts and non-specialists, ChatMat facilitates collaborative and autonomous scientific exploration, thereby paving the way for accelerated innovation in computational materials science.
Supplementary information (SI): the framework outputs and the main agent setup. See DOI: https://doi.org/10.1039/d5dd00582e.
This journal is © The Royal Society of Chemistry 2026