Siyu Liu ab, Tongqi Wen *ab, Beilin Ye a, Zhuoyuan Li ab, Han Liu c, Yang Ren c and David J. Srolovitz *ab
aCenter for Structural Materials, Department of Mechanical Engineering, The University of Hong Kong, Hong Kong SAR, China. E-mail: tongqwen@hku.hk; srol@hku.hk
bMaterials Innovation Institute for Life Sciences and Energy (MILES), HKU-SIRI, Shenzhen, China
cDepartment of Physics, JC STEM Lab of Energy and Materials Physics, City University of Hong Kong, Hong Kong SAR, China
First published on 20th May 2025
Efficient and accurate prediction of material properties is critical for advancing materials design and applications. Leveraging the rapid progress of large language models (LLMs), we introduce ElaTBot, a domain-specific LLM for predicting elastic constant tensors and enabling materials discovery as a case study. The proposed ElaTBot LLM enables simultaneous prediction of elastic constant tensors, bulk modulus at finite temperatures, and the generation of new materials with targeted properties. Integrating general LLMs (GPT-4o) and Retrieval-Augmented Generation (RAG) further enhances its predictive capabilities. A specialized variant, ElaTBot-DFT, designed for 0 K elastic constant tensor prediction, reduces the prediction errors by 33.1% compared with a domain-specific, materials science LLM (Darwin) trained on the same dataset. This natural language-based approach highlights the broader potential of LLMs for material property predictions and inverse design. Their multitask capabilities lay the foundation for multimodal materials design, enabling more integrated and versatile exploration of material systems.
While experimental approaches for determining material properties remain the gold standard, they are often hindered by the expense and time required to synthesize materials and measure properties (and, at times, yield results that are inconsistent or insufficiently accurate), as in the case of elastic constant measurements.4 Recent advancements in simulation techniques and computational power have made computational modeling a critical tool for property prediction. Given the diversity of length and time scales that control material properties, multi-scale modeling has emerged as an often efficient and sufficiently accurate approach for materials property prediction. For example, atomistic simulations with quantum-mechanical accuracy can accurately predict full elastic constant tensors and/or band gaps (using hybrid exchange–correlation functionals), while phase-field modeling and other continuum-based methods enable microstructure evolution and defect property prediction. However, challenges (e.g., data transfer and error propagation) remain significant obstacles to achieving accurate, macroscopic predictions in multi-scale modeling frameworks. The emergence of large language models (LLMs) presents a new opportunity for materials property prediction, with the potential to close the gaps between experimental data (e.g., sourced from literature databases) and computational materials simulation approaches.5
LLMs such as ChatGPT have demonstrated remarkable successes across a wide range of materials applications, including high-throughput discovery of physical laws,6 generation of metal–organic frameworks (MOFs),7 design of chemical reaction workflows,8 determination of crystal structures (via the CIF, crystallographic information file, format),9 electron microscopy image analysis,10 and guiding automated experiments.11 LLMs achieve this by leveraging capabilities such as rapid literature summarization,12 prompt engineering,13 and/or integration with external tools.14 These capabilities can make them superior to traditional machine learning (ML) models, particularly when dealing with complex, multitask processes at scale.15 One major strength of LLMs is their foundation in natural language-based training, fine-tuning, and application, which lowers the barrier to entry for researchers without a strong background in computer science or coding.16 Moreover, the underlying pre-trained models encode extensive materials science knowledge, giving LLMs remarkable ability in cases where datasets are sparse (through transfer learning), an achievement that previously required highly specialized algorithms.17
Given the strong performance of LLMs on a wide range of low-dimensional classification and regression tasks in computer science,18 there is growing interest in leveraging LLMs to improve numerical property prediction in materials science. Recent studies, including LLM-Prop,19 CrysMMNet,20 and AtomGPT,21 illustrate two major strategies: LLM-Prop and CrysMMNet introduce architectural modifications of LLMs followed by fine-tuning, whereas AtomGPT preserves the original LLM architecture. Despite their methodological differences, all three approaches convert crystal structures into text descriptions and fine-tune the LLMs to predict individual material properties such as the band gap, formation energy, or bulk modulus. These studies demonstrate that text-based encoding of structural information can enhance predictive accuracy. Other recent studies have examined the impact of prompt design on LLM property prediction performance22 and benchmarked LLM-based methods against conventional models on out-of-distribution datasets;23 these comparisons highlight the value of prompt design in optimizing LLM materials property prediction performance. Although the aforementioned works show that LLMs can outperform traditional models in predicting certain scalar properties, there are also contrary results, especially for small datasets.24 For example, Jablonka et al.25 showed that while LLMs can predict properties such as HOMO–LUMO gaps, solubility, photoswitching behavior, solvation free energies, and photoconversion efficiency, the results were no better than those of traditional ML models. Enhancing the quantitative prediction capabilities of LLMs, while leveraging their strengths in natural language interaction and multitasking, can significantly expand their potential in materials science applications.
In this work, we focus on predicting the elastic constant tensor as a case study of quantitative material property prediction. The elastic constant tensor is a fundamental property that describes the elastic response of a material to external forces26 and serves as an indicator of the nature of the intrinsic bonding within a material.27 Mechanical (Young's modulus, Poisson's ratio,…), thermal (thermal conductivity), and acoustic (sound velocity) properties can all be derived from the elastic constant tensor28 (often together with other basic material properties). Here, we introduce ElaTBot and ElaTBot-DFT (DFT is quantum mechanical density functional theory), LLMs developed through prompt engineering and knowledge fusion training. ElaTBot is designed to predict elastic constant tensors and the bulk modulus at finite temperatures, and to propose materials with specific elastic properties.
To our knowledge, ElaTBot is the first model capable of directly and efficiently predicting the full elastic constant tensor at finite temperatures. ElaTBot-DFT, a variant specialized for 0 K elastic constant tensor prediction, reduces prediction error by 33.1% compared to the material science LLM Darwin29 using the same training and test sets. These results highlight the potential of LLMs for numerical materials property predictions.
Fig. 1 Datasets and overview of ElaTBot for predicting elastic properties. (a) Comparison of the number of materials in the Materials Project database30 with available data on crystal structures, band structures, and elastic properties. The availability of elastic constant tensor data is significantly lower than that of crystal and band structure data. (b) Overview of existing methods used to predict elastic constant tensors, which primarily rely on element descriptors and structural features (constructed from CIFs). (c) A flowchart illustrating the process of using large language models (LLMs) to acquire material knowledge. This method enables researchers to gain domain-specific insights, allowing those without extensive programming skills or theoretical expertise to conduct research, thereby lowering the entry barrier into materials science. (d) Capabilities of our specialized LLM ElaTBot. By incorporating elastic constant tensor data at finite temperatures, we develop an LLM-based agent, ElaTBot, which is capable of predicting elastic constant tensors, enhancing prediction without retraining by leveraging external tools and datasets, and generating chemical compositions for materials with a specified modulus.
Fig. 1(c) presents an integrated approach, combining ML and natural language processing, for predicting material properties and identifying materials with targeted properties. Specifically, for elastic property prediction and materials generation, we developed two domain-specific LLMs, ElaTBot and ElaTBot-DFT, which predict elastic properties such as the elastic constant tensor, the bulk, shear, and Young's moduli, and Poisson's ratio. To further improve user interaction and task handling, we implemented an AI-driven agent capable of utilizing tools, databases, and general LLMs to perform complex, multi-step tasks. This agent can process new (and unseen) data through integration of external tools and vector databases. Its responses can be fed into general LLMs (e.g., GPT-4, Gemini) to further extend its capabilities and tackle more complex, multi-step tasks. Fig. 1(d) shows three capabilities of our specialized LLM ElaTBot: prediction, Retrieval-Augmented Generation (RAG)-enhanced prediction35 without retraining, and generation.
To train ElaTBot, we first used robocrystallographer36 to extract structural text descriptions and employed pymatgen37 to obtain compositional information. We then integrated these elements into text-form prompts and fine-tuned the general Llama2-7b model to yield ElaTBot-DFT, a specialized model for predicting elastic constant tensors at 0 K. ElaTBot-DFT serves as a benchmark for elastic constant tensor prediction, particularly since other models are limited in addressing finite-temperature predictions. Next, we employed several steps to enhance LLM performance,38 incorporating finite-temperature data and fusing this knowledge to develop ElaTBot. Prompt engineering reduces the prediction error of the average value of the elastic constant tensor (see Methods) by 33.1% for ElaTBot-DFT compared with Darwin,29 a materials science LLM built on the same dataset. We ran the test set twice to ensure the reliability of the LLM results. Through knowledge fusion, ElaTBot accurately fits the temperature-dependent bulk modulus curves (derived from the elastic constant tensor) for new multicomponent alloys, with errors near room temperature approaching the average error for the 0 K test set. RAG-enhanced35 prediction with limited finite-temperature data further reduces the ElaTBot bulk modulus error from 27.49% to 0.95% across nine alloys at various temperatures, without retraining or fine-tuning.
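As a minimal sketch of this prompt-construction step, the snippet below combines a robocrystallographer structure description with pymatgen composition data into a single text prompt. The input file name and the template wording are illustrative assumptions; the exact templates used for training are given in ESI Table S1.†

```python
# Sketch: assemble a "composition + structure" prompt (in the spirit of
# prompt type 4) from a CIF file. File name and wording are illustrative.
from pymatgen.core import Structure
from robocrys import StructureCondenser, StructureDescriber

structure = Structure.from_file("example.cif")  # hypothetical input file

# Compositional information via pymatgen
formula = structure.composition.reduced_formula
fractions = {str(el): round(frac, 3)
             for el, frac in structure.composition.fractional_composition.items()}

# Structural text description via robocrystallographer
condensed = StructureCondenser().condense_structure(structure)
structure_text = StructureDescriber().describe(condensed)

prompt = (
    f"The material {formula} has atomic fractions {fractions}. "
    f"{structure_text} "
    "What is its elastic constant tensor (in GPa)?"
)
print(prompt)
```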
We integrate ElaTBot with GPT-4o to propose/screen materials based on the bulk modulus and other requirements of targeted applications. These include materials with low corrosion rates, high biocompatibility (measured by the median lethal dose), and a bulk modulus similar to that of bone for bone implants; high-bulk but low-shear modulus materials suitable for exoskeletons of soft robots; corrosion-resistant materials suitable for saline environments; and materials for protective layers of LiCoO2 electrodes.
We conducted a series of “experiments” to assess the effects of different input formats on model performance: JSON-formatted (JavaScript Object Notation) structure descriptions (prompt type 1), textual descriptions of the crystal structure (prompt type 2), textual descriptions of the composition (prompt type 3), and textual descriptions of both the chemical composition and crystal structure (prompt type 4). The model was trained using a 0 K density functional theory (DFT) dataset containing 9498 materials with elastic constant tensor data from the Materials Project, with 500 materials for validation and 522 for testing. Fig. 2(a) and ESI Table S2† show that prompt type 4 achieves a mean absolute error (MAE; the average of the absolute differences between the predicted and actual values over all data points) of 2.32 GPa and an R2 of 0.965 for predicting the average elastic constant tensor component $\bar{C}$, outperforming the other prompt types (explicit definitions of the MAE and $\bar{C}$ are in Methods). Compared to prompt type 1 (JSON format), prompt type 4 reduces the MAE by 16.8% and increases R2 by 1.9%. Compared to prompt type 2 (crystal structure descriptions only), prompt type 4 achieves a 5.3% reduction in MAE and a 0.8% increase in R2. Compared to prompt type 3 (composition descriptions only), prompt type 4 yields a 13.1% reduction in MAE and a 0.9% increase in R2. These results demonstrate that LLMs perform better when trained with natural language-like inputs, and that using both structural and compositional information improves elastic constant tensor prediction. The bulk modulus results (derived from the elastic constant tensor) in Fig. 2(b) confirm this: prompt type 4 achieves an MAE of 7.74 GPa and an R2 of 0.963, representing a 14.4% reduction in MAE and a 1.7% increase in R2 compared to prompt type 1. Therefore, prompt type 4 was selected for training Llama2-7b for our ElaTBot-DFT model.
Fig. 2 Prediction abilities of ElaTBot-DFT and ElaTBot. (a and b) Performance comparison of the Llama2-7b model using different prompt types, the MatTen model, the random forest model, and the Darwin model in predicting $\bar{C}$ and the bulk modulus.
We compared the performance of ElaTBot-DFT with two widely-used models for predicting the full elastic constant tensor: the random forest model, which utilizes composition-based Magpie (Materials Agnostic Platform for Informatics and Exploration) features,39 and the MatTen model, which employs a crystal-structure graph neural network.28 As shown in Fig. 2(a, b), ESI Fig. S1† and Table S3,† when trained on the same dataset, ElaTBot-DFT using prompt type 4 achieves a 30.3% reduction in MAE and a 4.4% increase in R2 for predicting the average elastic constant tensor components compared to the random forest model. Compared to the MatTen model, ElaTBot-DFT reduces the MAE by 4.5% and improves R2 by 0.2%. This demonstrates that, even with a relatively small dataset, LLMs trained with well-designed textual descriptions can outperform traditional methods, contrary to previous studies using QA-based training approaches.25 We also examined whether the generated elastic constant tensors obey the constraints imposed by rigorous application of crystal symmetry, which requires certain Cij components to be zero and fixes relationships among others. Under strict criteria (error margin of ±2 GPa), ElaTBot-DFT achieves a symmetry accuracy of 94%, significantly outperforming MatTen (5%) and the random forest model (6%) (Fig. 2(c)). Traditional numerical models tend to produce small non-zero values due to algorithmic limitations, whereas the natural language-based model, ElaTBot-DFT, accurately outputs a “0” where appropriate. We also tested the elastic stability of all of the materials in the test set (i.e., the Born condition: the elastic constant tensor is positive definite), as shown in Fig. S9.† Of the 519 materials in the test set, 518 are predicted to be elastically stable by both the random forest model and our ElaTBot-DFT model, whereas the MatTen predictions failed the stability test in 32 cases.
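The two checks above can be illustrated with a short sketch. The version below only verifies the symmetry-forbidden zero entries (not the fixed relationships among nonzero components) and tests Born stability via positive definiteness; the cubic example tensor and the mask construction are illustrative assumptions, not test-set data.

```python
import numpy as np

TOL = 2.0  # GPa; the strict error margin used for the symmetry check

def symmetry_ok(pred, forbidden_mask):
    """True if all symmetry-forbidden entries of the 6x6 Voigt matrix
    (marked True in forbidden_mask) are zero to within TOL."""
    return bool(np.all(np.abs(pred[forbidden_mask]) <= TOL))

def born_stable(pred):
    """Born condition: the (symmetrized) elastic constant tensor is
    positive definite, i.e., all of its eigenvalues are positive."""
    sym = 0.5 * (pred + pred.T)
    return bool(np.all(np.linalg.eigvalsh(sym) > 0.0))

# Illustrative cubic tensor: only C11, C12, C44 (and equivalents) are nonzero.
C = np.zeros((6, 6))
C[:3, :3] = 150.0                    # C12-type entries (GPa)
np.fill_diagonal(C[:3, :3], 250.0)   # C11-type entries
np.fill_diagonal(C[3:, 3:], 120.0)   # C44-type entries
forbidden = (C == 0.0)               # entries cubic symmetry forces to zero

print(symmetry_ok(C, forbidden), born_stable(C))  # True True
```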
We further compared ElaTBot-DFT predictions with those from the domain-specific materials science LLM, Darwin, for elastic constant tensor prediction. Domain-specific LLMs are widely believed to outperform general LLMs on specialized problems;40 however, as shown in Fig. 2(a, b, d, e) and ESI Table S3,† Darwin (even after fine-tuning on the same dataset) underperforms ElaTBot-DFT in predicting $\bar{C}$ and the bulk modulus. Specifically, the MAEs of ElaTBot-DFT are 33.1% and 31.8% lower than those of Darwin for $\bar{C}$ and the bulk modulus, respectively. This suggests that integrating the reasoning abilities of a general LLM with fine-tuning on a specific dataset may yield better results for tasks requiring quantitative property prediction. Fine-tuning a model with domain-specific knowledge (like Darwin) can lead to gaps in its abilities and knowledge loss, which may reduce its effectiveness in specialized tasks.41
We further examined the performance of ElaTBot-DFT across different crystal systems, as summarized in Tables S6 and S7.† The model demonstrates consistently strong predictive accuracy across all crystal systems except the triclinic system, for which there are only three data points in the test set and sixty in the training set. The performance of ElaTBot-DFT is particularly strong for the very common cubic system, with R2 > 0.97 for both elastic constant tensor and bulk modulus predictions. The performance is slightly lower for the orthorhombic and monoclinic systems, with R2 ∼ 0.94. This demonstrates that while ElaTBot-DFT is broadly effective across different crystal systems, predictive accuracy is influenced by training/test sample sizes (an issue only for less common crystal systems).
Finally, we integrated the finite-temperature dataset and designed four tasks (ESI Table S4†) with corresponding training text inputs (elastic constant tensor prediction, bulk modulus prediction, material generation based on bulk modulus, and text infilling) to conduct multi-task knowledge fusion training. This approach equips ElaTBot with multiple capabilities, including the ability to predict elastic constant tensors at finite temperatures. Although the text infilling task does not directly predict material properties, previous studies have shown that it improves overall multi-task performance.38 To test the effectiveness of ElaTBot, we selected three multicomponent alloys not in the training set (cubic Ni3Al, γ′-PE16 (Ni72.1Al10.4Fe3.2Cr1.0Ti13.3), and tetragonal γ-TiAl (Ti44Al56)) and predicted their bulk modulus as a function of temperature (based on the full elastic constant tensors). Given the limited finite-temperature training data (just 1266 samples) and the vast compositional space of alloys, predicting accurate values over wide ranges of temperature and composition is inherently challenging. We predicted the bulk modulus at 11 temperatures for Ni3Al and γ′-PE16 and at 15 temperatures for γ-TiAl. Fig. 2(f) shows the variation of the prediction errors for the three alloy systems (not in the training set) as a function of temperature, with the blue dashed lines indicating the error trends. A clear increase in prediction error with temperature is observed. We note that the original training set contains 10520 samples at 0 K and only 1266 entries at finite temperature. The errors are larger for the quinary γ′-PE16 (Ni72.1Al10.4Fe3.2Cr1.0Ti13.3) alloy than for the binary Ni3Al alloy; the original training set had 213 times more binary than quinary data. This highlights that model performance is less reliable for cases (alloy and temperature) that differ greatly from those in the training set. Nonetheless, the model performs remarkably well across a wide range of compositions and temperatures, especially in light of the fact that experimental data on compositionally complex materials at high temperatures are rare (and expensive to generate). Fig. 2(f) also shows that the fitted lines from the ElaTBot predictions closely align with experimental data,42,43 particularly at low temperatures, where ElaTBot exhibits smaller errors. This demonstrates that incorporating the 0 K DFT dataset from the Materials Project helps reduce prediction errors, highlighting the effectiveness of the multi-task knowledge fusion training approach.
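To make the multi-task setup concrete, the sketch below serializes one material record into the four Q&A task types. The function name, field names, question wording, and example values are illustrative stand-ins, not the exact templates of ESI Table S4.†

```python
# Sketch: turn one material record into the four Q&A task types used for
# knowledge-fusion training. Wording is illustrative (cf. ESI Table S4).
def build_tasks(description, formula, temperature_K, tensor_text, bulk_modulus_GPa):
    return [
        {  # task 1: elastic constant tensor prediction
            "question": f"{description} What is its elastic constant tensor "
                        f"(GPa) at {temperature_K} K?",
            "answer": tensor_text,
        },
        {  # task 2: bulk modulus prediction
            "question": f"{description} What is its bulk modulus (GPa) "
                        f"at {temperature_K} K?",
            "answer": f"{bulk_modulus_GPa:.2f}",
        },
        {  # task 3: material generation from a target bulk modulus
            "question": f"Generate a material with a bulk modulus of about "
                        f"{bulk_modulus_GPa:.0f} GPa at {temperature_K} K.",
            "answer": formula,
        },
        {  # task 4: description infilling ([MASK] hides the formula)
            "question": "Fill in [MASK]: " + description.replace(formula, "[MASK]"),
            "answer": formula,
        },
    ]

# Placeholder values for illustration only
qa_pairs = build_tasks("The material Ni3Al is a cubic intermetallic.",
                       "Ni3Al", 300, "[[230, 150, ...], ...]", 174.0)
```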
Fig. 3(a) compares the ElaTBot bulk modulus predictions for γ-TiAl at 170 K with and without RAG support. Since the finite-temperature data was not in the original ElaTBot training set, the model automatically queries our external database and finds bulk modulus data for γ-TiAl at similar temperatures (the 170 K data point was removed from the database for comparison purposes). The predicted value (110.77 GPa) differs by only 0.1% from the true value, whereas without RAG, the error increases to 2.4%. To ensure a fair comparison, we customized the RAG prompt to isolate its influence from the training prompts. Further testing on alloy data at various temperatures (see Fig. 3(b–d) and ESI Table S5†) demonstrates that RAG reduces the average error from 27.49% to 0.95%. RAG prediction performance can be improved by increasing the quantity and quality of the data: as shown in Table S8,† increasing the number of data points by 32% led to a 50% decrease in the error relative to experiments. By incorporating RAG, ElaTBot achieves RAG-enhanced prediction capabilities, allowing it to reason well beyond a training set that contained minimal similar data.
Based on the prediction and generation abilities described above, we developed a multi-function agent interface (Fig. S2†) that allows researchers to predict, generate, or engage in RAG-enhanced prediction through natural language dialogue, without the burden of coding.
Despite these promising results, several challenges remain, particularly in ensuring the stability of continuous quantitative predictions. For example, minor variations in temperature, such as between 500.12 K and 500.13 K, may lead to inconsistencies in property predictions like the bulk modulus. To address these issues, future work will focus on generating larger datasets,53 developing multi-agent systems for incremental task-solving,54 exploring novel digital encoding methods for LLMs,55 and guiding LLMs to learn materials laws (such as the general trend of decreasing elastic constant tensor values with increasing temperature). These improvements, along with the addition of constraints or regularization techniques, may enhance the stability of numerical predictions.
Our work presents a fresh perspective on using LLMs for the quantitative prediction of material properties and facilitating inverse material design. A key benefit of domain-specific LLMs is the ability to interact with and generate results through natural language, without requiring users to have extensive knowledge of the underlying ML techniques. This lowers the barrier to entry for computational materials design and fosters broader participation in the field. The integration of domain-specific and general-purpose LLMs allows for access to broader research data, enhancing the synergy between materials science and AI. These advancements have the potential to revolutionize both fields by accelerating innovation, discovery, and application.
Fig. 5 Dataset and model architecture for ElaTBot and ElaTBot-DFT. (a and b) Data distribution for the Materials Project (MP) 0 K DFT dataset and the finite temperature dataset. For ElaTBot-DFT, we used data from the MP dataset, converting it to prompt type 4 (shown in Table S1†). The transformed textual descriptions were split into training, validation, and test sets and then used for training ElaTBot-DFT. For ElaTBot, as shown in the lower part of (b), we combined data from the MP dataset and the finite temperature dataset, then converted it into question and answer (Q&A) pairs (each Q&A pair corresponds to one of the four designed tasks) as input and output for training ElaTBot, enabling ElaTBot to acquire multiple capabilities. (c) Number of Q&A entries used for training ElaTBot, categorized by the specific tasks in the training. The elastic constant tensor prediction task involves training ElaTBot to predict elastic constant tensors based on textual descriptions of materials. The bulk modulus prediction task requires ElaTBot to predict the bulk modulus from material textual descriptions. The material generation task aims to enable ElaTBot to generate material chemical formulas based on a given bulk modulus and temperature. The description infilling task, given a description of the chemical formula and composition, masks the formula with [MASK]; ElaTBot is then expected to fill in [MASK] with the correct chemical formula. (d) Model architecture for training ElaTBot-DFT. (e) Knowledge fusion training workflow for ElaTBot, detailing how external data and tools are integrated to enhance model capabilities.
In addition to the 10520 data points for elastic constant tensors at 0 K, we manually extracted 1266 experimental elastic constant tensor data points at finite temperatures from ref. 57. The distribution of elastic constant tensor data at different temperatures for this dataset is shown in Fig. 5(b). To enable multitasking in ElaTBot, we designed four tasks (Fig. 5(c)) and converted material composition and structural information into textual descriptions as outlined in ESI Table S4.† Given the limited availability of finite-temperature data, we did not create a separate test set for this subset. Instead, we evaluated predictive performance on unseen alloy compositions, including cubic phase Ni3Al, γ′-PE16, and tetragonal phase γ-TiAl.
For fine-tuning, we applied LoRA+,59 a parameter-efficient adaptation technique that extends basic LoRA.60 LoRA+ allows the adapter matrices to be fine-tuned at different learning rates, reducing GPU memory usage by approximately half without compromising data input capacity, thereby accelerating training. A detailed comparison between LoRA and LoRA+ is provided in Fig. S8 of the ESI.†
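A minimal sketch of the LoRA+ idea follows: standard LoRA adapters whose B matrices are trained at a higher learning rate than the A matrices, here set up with the peft library. The target modules, base learning rate, and learning-rate ratio are illustrative choices, not the paper's exact hyperparameters.

```python
# Sketch of LoRA+: standard LoRA adapters, but the lora_B matrices get a
# larger learning rate than the lora_A matrices. Hyperparameters illustrative.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]))

base_lr, ratio = 1e-4, 16  # LoRA+ trains B faster than A by a fixed ratio
param_groups = [
    {"params": [p for n, p in model.named_parameters()
                if p.requires_grad and "lora_A" in n], "lr": base_lr},
    {"params": [p for n, p in model.named_parameters()
                if p.requires_grad and "lora_B" in n], "lr": base_lr * ratio},
]
optimizer = torch.optim.AdamW(param_groups)
```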
The LLMs (ElaTBot-DFT, ElaTBot, and Darwin) were managed using the Llama-factory61 framework, which facilitates model loading and parameter tuning. The Transformer architecture of Llama-2, including LayerNorm, residual connections, and dropout mechanisms, limits overfitting on small datasets. We implemented a warm-up strategy (a neural network training technique in which the learning rate is gradually increased to the initial learning rate during the first few training epochs62) and a cosine learning rate scheduler (which gradually reduces the learning rate during training63) to ensure smooth gradient updates when training on small datasets. The random forest and MatTen models were trained directly using Python and PyTorch. The LLMs were optimized with the cross-entropy loss

$$L = -\sum_i y_i \log(p_i),$$

where yi is the true label and pi is the predicted probability. The random forest model used the squared error loss

$$L = \sum_i (y_i - \hat{y}_i)^2,$$

where yi is the true value and ŷi is the predicted value. The MatTen model employed the mean squared error loss

$$L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

for optimization, consistent with the method specified in ref. 28. These losses guide the models in learning to accurately predict the elastic constant tensor.
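A minimal sketch of the warm-up plus cosine-decay schedule described above, using the transformers helper; the step counts and the stand-in parameter are placeholders.

```python
# Sketch: linear warm-up followed by cosine decay of the learning rate.
import torch
from transformers import get_cosine_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(10))]  # stand-in for model weights
optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=2000)

for step in range(2000):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()   # lr: 0 -> 1e-4 over 100 steps, then cosine decay to 0
    optimizer.zero_grad()
```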
The predicted elastic constant tensor is expressed in Voigt form:

$$C = \begin{pmatrix} C_{11} & C_{12} & C_{13} & C_{14} & C_{15} & C_{16} \\ C_{12} & C_{22} & C_{23} & C_{24} & C_{25} & C_{26} \\ C_{13} & C_{23} & C_{33} & C_{34} & C_{35} & C_{36} \\ C_{14} & C_{24} & C_{34} & C_{44} & C_{45} & C_{46} \\ C_{15} & C_{25} & C_{35} & C_{45} & C_{55} & C_{56} \\ C_{16} & C_{26} & C_{36} & C_{46} & C_{56} & C_{66} \end{pmatrix}$$
The average value of the elastic constant tensor, $\bar{C}$, is calculated over the components of the Voigt matrix:

$$\bar{C} = \frac{1}{36}\sum_{i=1}^{6}\sum_{j=1}^{6} C_{ij}$$
The bulk modulus K is calculated by pymatgen37 as the Voigt–Reuss–Hill average:

$$K_V = \frac{1}{9}\left[(C_{11}+C_{22}+C_{33}) + 2(C_{12}+C_{13}+C_{23})\right],$$
$$K_R = \left[(S_{11}+S_{22}+S_{33}) + 2(S_{12}+S_{13}+S_{23})\right]^{-1},$$
$$K = K_{VRH} = \frac{K_V + K_R}{2},$$

where S = C−1 is the compliance tensor.
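As a usage sketch, pymatgen's elasticity module exposes these quantities directly (k_voigt, k_reuss, k_vrh); the example cubic tensor below is illustrative.

```python
# Sketch: bulk modulus (Voigt-Reuss-Hill average) from a 6x6 Voigt matrix
# with pymatgen. The example cubic tensor is illustrative.
import numpy as np
from pymatgen.analysis.elasticity.elastic import ElasticTensor

voigt = np.zeros((6, 6))
voigt[:3, :3] = 150.0                    # C12-type entries (GPa)
np.fill_diagonal(voigt[:3, :3], 250.0)   # C11-type entries
np.fill_diagonal(voigt[3:, 3:], 120.0)   # C44-type entries

et = ElasticTensor.from_voigt(voigt)
print(f"K_V = {et.k_voigt:.1f} GPa, K_R = {et.k_reuss:.1f} GPa, "
      f"K_VRH = {et.k_vrh:.1f} GPa")
```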
The mean absolute error (MAE) and coefficient of determination (R2) are used to evaluate model performance. The MAE is calculated as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

and R2 is

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where yi is the true value, ŷi is the predicted value, and ȳ is the mean of the true values.
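Written out with numpy, the two metrics are:

```python
# Sketch: the two evaluation metrics written out with numpy.
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over all data points."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

print(mae([100.0, 200.0], [102.0, 195.0]))  # 3.5
print(r2([100.0, 200.0], [102.0, 195.0]))   # 0.9942
```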
For finite-temperature predictions, ElaTBot generated elastic constant tensors for Ni3Al and γ′-PE16 at T = 90, 113, 142, 162, 192, 223, 253, 283, 303, 333, and 363 K, and for γ-TiAl at T = 30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230, 250, 270, 290, and 298 K. Due to the limited training data and the discrete nature of ElaTBot's predictions, we performed a linear fit to the predicted values and used it to evaluate the deviation from experimental data and to analyze error trends.
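A sketch of this fit-and-compare step is shown below; the temperatures match those listed above for Ni3Al, but the predicted and experimental values are placeholders, not data from this work.

```python
# Sketch: linear fit of predicted bulk modulus vs. temperature, then percent
# deviation of the fitted line from experiment. Values are placeholders.
import numpy as np

temps = np.array([90, 113, 142, 162, 192, 223, 253, 283, 303, 333, 363.])  # K
pred = np.array([176, 175, 174, 173, 172, 171, 170, 169, 168, 167, 166.])  # GPa
expt = np.array([175, 174, 174, 172, 171, 170, 169, 168, 167, 166, 165.])  # GPa

slope, intercept = np.polyfit(temps, pred, 1)   # linear fit to the predictions
fit = slope * temps + intercept
rel_err = np.abs(fit - expt) / expt * 100.0     # % deviation from experiment
print(rel_err.round(2))
```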
The RAG-enhanced prediction ability of ElaTBot was enabled through the integration of RAG, which allows the model to perform real-time learning without requiring retraining. The knowledge base consists of finite-temperature, experimentally-measured elastic constant tensor data for three materials from the literature: Ni3Al at T = 90, 113, 142, 162, 192, 223, 253, 283, 300, 303, 333, 363, 400, 500, 600, 700, 800, 900, 1000 and 1100 K; γ′-PE16 at T = 90, 113, 142, 162, 192, 223, 253, 283, 300, 303, 333, and 363 K; and γ-TiAl at T = 30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230, 250, 270, 290, and 298 K.42,43,65 To ensure a rigorous, unbiased evaluation of our RAG-based system, the database accessed by the RAG system excluded 18 randomly selected data points from the full knowledge base; these 18 points formed the test set shown in ESI Table S5.† For materials absent from the constructed knowledge base (specifically, those not related to the three example alloy systems investigated in this study), RAG was not activated. To ensure reliability in cases of uncertainty, the prompt included the explicit instruction: ‘If you don't know the answer, just say that you don't know.’ In instances where the language model responded with ‘don't know’ even after RAG was applied, the agent system reverts to using the base LLM without retrieval assistance. The RAG module was implemented using langchain66 and follows a multi-step process comprising document loading, splitting, storage, retrieval, and output generation. This process enhances the ability of ElaTBot to update its knowledge and handle new data efficiently, as outlined in Fig. 3(a).
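A sketch of these five steps (load, split, store, retrieve, generate) with langchain follows. The class names are from the current langchain packages, while the file name, embedding/chat models, chunk size, and prompt wording are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: langchain RAG pipeline - load, split, store, retrieve, generate.
# File name, models, and prompt wording are illustrative.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

docs = TextLoader("finite_T_elastic_data.txt").load()                   # load
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)  # split
store = FAISS.from_documents(chunks, OpenAIEmbeddings())                # store
retriever = store.as_retriever(search_kwargs={"k": 4})                  # retrieve

question = "What is the bulk modulus of gamma-TiAl at 170 K?"
context = "\n".join(d.page_content for d in retriever.invoke(question))
prompt = ("Use the context to answer. If you don't know the answer, just say "
          f"that you don't know.\n\nContext:\n{context}\n\nQuestion: {question}")
print(ChatOpenAI(model="gpt-4o").invoke(prompt).content)                # generate
```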
Footnote
† Electronic supplementary information (ESI) available: ESI Notes, ESI Fig. S1–S9, ESI Tables S1–S8. See DOI: https://doi.org/10.1039/d5dd00061k |
This journal is © The Royal Society of Chemistry 2025 |