Youjia Li,*a Vishu Gupta,abc Muhammed Nur Talha Kilic,d Kamal Choudhary,ef Daniel Wines,e Wei-keng Liao,a Alok Choudharya and Ankit Agrawal*a
aDepartment of Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA. E-mail: youjia@northwestern.edu; ankit-agrawal@northwestern.edu
bLewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
cLudwig Institute for Cancer Research, Princeton University, Princeton, NJ, USA
dDepartment of Computer Science, Northwestern University, Evanston, IL, USA
eMaterial Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, Gaithersburg, MD, USA
fDeepMaterials LLC, Silver Spring, MD 20906, USA
First published on 18th December 2024
Graph-centric learning has attracted significant interest in materials informatics. Accordingly, a family of graph-based machine learning models, primarily utilizing Graph Neural Networks (GNN), has been developed to provide accurate prediction of material properties. In recent years, Large Language Models (LLM) have revolutionized existing scientific workflows that process text representations, thanks to their exceptional ability to utilize extensive common knowledge for understanding semantics. With the help of automated text representation tools, fine-tuned LLMs have demonstrated competitive prediction accuracy as standalone predictors. In this paper, we propose to integrate the insights from GNNs and LLMs to enhance both prediction accuracy and model interpretability. Inspired by the feature-extraction-based transfer learning study for the GNN model, we introduce a novel framework that extracts and combines GNN and LLM embeddings to predict material properties. In this study, we employed ALIGNN as the GNN model and utilized BERT and MatBERT as the LLM model. We evaluated the proposed framework in cross-property scenarios using 7 properties. We find that the combined feature extraction approach using GNN and LLM outperforms the GNN-only approach in the majority of the cases with up to 25% improvement in accuracy. We conducted model explanation analysis through text erasure to interpret the model predictions by examining the contribution of different parts of the text representation.
Despite the unique strengths of the GNN architecture, its reliability and accuracy depend on the size of the available datasets. Although material datasets are regularly growing in size,20–22 several material properties remain expensive to compute. To tackle the performance degradation caused by limited dataset size, Gupta et al.23–25 proposed a series of transfer learning (TL) frameworks that capture cross-property learnings from a model trained on a source property to improve the predictive ability of a target-property model with a small dataset. Work in ref. 23 applied two cross-property transfer learning strategies to the GNN-based model to learn structure-aware representations from the source property model. First, a fine-tuning approach was explored by leveraging a pre-trained ALIGNN model for parameter initialization. Second, a feature-extraction-based TL approach was investigated by extracting embeddings from the knowledge model as features. Predictive model performance collected across diverse materials datasets demonstrated that the latter approach is better suited for small datasets.
Meanwhile, Large Language Models (LLMs), with their generality and transferability, offer an alternative solution for materials science knowledge discovery.26–28 In recent years, the rapid development of LLMs has led to a wave of breakthroughs for numerous text-related tasks in various domains.29–32 Their exceptional performance has incentivized researchers to apply them in structure–property relationship discovery. In particular, pre-trained domain-specific language models have proven effective at encapsulating latent knowledge embedded within domain literature.33,34 The combination of fine-tuning-based transfer learning and pre-trained domain-specific language models exhibits state-of-the-art performance in both property prediction tasks26,33,34 and material generation tasks.35 In the era of LLMs, there has been a growing focus on investigating the potential of LLMs to enhance the generalization, transferability, and few-shot learning capabilities of graph learning.36–40 However, in the context of crystal property prediction, few works have attempted to combine textual information extracted from natural language with the structure-aware learnings of the aforementioned GNN models.
In this work, we present a novel framework that combines contextual word embeddings extracted from pre-trained LLMs with the structure-aware embeddings extracted from GNNs. This integration aims to combine the strengths of the two models to enhance both predictive accuracy and model interpretability. The workflow comparison of the original GNN-based transfer learning and the proposed approach is shown in Fig. 1a. In the proposed workflow, we first reproduce the GNN embeddings as extracted structure-aware feature vectors. On a separate working thread, we employ pre-trained LLM models to generate contextual embeddings for the string representations of the same data samples. Lastly, we concatenate the embeddings from the two sources and feed them to the data mining model to predict target material properties. With this concatenation, we aim to combine unique insights from natural language contexts with the structural learnings represented in the GNN layers. LLM embeddings can provide a deep understanding of text sequences, including nuanced semantic relationships, syntactic structures, and commonsense reasoning. These complementary learnings are anticipated to refine data sample representations for the downstream predictive model. In addition, by harnessing human-readable text inputs, LLM embeddings enable a direct mapping between the model's predictions and the string representation it operates on. This facilitates model interpretability by making it possible to trace the impact of specific text representations on the model's outputs.
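At its core, the combination step described above is a concatenation of the two embedding sources into one feature vector. A minimal Python sketch, with toy vectors standing in for the real ALIGNN and MatBERT embeddings (the helper name `combine_embeddings` is illustrative, not code from this work):

```python
def combine_embeddings(gnn_vec, llm_vec):
    """Concatenate a structure-aware GNN embedding with a contextual
    LLM embedding into a single feature vector for the downstream
    predictive model."""
    return list(gnn_vec) + list(llm_vec)

# Toy stand-ins for an ALIGNN embedding and an LLM (BERT/MatBERT) embedding.
gnn_embedding = [0.12, -0.34, 0.56]
llm_embedding = [0.91, 0.08]
features = combine_embeddings(gnn_embedding, llm_embedding)
```

In practice the two vectors would be much higher-dimensional, but the downstream data mining model simply consumes the concatenated vector as its input features.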
For the source GNN knowledge model, we use Formation Energy from the MP dataset. For the target property prediction, we use multiple DFT-computed properties from the 2022.12.12 version of the JARVIS-3D dataset, which consists of 75 993 materials with properties including formation energies, energy above the hull, modified Becke–Johnson potential (MBJ) bandgaps,47 spectroscopic limited maximum efficiency (SLME),48 magnetic moments, topological spin–orbit spillage49,50 and superconducting transition temperature.51 Across all properties, we use 80% : 10% : 10% splits with random shuffling for training, validation and testing.
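The 80% : 10% : 10% random-shuffle split can be sketched as follows, assuming samples are addressed by integer indices (the helper name and seed are illustrative, not from this work):

```python
import random

def split_80_10_10(indices, seed=0):
    """Shuffle sample indices and split them 80/10/10 into
    training, validation and test sets."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n = len(idx)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_80_10_10(range(100))
```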
Table 1 indicates that the proposed transfer learning approach (ALIGNN-MatBERT-based TL) outperforms ALIGNN scratch models in 5/7 cases. When compared against the ALIGNN-embedding-only transfer learning approach, the proposed hybrid approach produces superior performance for all 7 properties. The results illustrate the advantage of the proposed hybrid representation when the data size is small. We believe the combined feature representation benefits from the information-dense embeddings of the LLM model, which carry textual insights from the description. The comparison of pre-trained LLM models shows that MatBERT leads in most cases by generating more informative word embeddings. This aligns with expectations, because the generated text incorporates domain-specific knowledge and MatBERT specializes in understanding materials science terminology and scientific reasoning. We further examine the performance gain of combining GNN and LLM embeddings through a parity plot of DFT-calculated versus machine-learning-predicted bandgap values. As illustrated in Fig. 2, the elimination of the extreme prediction errors present in the ALIGNN scratch model contributes to the overall MAE improvement.
| Property | Data size | XGBoost with matminer | ALIGNN scratch | ALIGNN-based TL | ALIGNN-BERT-based TL (ChemNLP) | ALIGNN-BERT-based TL (Robocrystallographer) | ALIGNN-MatBERT-based TL (ChemNLP) | ALIGNN-MatBERT-based TL (Robocrystallographer) | % error change vs. ALIGNN scratch | % error change vs. ALIGNN-based TL |
|---|---|---|---|---|---|---|---|---|---|---|
| Formation energy (eV per atom) | 75 993 | 0.0640 | 0.0297 | 0.0346 | 0.0347 | 0.0366 | 0.0345 | 0.0339 | 13.94 | −2.16 |
| Ehull (eV per atom) | 74 … | 0.0543 | 0.0332 | 0.0383 | 0.0365 | 0.0366 | 0.0359 | 0.0357 | 7.65 | −6.80 |
| Magout (μB) | 74 … | 0.5509 | 0.5246 | 0.4465 | 0.4115 | 0.4385 | 0.3932 | 0.4211 | −25.05 | −11.95 |
| BandgapmBJ (eV) | 19 … | 0.304 | 0.2571 | 0.2771 | 0.2559 | 0.2523 | 0.2720 | 0.2516 | −2.13 | −9.21 |
| Spillage | 11 … | 0.3337 | 0.3344 | 0.3285 | 0.3210 | 0.3226 | 0.3215 | 0.3140 | −6.10 | −4.41 |
| SLME (%) | 9762 | 5.0917 | 4.7064 | 4.6516 | 4.8874 | 4.7128 | 4.6579 | 4.5634 | −3.04 | −1.90 |
| Tc_supercon (K) | 1054 | 2.7499 | 2.5144 | 2.2441 | 2.1469 | 2.0363 | 2.2055 | 2.1095 | −19.01 | −9.26 |
Text source comparison shows that Robocrystallographer marginally outperforms ChemNLP in generating text descriptions, leading in 10/14 cases. Fig. 4 presents sample string representations of LiCeO2 for comparison of the different text representations. At a high level, while both paragraphs share a formal, technical style and neutral tone, the Robocrystallographer text uses more direct, conversational and versatile language, possibly making it more accessible and easier to comprehend. In contrast, the ChemNLP text has a dense style and is more detailed and descriptive, with a large amount of numerical property information. As a result, the concise, more natural-language-like style of the former could enhance prediction accuracy.
| Property | Robocrystallographer: ALIGNN-MatBERT-based TL | Robocrystallographer: ALIGNN-based TL | Robocrystallographer: MatBERT-based TL | ChemNLP: ALIGNN-MatBERT-based TL | ChemNLP: ALIGNN-based TL | ChemNLP: MatBERT-based TL |
|---|---|---|---|---|---|---|
| Formation energy (eV per atom) | 0.0339 | 0.0346 | 0.0871 | 0.0345 | 0.0346 | 0.1018 |
| Ehull (eV per atom) | 0.0357 | 0.0383 | 0.0601 | 0.0359 | 0.0383 | 0.0683 |
| Magout (μB) | 0.4211 | 0.4465 | 0.5571 | 0.3932 | 0.4465 | 0.5712 |
| BandgapmBJ (eV) | 0.2516 | 0.2771 | 0.3598 | 0.2720 | 0.2771 | 0.4012 |
| Spillage | 0.3140 | 0.3285 | 0.3507 | 0.3215 | 0.3285 | 0.3523 |
| SLME (%) | 4.5634 | 4.6516 | 5.3658 | 4.6579 | 4.6516 | 6.0009 |
| Tc_supercon (K) | 2.1095 | 2.2441 | 2.5879 | 2.2055 | 2.2441 | 2.3506 |
The first text-based model explanation analysis we consider is the word-level rationale extracted from string representations. In the regression setting, this can be done by masking the target word or token at the inference stage and measuring the magnitude of the change in the predicted property. Fig. 3 illustrates this approach, showing that the bond length value (2.50) is the most influential word among all candidates for this particular crystal sample.
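This word-level erasure can be sketched as follows. Here `toy_predict` is a stand-in for the trained forward model (not the actual model used in this work); the scoring loop masks one word at a time and records the absolute change in the predicted property:

```python
def word_importance(predict, words, mask_token="[MASK]"):
    """Score each word by how much the prediction changes when
    that word alone is replaced with a mask token."""
    base = predict(" ".join(words))
    scores = {}
    for i, w in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        scores[w] = abs(predict(" ".join(masked)) - base)
    return scores

# Toy stand-in predictor that reacts strongly to the bond-length value.
def toy_predict(text):
    return 1.0 + (0.8 if "2.50" in text else 0.0)

scores = word_importance(toy_predict, ["bond", "length", "2.50", "A"])
```

The word whose masking produces the largest prediction change is taken as the most influential rationale for that sample.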
Despite these insights, the above model interpretation has clear limitations in both applicability and effectiveness. Because the string representation varies in length and vocabulary across samples, word-level analysis is limited to individual samples and cannot be generalized to the entire dataset. Furthermore, in a regression setting, the impact of masking a particular word in one sample is not directional, making it unclear whether the model's performance improves or degrades. Therefore, we also perform a second model interpretation analysis. At the sentence level, the text generated by both sources is consistently organized across all samples, allowing specific descriptive sentences to be systematically removed from the entire dataset to measure their impact on prediction performance.
To facilitate text-based removal, we tag sentences in generated descriptions based on the textual information they contain, as illustrated in Fig. 4. The text generated by Robocrystallographer starts with an opening introduction sentence (tagged as [summary]) that states the material's crystallization and space group. Then, for each element, it describes the number of primary atomic sites for multi-site elements (only present in multi-site samples, tagged as [site info]). The description then iterates through each atomic site and describes the bonding environment and geometric arrangement for each site (tagged as [structure coordination]). Following this, it includes the measurement of bond distances (tagged as [bond length]) and bond angles (present only in some samples, tagged as [bond angle]). On the other hand, the text sourced from ChemNLP begins with the chemical information (tagged as [chemical info]), which details chemical properties including formula and atomic fractions. It then introduces the structure information (tagged as [structure info]), detailing the lattice parameters, space group, top X-ray diffraction (XRD) peaks, material density, crystallization system, point group and Wyckoff positions. Finally, the bond lengths (tagged as [bond length]) are included for every atomic pair present in the structure.
In the family of erasure-based explainable AI (XAI) techniques25,54–57 for NLP tasks, rationale comprehensiveness54 provides a theoretical framework for classification tasks by measuring the decrease in model confidence in the correct prediction when the tokens comprising the provided rationale are erased. Here, we extend the concept of rationale comprehensiveness to the regression setting by measuring the MAE increase when a target subregion of text is removed across all samples. We systematically categorize string descriptions into 5 tags based on their content across all samples. We can then measure the comprehensiveness of each tag ti by constructing a contrast text dataset T/ti, which is the original text dataset T with tag ti removed from all samples. At the testing stage, both versions of the text dataset are piped into the trained ALIGNN-MatBERT-based TL forward model. We then calculate the comprehensiveness of tag ti as the MAE difference between the erased version and the original full-text representation using formula (1):
Comprehensivenessi = MAE(T/ti) − MAE(T)   (1)
A high comprehensiveness score here implies that the tagged text significantly influenced the prediction, whereas a low score suggests the opposite.
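Formula (1) can be computed directly from two sets of predictions over the test set: one on the full text T and one on the erased version T/ti. A minimal sketch with toy numbers and hypothetical helper names:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def comprehensiveness(y_true, pred_full, pred_erased):
    """Comprehensiveness_i = MAE(T/t_i) - MAE(T): the MAE increase
    caused by erasing tag t_i from every sample's text."""
    return mae(y_true, pred_erased) - mae(y_true, pred_full)

y_true = [1.0, 2.0, 3.0]        # DFT-computed target values
pred_full = [1.1, 2.0, 2.9]     # predictions on the full text T
pred_erased = [1.4, 2.3, 2.6]   # predictions on T with tag t_i removed
score = comprehensiveness(y_true, pred_full, pred_erased)
```

Note that the model itself is unchanged between the two passes; only the input text differs, so the score isolates the contribution of the erased tag.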
Results from the model explanation analysis emphasize the importance of structure information, particularly structural coordination descriptions. During the analysis, the model is not retrained; only the input text representation is adjusted at the inference stage. The MAE values after removal of each tag are collected in Fig. 5 for each property. As shown in Fig. 4 and 5, for Robocrystallographer text, removing each tag causes a varied level of degradation in prediction performance. The most impactful tag across all properties is structure coordination. The bond length and summary tags rank in the second tier. Similarly, for ChemNLP text, structure information has the greatest impact on performance, with a drastic MAE increase (114.59%) observed upon its removal in the prediction of the energy above hull (Ehull) property. One divergence between the two text sources is the significance of the bond length tag: for ChemNLP text, the bond distance information has a negligible impact on MAE. This can be attributed to the different textual representations of bond information in the two sources. The bond length description by Robocrystallographer follows a more logical sequence and a more natural-language-like style, which turns out to be favored by the downstream LLM.
Our model explanation analysis takes advantage of the natural-language interface provided by the text representation of crystal structures. The presented erasure-based analysis is an illustration of interpreting model predictions by relating performance to a human-readable text representation. Additionally, the results from the model explanation analysis emphasize the importance of structural information. This suggests that despite the dense structural learnings from the pre-trained GNN source model, there are still complementary structural learnings in the text format that remain untapped. When we compare the structural descriptions from the generated text with the ALIGNN model input features, we find overlapping structural properties, such as bond distances, which are directly encoded in the GNN model input, as well as other structural insights that are either missing or indirectly encoded in the GNN model input, such as crystal system. Therefore, the additional structural learnings can either be sourced from the enhanced representation of existing structural feature properties or new structural information that is uniquely present in text descriptions.
To further explore the impact of incorporating LLM embeddings into the transfer learning pipeline, we analyzed test dataset performance categorized by crystal system and composition prototype. Using bandgap as a representative target property, we plotted the distribution of the top 10% most accurate predictions and the overall MAE level in Fig. 6. The bar plot shows the distribution of the top 10% most accurate predictions for each crystal system or chemical composition prototype, while the line plot represents the overall MAE across the entire test set. We compared predictions from ALIGNN-MatBERT-based embeddings against those from ALIGNN embeddings only. Looking at the MAE values grouped by composition prototype, we find that LLM embeddings improved predictions for the A2BC, AB and A2BCD6 prototypes. The top 10% most accurate predictions highlight the samples for which the model performs well. The comparison of the composition prototype distribution reveals a shift in the model's predictive strengths, with an increased frequency for the ABC2 and A2B prototypes and a decreased frequency for the ABC and ABC3 prototypes. For crystal systems, the hexagonal and monoclinic systems show the most significant improvements from LLM embeddings, evidenced by both a higher frequency of top 10% predictions and a lower overall MAE.
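The per-category breakdown above amounts to grouping absolute test errors by a label such as crystal system or composition prototype. A pure-Python sketch with toy values (the function name is illustrative, not from this work):

```python
def mae_by_group(abs_errors, groups):
    """Aggregate absolute prediction errors into a per-group MAE,
    where each group label is e.g. a crystal system or a
    composition prototype."""
    totals = {}
    for err, g in zip(abs_errors, groups):
        s, n = totals.get(g, (0.0, 0))
        totals[g] = (s + abs(err), n + 1)
    return {g: s / n for g, (s, n) in totals.items()}

abs_errors = [0.1, 0.3, 0.2, 0.4]
systems = ["hexagonal", "monoclinic", "hexagonal", "cubic"]
per_system_mae = mae_by_group(abs_errors, systems)
```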
On a different note, the performance gain achieved with the domain-specific MatBERT over the general BERT model underscores the unique value of a domain-specific tokenizer for knowledge discovery in materials science. A domain-specific tokenizer tailored to the materials science field enhances text processing by accurately recognizing and tokenizing specialized vocabulary, technical terms, chemical formulas, and abbreviations unique to the discipline. To better mine insights from materials science literature, one promising direction is to develop domain-specific tokenization for the pre-training phase of LLMs.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00199k
This journal is © The Royal Society of Chemistry 2025 |