Hongchen Wang,a Kangming Li,b Scott Ramsay,a Yao Fehlis,c Edward Kim*a and Jason Hattrick-Simpers*a
aDepartment of Materials Science and Engineering, University of Toronto, Toronto, Ontario M5S 1A1, Canada. E-mail: edwardsoo.kim@mail.utoronto.ca; jason.hattrick.simpers@utoronto.ca
bAcceleration Consortium, University of Toronto, Toronto, Ontario M5S 3H6, Canada
cArtificial, Inc., Austin, Texas 78731, USA
First published on 28th May 2025
Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. In this study, we evaluate the performance and robustness of LLMs for materials science, focusing on domain-specific question answering and materials property prediction across diverse real-world and adversarial conditions. Three distinct datasets are used in this study: (1) a set of multiple-choice questions from undergraduate-level materials science courses, (2) a dataset including various steel compositions and yield strengths, and (3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of “noise”, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study showcases unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance recovery from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.
However, the robustness of LLMs is a critical factor in their practical deployment, yet it remains an underexplored area, particularly in domain-specific applications such as materials science. Previous studies have shown that LLMs struggle to maintain predictive accuracy when the input distribution shifts, exhibiting poor generalization to out-of-distribution (OOD) test data and vulnerability to adversarial attacks.24–26 These challenges highlight the need for systematic robustness evaluations to ensure LLM reliability in real-world scenarios. A key aspect of the robustness of LLMs is their sensitivity to prompt changes, whether innocuous or adversarial.27,28 Variations in how a query or instruction is phrased can change the factual content of a response.27 As an example, 0.1 nm and 1 Å are equivalent, but switching them in a prompt could result in different LLM predictions for the same task. Alternatively, the response of an LLM can be deliberately altered through intentional misinformation or misleading inputs.28 These attributes are not merely theoretical concerns; they are critical for the reliable use of LLMs as they become integrated into the materials science research and development pipeline. Given that LLMs generate outputs with indifference to truth,29 thoroughly probing their prompt sensitivity allows us to critically evaluate model performance in practical situations, providing informed skepticism for the broad use of LLMs in materials science.
In this work, we conducted a holistic robustness analysis of commercial and open-source LLMs for materials science. While our primary analyses focus on pre-reasoning models due to their consistent single-pass inference structure, we also include a representative reasoning model (DeepSeek-R1 (ref. 30)) in both the initial benchmarking and the robustness evaluation. Reasoning models, such as DeepSeek-R1 and OpenAI-o1,31 incorporate intermediate reasoning steps during inference, which distinguishes them from pre-reasoning models. Including DeepSeek-R1 allows us to assess whether such reasoning architectures improve overall performance and robustness under perturbed conditions. Three distinct datasets covering domain-specific Q&A and materials property prediction were selected. First, we benchmarked LLMs of different sizes and release periods using prompt engineering to establish baseline and optimal performance boundaries. We then investigated the impact of various textual perturbations, ranging from realistic to adversarial, on LLM performance in materials science Q&A. Next, we used the matbench_steels dataset to investigate whether pretrained LLMs can move beyond simple interpolation of few-shot examples to capture deeper structure-property relationships. Without fine-tuning, pretrained LLMs demonstrated enhanced predictive ability through few-shot in-context learning (ICL) when presented with examples similar to the prediction target. Conversely, when provided with dissimilar examples during few-shot ICL, mode collapse behavior was observed, where the model often generated identical outputs despite varying inputs, suggesting limited generalization capability in OOD settings. Furthermore, we also evaluated a fine-tuned LLM (LLM-Prop18) on a band gap prediction task to assess the robustness of task-specific models, which are increasingly adopted in materials science due to their strong performance on targeted problems.32,33 Counterintuitively, supposedly adversarial perturbations such as sentence shuffling enhanced LLM-Prop's predictive capability with significantly truncated prompts. This train/test mismatch behavior, absent in traditional ML models, highlights a potential direction for distilling LLM-based predictive models.
The MSE-MCQs questions are manually categorized into easy (number of questions, n = 39), medium (n = 40), and hard levels (n = 34), based on a set of heuristics, including conceptual complexity, the level of reasoning required, and the presence and difficulty of the calculations. For example, “easy” questions primarily test factual recall or direct application of basic concepts, such as identifying the crystal structure of a material. “Medium” questions involve moderate reasoning or straightforward calculations, such as determining the stress in a material under specific conditions. “Hard” questions require multi-step reasoning or more complex calculations, such as deriving material properties from combined thermodynamic and mechanical data. Some examples are shown in Table 1.
Difficulty | MSE-MCQs question
---|---
Easy | Which of the following most closely describes the ductility of a sample? (a) The plastic strain at fracture; (b) The elastic strain at fracture; (c) The total strain at fracture; (d) None of the above
Medium | A hypothetical FCC metal has a density of 7.4 g cm⁻³ and a molar mass of 55.3 g mol⁻¹. Which of the following is the correct number of atom sites (that is, without any vacancies)? (a) 1.09 × 10²² atoms per cm³; (b) 1.34 × 10⁻¹ atoms per cm³; (c) 6.80 × 10⁻²² atoms per cm³; (d) 8.06 × 10²² atoms per cm³
Hard | A cylindrical sample of stainless steel having a Young's modulus of 204.3 GPa, a diameter of 12.0 mm, and an initial length of 237.8 mm is loaded to a stress of 411.5 MPa. The sample is then completely unloaded. What will the elastic recovery of this sample be, in mm? The yield strength and ultimate tensile strength of this specific alloy are 292.0 MPa and 688.0 MPa, respectively. (a) Possible to calculate from information provided, but none of these options are correct; (b) 0.96; (c) Not possible to calculate from information provided; (d) 239.0; (e) 0.24; (f) 0.48
To evaluate the impact of prompt engineering on LLM performance in materials science Q&A tasks, we tested each model under two distinct conditions: (1) without expert prompt (no prompt engineering) – the model received only the multiple-choice question in the user message, with no system prompt or additional instruction, serving as a baseline to assess its default performance; and (2) with expert prompt – the model was provided with a structured system prompt instructing it to act as a domain expert and reason through the problem step-by-step, aiming to enable a direct assessment of how prompt engineering influences reasoning and answer accuracy.
The expert prompt incorporates both expert prompting and zero-shot chain-of-thought (CoT) strategies. Expert prompting involves instructing the LLM to adopt the role of a domain expert, which has been shown to guide responses toward more accurate and knowledge-aligned reasoning.42 Zero-shot CoT prompting complements this by encouraging the model to “think aloud” and generate step-by-step reasoning even without prior examples, potentially improving accuracy in problem-solving tasks.43 These strategies were combined into a single structured system prompt used across all “With Expert Prompt” evaluations. In the Q&A evaluation, the expert prompt includes instructions to define the domain of study, introduces the settings of the questions, and emphasizes step-by-step reasoning and calculations. The goal is to improve the LLMs' ability to retrieve domain-specific knowledge, follow the instructions, and correctly perform reasoning and calculations. The expert prompt is shown below:
Given the lengthy reasoning in the answers and the potential for errors in manual verification, we used the gpt-4-0613 API in a separate client to extract and assess responses automatically. For each trial, the model compared the answer to the provided correct choice, generating a simple binary score (1 for correct, 0 for incorrect). While the evaluation focused on final answers, rare cases occurred where the model based its judgment on the reasoning rather than the final choice. These cases were manually reviewed and corrected when identified. Finally, the average accuracy and standard deviation of each category were calculated and plotted. When selectively compared to manual checks (>2000 answers), the method was found to be reliable, consistently identifying correct answers with over 95% accuracy. The prompt is shown below:
Few-shot learning involves providing the LLM with a few examples of the task at hand, enabling it to learn the pattern and apply it to unseen questions or problems.47 To use LLMs as predictive models, we fed the few-shot examples to the prompt windows of the LLMs. Starting with an instruction, compositions were restructured by separating each element with a space and then paired to their corresponding yield strengths. We varied the number of few-shot examples from 5 to 25 to observe how LLMs' prediction accuracy scales with data size. Beyond 25 points, some models suffered from limited prompt windows. An example of the prompt containing these few-shot examples is shown below.
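As a rough illustration of the construction described above, the following Python sketch assembles such a few-shot prompt from (composition, yield strength) pairs. The helper names, instruction wording, and compositions are illustrative assumptions, not the actual prompt or data from matbench_steels.

```python
import re

def space_separate(composition: str) -> str:
    """Insert spaces between element symbols and their fractions,
    e.g. 'Fe0.62C0.01Mn0.2' -> 'Fe 0.62 C 0.01 Mn 0.2'."""
    tokens = re.findall(r"([A-Z][a-z]?)([0-9.]+)", composition)
    return " ".join(f"{el} {frac}" for el, frac in tokens)

def build_fewshot_prompt(examples, target_composition, instruction):
    """Assemble the instruction, the (composition, yield strength) pairs, and
    the unanswered target composition into a single prompt string."""
    lines = [instruction, ""]
    for comp, ys in examples:
        lines.append(f"Composition: {space_separate(comp)}")
        lines.append(f"Yield strength (MPa): {ys}")
        lines.append("")
    lines.append(f"Composition: {space_separate(target_composition)}")
    lines.append("Yield strength (MPa):")
    return "\n".join(lines)

# Hypothetical usage; compositions and values are not from matbench_steels.
prompt = build_fewshot_prompt(
    examples=[("Fe0.62C0.01Mn0.20Cr0.17", 1226.0), ("Fe0.70C0.02Ni0.28", 980.0)],
    target_composition="Fe0.65C0.015Mn0.18Cr0.155",
    instruction=("You are a materials scientist. Given the following steel "
                 "compositions and their yield strengths, predict the yield "
                 "strength of the final composition as a number in MPa."),
)
print(prompt)
```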
To compare the predictive capabilities of LLMs and traditional ML models, k-nearest neighbors (KNN) and random forest regressor (RFR) models were also implemented. For direct comparison, each RFR model was trained using the exact same data points that were provided to the LLMs in each few-shot setting. Specifically, for every prediction target, if the LLM received 10 few-shot examples as prompt context, the corresponding RFR model was trained using those same 10 compositions as its training set. To enable a more direct comparison with LLMs, we implemented two variants of the RFR model: one trained directly on the elemental compositions, where each element was represented as a feature with its corresponding fractional value, and another trained on MAGPIE features48 extracted from the compositions. The selected MAGPIE features are presented in the ESI.†
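A minimal sketch of the composition-feature RFR baseline is given below, assuming scikit-learn and a simple regex-based composition parser; the MAGPIE variant would replace the feature matrix with descriptors from a featurization library (e.g., matminer), and a KNN baseline follows the same pattern with KNeighborsRegressor.

```python
import re
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def composition_to_vector(composition: str, elements) -> np.ndarray:
    """Represent a composition string as a fixed-length vector of elemental
    fractions, one feature per element in `elements` (absent elements -> 0)."""
    fracs = dict(re.findall(r"([A-Z][a-z]?)([0-9.]+)", composition))
    return np.array([float(fracs.get(el, 0.0)) for el in elements])

def rfr_fewshot_baseline(fewshot, target_composition, elements):
    """Train a random forest on exactly the (composition, yield strength)
    pairs given to the LLM as few-shot context, then predict the target."""
    X = np.vstack([composition_to_vector(c, elements) for c, _ in fewshot])
    y = np.array([ys for _, ys in fewshot])
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    return float(model.predict(composition_to_vector(target_composition, elements)[None, :])[0])
```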
A retrieval-augmented method was used to evaluate the impact of the proximity of the few-shot examples on the predictive performance. Each composition was encoded using its elemental proportions and projected into a lower-dimensional space using principal component analysis (PCA). Given each prediction target, candidate few-shot examples were ranked based on their Euclidean distances (L2 norm) in the PCA-transformed space, enabling the selection of training examples with varying levels of similarity to the target composition. Three settings were chosen based on the distances: (1) random neighbors – few-shot examples were randomly sampled from the dataset without considering proximity; (2) nearest neighbors – examples closest to the prediction target in PCA space were selected to match its local distribution; (3) farthest neighbors – examples most dissimilar to the prediction target were selected to evaluate model generalization under distribution shift. The performance in each setting was evaluated using mean absolute error (MAE), which quantifies the average absolute difference between predicted and true yield strengths. These evaluations aim to probe the sensitivity of LLMs to the choice and proximity of few-shot examples.
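The neighbor-selection procedure can be sketched as follows, assuming compositions have already been converted to fractional element vectors; the number of PCA components is an assumption, as it is not specified in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

def select_fewshot_indices(X, target_idx, k, mode, rng=None):
    """Rank candidate examples by Euclidean (L2) distance to the target in
    PCA space and return k indices for one of the three settings:
    'nearest', 'farthest', or 'random'."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n_components = min(10, X.shape[1])  # assumption: not specified in the text
    Z = PCA(n_components=n_components).fit_transform(X)
    candidates = np.delete(np.arange(len(X)), target_idx)
    dists = np.linalg.norm(Z[candidates] - Z[target_idx], axis=1)
    if mode == "nearest":
        return candidates[np.argsort(dists)[:k]]
    if mode == "farthest":
        return candidates[np.argsort(dists)[-k:]]
    if mode == "random":
        return rng.choice(candidates, size=k, replace=False)
    raise ValueError(f"unknown mode: {mode}")

def mean_absolute_error(y_true, y_pred):
    """MAE between predicted and true yield strengths."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```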
Degradation type | Description | Goal
---|---|---
Unit mixing | Mixing and converting the units | To test LLMs' interpretation of different unit systems and calculation abilities |
Sentence reordering | Reordering the sentences in the questions | To assess LLMs' capability to maintain comprehension on varied sentence constructions and logical flow |
Synonym replacement | Replacing technical nomenclature with their synonyms | To evaluate the semantic understanding and stability of LLMs |
Distractive info | Adding non-materials-science-related distractive information to the questions | To test LLMs' ability to filter out irrelevant data |
Superfluous info | Adding materials-science-related superfluous information containing numerical values to the questions | To challenge LLMs' ability to identify relevant data without being misled by additional numeric details |
These modifications are expected to vary in their impact on the LLMs' performance, with some potentially degrading it due to their adversarial nature (such as reordering sentences and adding superfluous materials-science information) and others more realistically simulating conditions encountered in real-life scenarios. Considering the inherent variability due to the non-deterministic nature of LLMs, the test was repeated three times for the original, synonym replacement, and distractive info (same input texts). The unit mixing, sentence reordering, and superfluous info were randomized three times to introduce variability in the data for the evaluation. Finally, the accuracy of each category was calculated and reported.
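For concreteness, the sketch below implements two of these degradations (sentence reordering and superfluous information) in illustrative form; the sentence-splitting heuristic, helper names, and example sentences are assumptions rather than the exact perturbation scripts used in the study.

```python
import random
import re

def reorder_sentences(stem: str, rng: random.Random) -> str:
    """Randomly permute the sentences of a question stem (answer options are
    kept separate and untouched). Splitting on end-of-sentence punctuation
    followed by whitespace avoids breaking decimal numbers such as 12.0 mm."""
    sentences = re.split(r"(?<=[.?!])\s+", stem.strip())
    rng.shuffle(sentences)
    return " ".join(sentences)

def add_superfluous_info(stem: str, extras, rng: random.Random) -> str:
    """Append one materials-science sentence containing numerical values that
    is not needed to answer the question."""
    return f"{stem} {rng.choice(extras)}"

rng = random.Random(0)
stem = ("A cylindrical sample of stainless steel has a diameter of 12.0 mm "
        "and an initial length of 237.8 mm. It is loaded to a stress of 411.5 MPa. "
        "What is the elastic recovery after unloading?")
print(reorder_sentences(stem, rng))
print(add_superfluous_info(stem, ["The alloy's thermal conductivity is 16.2 W m^-1 K^-1."], rng))
```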
The material descriptions underwent systematic modifications mirroring those applied in the Q&A evaluations, except for unit mixing and synonym replacement. Note that, because of the highly templated nature of crystal structure descriptions, superfluous information in this context is better characterized as misleading information rather than simply extraneous text. During data preprocessing for LLM-Prop, all numerical values and units, such as bonding distances and angles, are replaced with a [NUM] token, to emphasize the model's focus on text-based understanding.18 Unit mixing might disrupt the preprocessing algorithm, and thus was excluded from the analysis. Synonym replacement was excluded because the original terminology was already highly specific and lacked equivalent synonyms. Furthermore, we conducted a truncation study of textual degradation to examine the model's resilience against structural and length variations in the input data, as well as to explore which aspects of the descriptions the model relies on for predictions. We manipulated the order and fraction of sentences included, testing configurations including (1) original order, which prioritizes the initial information in a description, (2) reverse order, which prioritizes the sentences from the end of a description, (3) random order, shuffling the information, and (4) sides-to-middle, which deprioritized central information. The impact of these textual degradations was quantitatively assessed by measuring the resultant prediction error in MAE.
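A sketch of the four truncation configurations is shown below; it assumes each description is a plain string of period-separated sentences, and the choice to emit the kept sentences in the prioritized order (rather than restoring the original reading order) is an implementation assumption.

```python
import random
import re

def reorder_then_truncate(description: str, fraction: float, order: str, rng=None) -> str:
    """Re-order the sentences of a crystal-structure description according to one
    of the four truncation-study configurations, then keep the first `fraction`
    of the re-ordered list."""
    rng = rng or random.Random(0)
    sentences = re.split(r"(?<=[.])\s+", description.strip())
    if order == "original":
        ordered = sentences
    elif order == "reverse":
        ordered = sentences[::-1]
    elif order == "random":
        ordered = sentences[:]
        rng.shuffle(ordered)
    elif order == "sides_to_middle":
        # alternate sentences from the two ends so central ones are kept last
        ordered, lo, hi = [], 0, len(sentences) - 1
        while lo <= hi:
            ordered.append(sentences[lo]); lo += 1
            if lo <= hi:
                ordered.append(sentences[hi]); hi -= 1
    else:
        raise ValueError(f"unknown order: {order}")
    k = max(1, round(fraction * len(sentences)))
    return " ".join(ordered[:k])
```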
Among the evaluated pre-reasoning models, claude-3.5-sonnet-20240620 achieved the highest accuracy across all difficulty levels, with over 0.8 accuracy. Notably, the reasoning model, DeepSeek-R1, demonstrated competitive performance with over 0.85 accuracy across all difficulty levels, closely matching claude-3.5-sonnet-20240620 on easy and medium questions and surpassing all models on hard questions with an accuracy of 0.93. The hard questions predominantly involve complex, multi-step reasoning or advanced mathematical calculations, tasks that typically present substantial challenges for single-pass pre-reasoning models. This superior performance by DeepSeek-R1 clearly highlights its inherent strength in tasks demanding deeper analytical and mathematical reasoning compared to the pre-reasoning models.
The older llama2 models performed at or slightly below the baseline score of 0.25, equivalent to random guessing, while the newer llama3.3-70B-instruct achieved comparable accuracy to gpt-4o-2024-11-20 on easy and medium questions with about 0.8 and 0.7 accuracy, respectively.
Upon implementing the expert prompt (see expert prompt), we observe consistent performance improvement across almost all models and question types. The improvement is more pronounced for the older models, suggesting that the expert prompt can enhance the reasoning of models with weaker baseline capabilities. However, the expert prompt provides minimal benefit for the newer pre-reasoning models on the easy questions, likely because the extensive reasoning process induced by the expert prompt contributes little to performance on simple conceptual questions that rely primarily on factual recall. Interestingly, DeepSeek-R1 shows no performance improvement on hard questions upon implementing the expert prompt, suggesting that its reasoning capabilities are already effectively saturated by its built-in iterative reasoning mechanism, such that additional explicit prompting does not further augment its performance.
We further investigated why the smaller llama2 models (llama-13b-chat and llama-7b-chat) scored lower than the baseline without the expert prompt. Despite being chat models, they sometimes failed to understand the intent when instructions were not provided. Instead of answering, they often attempted to “complete” the questions. Once the expert prompt was implemented, these smaller models could follow the instructions and attempt to solve the questions, and their performance improved to around, and sometimes above, the baseline score. However, their overall performance remained weak due to their limited capabilities.
Overall, the observed performance trends align with expectations: more recent and larger models consistently demonstrate enhanced capabilities in domain-specific Q&A tasks compared to their predecessors. Additionally, prompt engineering demonstrated effectiveness as a strategy for enhancing model performance when handling more complex questions, especially for older or smaller models with limited baseline capabilities. On the other hand, the reasoning model, DeepSeek-R1, exhibits inherently superior performance in complex analytical and mathematical reasoning tasks, achieving high accuracy even without specialized prompting.
When trained with farthest neighbors, models generally exhibit high MAE with no clear trend as the number of neighbors increases. Most models perform worse than the KNN and RFR models except for claude-3.5-sonnet-20240620, which slightly outperforms the RFR models but still exhibits high MAE. These results suggest that, when provided with distant data points, both LLMs and traditional ML models struggle to make valid predictions. This highlights a key challenge in OOD generalization, as training examples that are too dissimilar to the test sample prevent models from capturing meaningful structure-property relationships, leading to higher prediction errors.
For the random neighbors training set, the LLMs' performance consistently improves as the dataset size increases. This suggests that randomly composed few-shot examples offer a more balanced and diverse learning environment, allowing models to develop more robust generalization. The claude-3.5-sonnet-20240620 and gpt-4o-2024-11-20 models consistently outperform the RFR models as the data size increases, indicating that their more sophisticated architectures and larger training corpora enhance their ability to analyze and interpret complex data relationships. On the other hand, the smaller and older LLMs (i.e., cohere-command-r7b-12-2024 and gpt-3.5-turbo-1106) exhibit higher MAE values throughout. The random neighbors setting appears to challenge these models to a greater degree, likely due to their smaller scale and less extensive pretraining, which limit their ability to generalize effectively to diverse inputs without fine-tuning or additional data processing. However, while their overall performance remains lower, their MAE decreases more significantly with more few-shot examples, suggesting that these LLMs can benefit from additional context.
The nearest neighbors represent the most relevant data points in the compositional space to the test points. As expected, the KNN shows an increase in MAE as the number of neighbors grows, since additional neighbors are more distant from the prediction target. If LLMs relied solely on the provided information without additional internal computation, a similar performance decline would be expected. In contrast, as the data size increases, all the LLMs show a consistent decrease in MAE and outperform the KNN model beyond 5 points. Among traditional models, the RFR trained directly on elemental compositions outperforms the version trained on MAGPIE features. This observation is consistent with the training data selection strategy, which was based on compositional similarity to the prediction target and thus more aligned with raw compositions than with derived features. Nonetheless, claude-3.5-sonnet-20240620 still consistently outperforms both RFR models, suggesting that advanced LLMs can capture more complex relationships in the data rather than solely relying on interpolation from the provided examples. Notably, with 25 nearest neighbors, claude-3.5-sonnet-20240620 achieves an MAE of 80.5, nearly matching the best-performing ML model, TPOT-Mat, which achieves an MAE of 79.9 on the matbench_steels dataset.54 However, it is important to note that this is not a direct comparison, as TPOT-Mat employs a 5-fold nested cross-validation method on Matbench datasets,34 which uses 80% of the data and is likely to result in better performance. The results highlight the potential of LLMs in data-lean materials property prediction tasks without the need for feature engineering, particularly given their general-purpose design and lack of task-specific fine-tuning.
The results suggest that pretrained LLMs may exhibit adaptability to new predictive challenges using ICL, particularly when data availability is limited. While their ability to extract patterns from a small number of examples is promising, their performance remains task-dependent and may not generalize across all types of property predictions. A key insight is that LLMs can be potentially valuable in early-stage research or exploratory studies in materials science, where data may be scarce or costly to obtain. One potential use case is active learning, where LLMs help identify the most informative data points for experimental validation, optimizing the data acquisition process and reducing the number of required experiments while still achieving meaningful insights. However, as the number of data points increases, most LLMs suffer from limited prompt windows, which make such applications computationally expensive or impossible, in which case fine-tuned LLMs and traditional machine learning models with dedicated training may be more effective.
To investigate the model's predictive behaviors under these different settings, we analyzed the parity plots of the claude-3.5-sonnet-20240620's predictions when utilizing 25 neighboring data points, as shown in Fig. 4. A parity plot compares the predicted values against the ground truth values. Perfect predictions fall along the diagonal line while deviation from this line indicates prediction errors. Alongside these plots, the figure also includes histograms of the top five most frequently predicted yield strength values, to investigate whether the model is merely guessing a few commonly present values (shown as the red points in the parity plots). This behavior is known as “mode collapse”, whereby a generative model can favor a certain output due to overfitting to its pretraining data or lack of generalization capability.55 Understanding mode collapse is crucial for evaluating the robustness of LLMs because it directly impacts the model's reliability and utility in practical applications. By identifying the mode collapse behavior, one can evaluate the validity of those predictions and potentially improve the performance.
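As a simple illustration of how such a histogram can be computed, the snippet below counts the most frequently repeated predictions across a set of test compositions; the rounding tolerance and example values are arbitrary and purely illustrative.

```python
from collections import Counter

def top_repeated_predictions(predictions, top_n=5):
    """Count how often each predicted yield strength recurs across test
    compositions; heavy repetition of a few values is the signature of
    the mode collapse behavior discussed above."""
    counts = Counter(round(float(p), 1) for p in predictions)  # light rounding groups near-identical outputs
    return counts.most_common(top_n)

# Illustrative values only; real inputs would be the LLM predictions behind the parity plots.
preds = [1200.0, 1350.5, 1200.0, 980.2, 1200.0, 1350.5, 1412.7, 1200.0]
print(top_repeated_predictions(preds))  # e.g. [(1200.0, 4), (1350.5, 2), ...]
```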
In the farthest neighbors setting, the red points in the figure form horizontal lines, indicating that the model frequently predicts the same yield strength values regardless of composition. This suggests that it fails to capture the underlying relationship between composition and yield strength effectively. The histogram further reveals a strong mode collapse behavior, with the model repeatedly predicting a set of values. This suggests that the model may be defaulting to a “safe” prediction range when provided with less relevant examples. This aligns with the shortcut learning behavior observed in LLMs, where models rely on superficial correlations rather than learning meaningful patterns from the data.56 Instead of extrapolating from compositional trends, the model may be leveraging spurious cues from its training distribution, leading to repetitive and less informative predictions. In the random neighbors setting, the model shows better overall performance and a reduced mode collapse behavior. This suggests that introducing more variability into the few-shot examples helps the model to better understand the underlying patterns that predict yield strength. The nearest neighbors setting exhibits the best performance, suggesting that higher proximity can lead to more accurate predictions. The mode collapse behavior is significantly reduced compared to the farthest neighbors and random neighbors, showing a greater diversity in the model's output.
The observations show varying degrees of mode collapse based on the proximity of prompt examples to the test point. For instance, when provided with more closely related few-shot examples, the model exhibits stronger predictive signals with fewer repeated outputs. The results from Fig. 3 and 4 suggest that LLMs do not appear to develop an intrinsic understanding of structure-property relationships but instead rely heavily on contextual information from the prompt. Pretrained LLMs are uncalibrated classifiers that can be overconfident in OOD scenarios, causing them to default to high-probability responses from their pretraining data and lead to repeated or generic outputs.57–59 The mode collapse behavior and poor OOD generalization may be exacerbated by token frequency biases from overexposure to syntactic or uninformative data during pretraining,60 as well as by LLMs' limited compositional reasoning capabilities, which hinder their ability to generalize from dissimilar few-shot examples.61 Although this limits the utility of LLMs in extrapolative property prediction tasks in OOD settings, the observed mode collapse can be repurposed as a proxy for epistemic uncertainty. In the context of active learning, the mode collapse behavior could serve as a self-diagnostic tool for guiding data acquisition – when a model repeatedly generates identical outputs across varied inputs, it may reflect a lack of confidence or failure to generalize. Such occurrences can be used to identify regions of high model uncertainty where additional experimental validation is most needed.
Ranking the degradations by severity on the easy-level questions for gpt-3.5-turbo-0613, sentence reordering causes the smallest performance drop, followed by synonym replacement, distractive info, unit mixing, and superfluous info. The performance on the hard-level questions is close to the baseline score, indicating that gpt-3.5-turbo-0613 struggles with complex queries regardless of textual modifications, and thus will not be discussed in detail. The larger error bars on medium and hard questions suggest that LLMs tend to generate more varied responses to complex and lengthy queries. In contrast, the newer and more advanced model, gpt-4o-2024-11-20, shows minimal degradation on the easy-level questions, maintaining an accuracy above 0.8, except for unit mixing. This suggests that the model is better at handling text changes and more robust than its predecessor. However, performance degradation becomes more noticeable on medium and hard questions. DeepSeek-R1, as a reasoning model, exhibits the strongest robustness among the three. Across all perturbation types and difficulty levels, it consistently achieves high accuracy, often above 0.9. As with gpt-4o-2024-11-20, unit mixing causes the most notable degradation, suggesting minor limitations in numerical reasoning and unit conversion. Nonetheless, its performance remains stable under all syntax-disrupting and distractive perturbations, demonstrating its strong parsing and reasoning capabilities overall.
Sentence reordering has little effect on the performance of all three models on easy-level questions, indicating that they can effectively parse and extract key information even when the natural flow of a question is altered. However, the impact becomes more significant on medium and hard-level questions, where reordering appears to disrupt comprehension more significantly. This suggests that while the models exhibit strong syntactic flexibility in simpler cases, they may rely more heavily on common question structures when dealing with more complex queries.
The slight performance drop with synonym replacement in gpt-3.5-turbo-0613 and gpt-4o-2024-11-20 suggests that both models are somewhat sensitive to changes in terminology, leading to inconsistencies in their responses. This reveals a reliance on specific wording for recognition and comprehension in materials science. Unlike humans, who can flexibly grasp the conceptual continuity behind varied expressions, these models' struggles with synonym replacement emphasize the need for advanced training that prioritizes semantic networks over mere word recognition.62 In contrast, DeepSeek-R1 shows a slight improvement under synonym replacement, indicating that its reasoning-oriented architecture may better capture underlying semantic relationships and handle paraphrased inputs more effectively.
Introducing distractive information simulates a real-world scenario where irrelevant data often accompanies critical information, requiring sharp focus and analytical precision. Improving LLMs' ability to filter out irrelevant information is crucial for more effective information retrieval, problem-solving, and data interpretation.63 While gpt-3.5-turbo-0613 shows slight degradation on easy-level questions, both gpt-4o-2024-11-20 and DeepSeek-R1 generally maintain or even improve their performance across difficulty levels. This suggests that the added information may inadvertently help the more advanced models by reinforcing key concepts or encouraging deeper contextual reasoning, aligning with the mechanisms of guided reasoning and selective attention.
Mixing and converting the units tests LLMs' abilities to perform numerical reasoning and apply mathematical concepts within a linguistic context. The added complexity introduced by unit mixing degraded the performance of all the models, indicating challenges in handling numerical transformations embedded in text. Although some state-of-the-art LLMs support multi-modal applications and function calls to perform calculations,64 accurately identifying and converting units within large text can still be critical. Improving this ability could enhance LLMs' effectiveness in tasks such as information retrieval, data interpretation, and scientific analysis, where precise numerical reasoning is essential.
Superfluous information differs from distractive information in that it is more relevant to the questions themselves. The extent of performance degradation is likely influenced by the type and relevance of the superfluous information provided. The results show that gpt-3.5-turbo-0613 struggles significantly with superfluous information, experiencing the most severe performance degradation among all modifications. This suggests that it has difficulty filtering out non-essential details, leading to confusion or misinterpretation. In contrast, gpt-4o-2024-11-20 remains largely unaffected on easy and medium questions, but experiences moderate degradation on hard questions. Similarly, DeepSeek-R1 experiences a slight drop on medium and hard questions, though it still maintains high overall performance. These results suggest that while the more advanced models demonstrate stronger information selection capabilities, their ability to filter out unnecessary details weakens as question complexity increases. For LLMs, distinguishing the necessary information from merely relevant but non-essential details is a more challenging cognitive process, mirroring advanced human problem-solving. It requires an understanding of the problem's objective, prioritizing information based on the question, and applying only the information that will lead to the correct conclusion. This highlights a potential area for improvement in LLMs, particularly in their ability to assess and prioritize critical information in complex reasoning tasks.
Degradation | Original | Distractive info | Sentence reordering | Misleading info
---|---|---|---|---
MAE (eV) | 0.286 | 0.287 | 0.323 ± 0.002 | 0.398 ± 0.005
After adding distractive information to the material descriptions, the LLM-Prop model showed negligible degradation, indicating that this application-specific model can effectively differentiate relevant from irrelevant information. This resilience, likely due to targeted training and fine-tuning on domain-specific texts, enables it to focus on the key features for band gap prediction. This showcases the noise-filtering capability of trained and fine-tuned transformer models, a robustness that traditional ML models may lack.
Sentence reordering increased the MAE by 12.9%, suggesting the model's reliance on structured descriptions for accurate predictions. In the earlier MSE-MCQs degradation study, the effect of sentence reordering was less pronounced, indicating that larger general-purpose LLMs, trained on more varied texts, exhibit better contextual understanding and are less sensitive to order changes.
The presence of misleading information, particularly an additional sentence from another material's description, leads to a 39% increase in MAE. This substantial degradation indicates that while the model can filter out irrelevant distractive noise, it struggles considerably when faced with data that is contextually relevant to the specific prediction task. Notably, this impact arises from the addition of just a single misleading sentence, highlighting the model's vulnerability to subtle contextual inconsistencies that misdirect its predictions.
To further assess the model's robustness and determine which description elements are essential for prediction accuracy, we conducted a truncation study that alters the order and length of the input description. As shown in Fig. 6, the description length is expressed as percent sentence inclusion, ranging from 10% to 100%, and MAE is used as the measure of prediction accuracy.
When the number of description sentences is incrementally increased, the MAE rapidly decreases and is minimized at 100% sentence inclusion. Interestingly, in the random order, reversed order, and sides-to-middle configurations, the initial MAE at 10% sentence inclusion is notably lower than in the original order, with some configurations roughly halving the MAE relative to the original setting. This indicates that the initial sentences may not contain the most useful information for prediction. The MAEs begin to converge around 50% sentence inclusion, beyond which differences become statistically insignificant. Notably, in the random order setting, there is virtually no variation in the MAE across three different sentence shuffles. This suggests that LLM-Prop can effectively extract key information and deliver consistent predictions despite variations in sentence order.
At 40% sentence inclusion, the reversed order yields the lowest MAE, indicating that sentences at the end of descriptions contain crucial predictive information. However, by 50% sentence inclusion, the performance of the reversed order begins to align with that of the random order, suggesting that central information in the descriptions may not be as crucial for prediction accuracy. Since the random order includes more initial sentences than the reversed order, this suggests that the first sentences may contribute less relevant details, particularly at lower inclusion percentages. Based on these insights, we developed the sides-to-middle approach, which prioritizes information at the beginning and the end. This approach consistently outperforms the other configurations between 40% and 70% sentence inclusion, achieving the lowest MAE in this range. The error continues to decrease and reaches its minimum at full sentence inclusion, where its MAE is only 5.8% higher than that of the original setting. This result suggests that while the original order remains optimal, the contextual framing provided by the beginning and end of descriptions is particularly important for model accuracy.
This truncation study showcases that the fine-tuned model can perform effectively even when provided with significantly reduced prompts. We found that diverging from the training setup (i.e., changing the textual order of the prompt) can sometimes result in improved performance at truncated data volumes. This counterintuitive result suggests that highly templated training or fine-tuning data can lead to unexpected effects. Consequently, this implies two key considerations: (1) training templates should be diverse to prevent models from overfitting to unimportant patterns, and (2) when using a fine-tuned model trained on a specific template, it may not always be optimal to match the template during inference. These insights highlight the potential for optimizing training costs while maintaining performance.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5dd00090d |