Daegun Lee,ab Jiwoo Choi,a Gyeong Hoon Yi,d Seok Su Sohn,b Byungju Lee*ac and Donghun Kim*d
aComputational Science Research Center, Korea Institute of Science and Technology (KIST), Seoul 02792, Republic of Korea. E-mail: blee89@kist.re.kr
bDepartment of Materials Science and Engineering, Korea University, Seoul 02841, Republic of Korea
cNanoscience and Technology, KIST School, University of Science and Technology, Seoul, Republic of Korea
dDepartment of Materials Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea. E-mail: donghun.kim@kaist.ac.kr
First published on 29th April 2026
Large language models (LLMs) remain unreliable for materials science question answering because correct conclusions depend on detailed experimental conditions. Here, we show that a structured, domain-specific knowledge dataset is a critical prerequisite for trustworthy LLM-assisted question answering in materials science. Using water-splitting catalysis as a proof of concept, we curate the literature into a hierarchical, machine-queryable knowledge base encoding material synthesis, composition, and performance. This structured representation improves condition-aware retrieval and reduces context mismatches that commonly arise from superficial semantic similarity. Combined with query reformulation, it achieves 85.6% accuracy on 202 DOI-identification questions versus 21.3% for an unstructured baseline, while reducing operating cost by 39%. To assess broader free-form scientific question answering beyond exact-match retrieval, we further evaluate 202 descriptive questions using the RAGAS framework, which indicates more faithful, evidence-grounded answers. Together, these results show that structured domain knowledge can substantially improve the reliability of LLM-based materials science question answering.
Fig. 1 illustrates these challenges. For general questions, an LLM combined with an external database (a standard retrieval-augmented generation, or RAG, approach) can readily produce accurate and helpful responses (Fig. 1a).12,13 Yet for domain-specific queries, such as those in water-splitting catalysis, the same framework frequently fails because it cannot correctly interpret technical terms or experimental contexts (Fig. 1b).14–16 This highlights that while RAG enhances factual grounding in general settings, it remains insufficient for specialized scientific domains requiring fine-grained contextual understanding.14
Beyond RAG, other strategies have been proposed to adapt LLMs to specialized domains, including continued pre-training (CPT) and supervised fine-tuning (SFT).17,18 While CPT and SFT can inject domain expertise, they often suffer from catastrophic forgetting or high data labeling costs.19 In contrast, RAG augments generation with external knowledge without modifying model weights, and recent studies suggest that RAG consistently outperforms unsupervised fine-tuning for factual knowledge acquisition.20 Significant progress has been made in other specialized domains, such as BloombergGPT in finance21 and BioGPT in biomedicine,22 demonstrating that mixing domain-specific and general corpora can preserve general capabilities while enhancing expertise.
For materials science specifically, models like MatBERT and MatSciBERT have improved performance on named entity recognition (NER) tasks.23,24 More recently, billion-parameter scale LLMs like HoneyBee and LLaMAt have been developed through instruction fine-tuning and large-scale continued pre-training, respectively, often outperforming general models like GPT-4 on materials science benchmarks.25,26 In parallel, open-source LLMs have rapidly advanced, with Meta's LLaMA 3 demonstrating performance comparable to proprietary models like GPT-4 across diverse benchmarks while enabling local deployment without API costs and full reproducibility of results.27 Beyond base models, agent systems such as ChemCrow and HoneyComb have emerged, integrating expert-designed tools and curated knowledge bases to autonomously execute complex tasks.28,29 Despite these advances, benchmarks like MaScQA reveal that even state-of-the-art models with chain-of-thought prompting still struggle with conceptual errors in specialized fields.30
The original RAG framework, or conventional RAG (C-RAG), consists of four stages: indexing, retrieval, augmentation, and generation.12 However, C-RAG often fails with complex scientific queries where semantic similarity alone is insufficient.14 To address this, query reformulation techniques have evolved. Query expansion methods such as HyDE and Query2Doc generate hypothetical or pseudo-documents to improve retrieval.31,32 Query rewriting approaches like Rewrite-Retrieve-Read and RaFe train rewriter models using reinforcement learning or ranking feedback.33,34 Query decomposition methods, such as GenDec and RQ-RAG, further break down multi-hop questions into simpler sub-queries to improve reasoning transparency.35,36 Yet, a critical gap remains: the impact of database structuring on RAG performance has not been systematically evaluated in scientific domains, and existing reformulation methods often lack the domain-specific precision required for materials science.
To address this limitation, we designed a domain-aligned RAG framework that integrates structured knowledge and query reformulation for more precise retrieval and reasoning. We selected water-splitting catalysis as a representative proof-of-concept domain due to its rich but heterogeneous literature and well-defined quantitative benchmarks. Performance in this field depends on multiple interacting variables, such as catalyst composition, synthesis route, electrolyte and its pH, and testing protocols,37–40 making it a stringent test for retrieval and reasoning quality. In this work, we construct a structured database of water-splitting catalysts from a large corpus of scientific literature through paragraph classification, synthesis-method classification, and named entity recognition (NER). We then develop a query reformulation RAG (QR-RAG) pipeline that couples sparse lexical retrieval with dense vector retrieval for hybrid search.
Our QR-RAG approach differs from existing methods in three key aspects: (1) it combines decomposition with query optimization to preserve critical domain-specific terms; (2) it employs an adaptive two-step process that invokes decomposition only when initial retrieval fails, reducing overhead; and (3) it integrates hybrid retrieval for exact terminology matching alongside semantic similarity. We evaluate this framework with complementary benchmarks covering distinct aspects of scientific question answering. The combination of structured domain knowledge and QR-RAG improves accuracy on 202 DOI-identification questions from 21.3% for conventional RAG on raw literature to 85.6%, while reducing operating cost by 39%. We also assess 202 descriptive questions using the RAGAS framework, showing more faithful and evidence-grounded responses beyond exact-match retrieval. Together, these results highlight the value of structured domain knowledge for reliable domain-specific scientific Q/A.
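To make the second aspect concrete, a minimal sketch of the adaptive two-step control flow is shown below. All helper functions (`reformulate`, `decompose`, `hybrid_retrieve`, `is_sufficient`, `generate`) are hypothetical placeholders standing in for the components described in this work, not its actual implementation.

```python
from typing import Callable, List

def answer_query(
    query: str,
    reformulate: Callable[[str], str],           # strips filler, keeps condition terms
    decompose: Callable[[str], List[str]],        # splits multi-hop questions into sub-queries
    hybrid_retrieve: Callable[[str], List[str]],  # sparse (lexical) + dense (vector) retrieval
    is_sufficient: Callable[[List[str]], bool],   # checks whether retrieved context can answer
    generate: Callable[[str, List[str]], str],    # LLM answer generation over retrieved context
) -> str:
    """Adaptive two-step QR-RAG sketch: decomposition is invoked only if the
    first retrieval pass fails, which limits extra LLM calls."""
    # Step 1: reformulate once and retrieve.
    simple_query = reformulate(query)
    context = hybrid_retrieve(simple_query)

    # Step 2: only if the first pass is insufficient, decompose and retry.
    if not is_sufficient(context):
        sub_queries = decompose(simple_query)
        context = [doc for sq in sub_queries for doc in hybrid_retrieve(sq)]

    return generate(query, context)
```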
For structured database construction, we employ a three-stage pipeline to systematically extract and organize synthesis information from scientific literature. In stage 1, paragraph classification assigns each paragraph to one of four classes: system, performance, synthesis, or others, enabling focused extraction from relevant content. In stage 2, synthesis method classification assigns synthesis paragraphs to one of seven classes: vapor phase, solid phase, electrodeposition, hydro/solvothermal, precipitation, sol–gel, or others. In stage 3, we apply relational named entity recognition (RE-NER) to extract five critical entities: target, precursor, solvent, additive, and substrate. RE-NER selectively extracts only those entities that are relationally connected within the paragraph. A description of each class and entity type (Fig. S1) was provided to the models before database construction.
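For illustration, a single paper's entry in such a structured database might take a hierarchical form along the following lines; the field names and all values are hypothetical placeholders for this sketch, not the schema or data used in this work.

```python
import json

# Hypothetical record illustrating the hierarchy produced by the three-stage
# pipeline: paragraph class -> synthesis method -> relational entities.
record = {
    "doi": "10.1000/example.doi",          # placeholder identifier
    "system": {"catalyst": "NiFe2O4", "application": "OER"},
    "synthesis": {
        "method": "hydro/solvothermal",    # one of the seven stage-2 classes
        "entities": {                      # the five RE-NER entity types (stage 3)
            "target": ["NiFe2O4"],
            "precursor": ["Ni(NO3)2·6H2O", "Fe(NO3)3·9H2O"],
            "solvent": ["deionized water"],
            "additive": ["urea"],
            "substrate": ["nickel foam"],
        },
    },
    "performance": {"overpotential_mV": 250, "tafel_slope_mV_dec": 45},  # placeholder values
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```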
The Q/A system processes user queries through a multi-step pipeline leveraging the structured database constructed above. Upon receiving a user query, the system performs query reformulation that simplifies and optimizes complex questions for improved retrieval accuracy. The reformulated queries are then used to retrieve relevant source documents from the structured database through combined dense vector retrieval and sparse lexical retrieval. Finally, the retrieved documents serve as context for the language model to generate comprehensive answers to the user's questions, even when they are complex, based on the most relevant supporting documents.
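The hybrid search step can be pictured as fusing a sparse (lexical) ranking and a dense (embedding) ranking, for example with reciprocal rank fusion; the sketch below is a simplified illustration under that assumption, not the exact retriever used in this study.

```python
from typing import Dict, List

def reciprocal_rank_fusion(
    sparse_ranking: List[str],   # doc ids ordered by lexical (BM25-style) score
    dense_ranking: List[str],    # doc ids ordered by embedding similarity
    k: int = 60,                 # common RRF damping constant
    top_k: int = 5,              # number of passages passed to the generator
) -> List[str]:
    """Fuse two rankings so that documents ranked highly by either exact-term
    matching or semantic similarity surface near the top."""
    scores: Dict[str, float] = {}
    for ranking in (sparse_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: a passage matching the exact electrolyte term and a semantically
# similar passage both end up in the fused top-k.
print(reciprocal_rank_fusion(["p3", "p1", "p7"], ["p1", "p5", "p3"]))
```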
In stage 1 and stage 2 classification tasks, MatBERT was fine-tuned with 1260 and 720 training examples, while LLMs were applied with few-shot examples (Fig. 3a) using general instruction prompts with Chain-of-Thought reasoning (Fig. S2).45 GPT-4 Turbo46 and LLaMA 3.3-70B were applied with 40 and 35 few-shot examples for paragraph and synthesis classification, respectively. For HoneyBee-7B, due to the increased context length from Chain-of-Thought prompting and its limited context window, only one example per category was used in the few-shot setting. On a test set of 240 paragraphs in stage 1, MatBERT achieved an F1-score of 0.960, GPT-4 Turbo achieved 0.950, LLaMA 3.3-70B achieved 0.932, and HoneyBee achieved 0.485 (Fig. 3b). In stage 2, MatBERT and GPT-4 Turbo both achieved the highest F1-score of 0.964 on a test set of 280 paragraphs, while LLaMA 3.3-70B achieved 0.834 and HoneyBee achieved 0.480 (Fig. 3c). These results indicate that GPT-4 Turbo performs comparably to fine-tuned MatBERT in classification tasks, while LLaMA 3.3-70B shows competitive but slightly lower performance. HoneyBee, despite being a materials science domain-specific model, showed limited performance in the few-shot setting, likely due to the restricted number of examples imposed by its context window constraints.

In the RE-NER task, MatBERT was fine-tuned with 300 training examples, while LLMs were applied with few-shot examples and an additional filtering step to refine the results. To address material naming ambiguity in scientific literature, we also implemented a normalization step that converts material names to their molecular formula representations (Fig. S3). This ensures consistency across different naming conventions such as “NiFe LDH” vs. “NiFeOOH” or other abbreviations commonly used in the literature (a minimal illustration of this normalization is given below). GPT-4 and LLaMA 3.3-70B were applied with 10 few-shot examples, and HoneyBee with 5 few-shot examples due to its limited context window. On a test set of 390 paragraphs, MatBERT achieved an average F1-score of 0.805, while GPT-4 achieved 0.947, LLaMA 3.3-70B achieved 0.755, and HoneyBee achieved 0.452 (Fig. 3d). RE-NER is a particularly challenging task because entities must be bound to their correct targets while preserving relational coherence (Fig. S4). To address this difficulty, we applied LLMs with few-shot learning using detailed NER protocol instructions (Fig. S2). The performance gap across models can be attributed to differences in parameter capacity and instruction-following capabilities. GPT-4, with the largest parameter scale and advanced instruction-following ability, significantly outperformed all other models. LLaMA 3.3-70B showed moderate performance, outperforming MatBERT in target extraction but falling behind in the other entity types. HoneyBee, constrained by its limited context window and fewer few-shot examples, showed the lowest performance across all entity types.
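The normalization step referenced above can be illustrated with a small alias map; the entries below are illustrative assumptions only, and, following the example in the text, variant names are collapsed into a single formula-style label (the actual procedure is described in Fig. S3).

```python
# Illustrative alias map (hypothetical entries): variant literature names are
# collapsed into one canonical, formula-style label. Real coverage would be
# far broader and could itself be generated with an LLM.
ALIASES = {
    "nife ldh": "NiFeOOH",
    "nife layered double hydroxide": "NiFeOOH",
    "nifeooh": "NiFeOOH",
    "iridium dioxide": "IrO2",
}

def normalize_material(name: str) -> str:
    """Return the canonical label for a material name, falling back to the
    original (stripped) string when no alias is known."""
    return ALIASES.get(name.strip().lower(), name.strip())

assert normalize_material(" NiFe LDH ") == "NiFeOOH"
assert normalize_material("CoP") == "CoP"   # unknown names pass through unchanged
```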
The cost analysis further highlights the trade-offs. Since MatBERT and GPT-4 Turbo achieved comparable F1-scores in the classification tasks, we focused our cost comparison on these two models. MatBERT requires a large amount of training data at the beginning, resulting in significant human labeling cost, but once trained, it incurs almost no additional cost. GPT-4, in contrast, requires only a small number of examples for setup, so the initial cost is low, but each classification generates API usage fees, causing the cost to increase progressively with more data. Given their similar performance, the choice between these two models can be determined by cost efficiency depending on dataset size. We found that GPT-4 is more cost-effective for datasets below 2500 paragraphs, with MatBERT becoming the better choice for larger volumes (Fig. S5). The resulting structured database comprises 2343 papers collected from major publishers including Elsevier, ACS, RSC, and Wiley, spanning research from 2003 to 2023 across a wide range of target metal elements (Fig. S6).
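The crossover behind this recommendation can be expressed as a simple break-even calculation; the cost figures below are placeholders chosen only so the crossover lands near the ~2500-paragraph threshold mentioned above, not the actual costs reported in Fig. S5.

```python
def break_even_paragraphs(one_off_labeling_cost: float, api_cost_per_paragraph: float) -> float:
    """Dataset size at which a one-off fine-tuning/labeling cost (MatBERT-style)
    equals cumulative per-paragraph API fees (GPT-4-style)."""
    return one_off_labeling_cost / api_cost_per_paragraph

# Placeholder figures for illustration only.
print(break_even_paragraphs(one_off_labeling_cost=250.0, api_cost_per_paragraph=0.10))  # -> 2500.0
```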
Table 1 DOI-identification accuracy and operating cost across database formats (raw HTML vs. structured JSON) and RAG methods (C-RAG vs. QR-RAG).

| Model | DB | Method | Single | Multiple | Avg | Cost |
|---|---|---|---|---|---|---|
| GPT-4o | HTML | C-RAG | 22.2% (24/108) | 20.2% (19/94) | 21.3% (43/202) | $29.2 |
| GPT-4o | HTML | QR-RAG | 49.1% (53/108) | 44.7% (42/94) | 47.0% (95/202) | $56.9 |
| GPT-4o | JSON | C-RAG | 81.5% (88/108) | 59.6% (56/94) | 71.3% (144/202) | $14.1 |
| GPT-4o | JSON | QR-RAG | 90.7% (98/108) | 79.8% (75/94) | 85.6% (173/202) | $17.8 |
| LLaMA 3.3-70B | HTML | C-RAG | 14.8% (16/108) | 10.6% (10/94) | 12.9% (26/202) | — |
| LLaMA 3.3-70B | HTML | QR-RAG | 40.7% (44/108) | 35.1% (33/94) | 38.1% (77/202) | — |
| LLaMA 3.3-70B | JSON | C-RAG | 70.4% (76/108) | 58.5% (55/94) | 64.9% (131/202) | — |
| LLaMA 3.3-70B | JSON | QR-RAG | 80.6% (87/108) | 70.2% (66/94) | 75.7% (153/202) | — |
A cost analysis further reveals significant efficiency gains. The JSON database shortens average context length and increases the proportion of relevant passages, which reduces tokens and retries. Query reformulation incurs additional cost on the HTML database because the reformulation step requires extra LLM calls before retrieval. However, this cost is offset on the JSON database since fewer and better-matched passages are supplied to the generator, thereby reducing overall token usage and retries. Operating cost decreases from $29.2 for HTML with C-RAG to $17.8 for JSON with QR-RAG, representing a 39% reduction (Table 1). Cost analysis was performed only for GPT-4o, as LLaMA 3.3-70B is an open-source model that does not incur API usage fees.
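For reference, the 39% figure follows directly from the two operating costs in Table 1:

\[ \frac{29.2 - 17.8}{29.2} = \frac{11.4}{29.2} \approx 0.39 \quad (\text{a } 39\% \text{ reduction}). \]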
To verify that the framework extends beyond DOI identification, we additionally evaluated 50 questions targeting numerical property extraction, including synthesis temperature, overpotential, and Tafel slope (Table S2). Using LLaMA 3.3-70B with the structured JSON database and QR-RAG, the system achieved 84.0% accuracy, which is higher than the DOI identification accuracy (75.7%) under the same configuration (Fig. S9). This result confirms that our approach effectively extracts specific scientific metrics from the literature.
Based on this evaluation framework, we systematically assessed performance across different database and pipeline configurations, using a total of 202 descriptive questions (Table S3). For GPT-4o, the JSON database with QR-RAG achieves the highest scores across all three metrics (answer relevance: 0.717, context relevance: 0.643, faithfulness: 0.662), while LLaMA 3.3-70B shows the same trend, with its highest scores in the same configuration (answer relevance: 0.769, context relevance: 0.561, faithfulness: 0.558) (Fig. 6). Although GPT-4o achieved higher context relevance and faithfulness scores overall, both models consistently showed that the JSON database outperforms the HTML database and that QR-RAG outperforms C-RAG across all metrics. Because this qualitative evaluation relies on LLM judges, the absolute scores should not be over-interpreted; they are nevertheless valuable as relative indicators across configurations. Moreover, the trends are consistent with the quantitative evaluation (Table 1), showing that the qualitative and quantitative assessments complement each other.
Fig. 6 Qualitative evaluation of the water-splitting Q/A system comparing raw (HTML) and structured (JSON) databases and RAG methods using GPT-4o and LLaMA 3.3-70B.
The effect of structuring follows from how the literature is organized. The three-stage pipeline separates text into system, performance, synthesis, and others, assigns synthesis paragraphs to seven common methods, and applies RE-NER to extract target, precursor, solvent, additive, and substrate in context. Records are stored in a structured database as hierarchical JSON, which enables the retriever to filter by section and method before ranking and to provide the generator with shorter, more focused passages. In practice, this reduces irrelevant context and improves retrieval precision. As the quantitative evaluation was based on questions restricted to requesting DOIs, once retrieval succeeded, the task reduced to locating factual information, leaving limited opportunity for generation errors. Thus, the observed accuracy gains (21.3% → 71.3%, 47.0% → 85.6%) serve as direct evidence that the structured database enhances retrieval precision, which in turn lowers token usage and increases the likelihood51 that the first retrieved set already contains the evidence needed to answer the user's question.
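A minimal sketch of this filter-then-rank behaviour over hierarchical records is shown below; the field names and scoring function are assumptions for illustration, reusing the hypothetical record layout sketched earlier.

```python
from typing import Callable, Dict, List, Optional

def filtered_retrieve(
    records: List[Dict],                 # hierarchical JSON records (hypothetical layout)
    section: str,                        # e.g. "synthesis" or "performance"
    method: Optional[str],               # e.g. "hydro/solvothermal"; None disables the filter
    score: Callable[[Dict], float],      # hybrid relevance score of a record for the query
    top_k: int = 5,
) -> List[Dict]:
    """Filter by section and synthesis method first, then rank, so the
    generator receives shorter and more focused passages."""
    candidates = [
        r for r in records
        if section in r
        and (method is None or r.get("synthesis", {}).get("method") == method)
    ]
    return sorted(candidates, key=score, reverse=True)[:top_k]
```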
The QR-RAG provides an independent improvement by pairing query reformulation with hybrid retrieval, which removes unnecessary words while retaining essential condition terms in the user's question. Because results in water-splitting catalysis depend strongly on operating conditions, questions that omit condition terms can surface paragraphs that read similarly but were measured under different settings. By focusing the query on the stated conditions, retrieval performance improves consistently on both the raw and the structured databases, showing that query reformulation enhances retrieval precision regardless of database type.52 Furthermore, even when retrieved paragraphs exhibit high semantic similarity (a common challenge in specialized scientific domains), the system reliably identifies the correct answer by interpreting domain-specific constraints in context (Fig. S10). A minimal sketch of this condition-preserving reformulation is given at the end of this discussion.

The comparative analysis of model performance underscores the critical role of model architecture, specifically the synergy between context window length and parameter capacity, in scientific Q/A tasks. While HoneyBee-7B is specifically fine-tuned for the materials science domain, its performance was significantly lower than that of general-purpose models like GPT-4o and LLaMA 3.3-70B. This gap is primarily attributed to HoneyBee-7B's limited context window, which restricted the number of few-shot examples and necessitated the exclusion of CoT reasoning in certain tasks, alongside its smaller parameter capacity, which affected instruction-following capabilities. Our results suggest that for complex domain-specific RAG systems, the advantage of a longer context window, which enables richer contextual guidance and the processing of multiple retrieved passages, combined with sufficient parameter scale, can outweigh the benefits of domain-specific fine-tuning at a smaller scale. A larger context window and higher capacity allow the model to act as a more effective reasoner over external knowledge, mitigating the need to internalize all domain expertise within the model weights.

To evaluate the generalizability of our approach, we tested the QR-RAG framework on databases of varying sizes and domains using LLaMA 3.3-70B with the structured (JSON) database. The smallest subset contained only the 123 gold-standard papers, and non-answer papers were progressively added to simulate increasing retrieval difficulty. Accuracy decreased as database size increased, dropping from 92.1% with 123 papers to 75.7% with the full 2343 papers. To further test generalizability across different research fields, we added a battery materials database (5000 papers) to the full OER database,53 resulting in a combined database of 7343 papers. Despite the significant increase in database size and domain diversity, the system achieved 75.2% accuracy, demonstrating that the benefits of structured databases and query reformulation extend to other domains and database scales (Fig. S11).
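The condition-preserving reformulation referenced above can be sketched as follows; the regular expressions and filler list are illustrative assumptions, since the actual reformulation in QR-RAG is performed by an LLM rather than hand-written rules.

```python
import re
from typing import List

# Illustrative patterns for condition terms that must survive reformulation.
CONDITION_PATTERNS = [
    re.compile(r"\d+(?:\.\d+)?\s*M\s+[A-Za-z0-9]+"),  # electrolyte, e.g. "1 M KOH"
    re.compile(r"pH\s*\d+(?:\.\d+)?"),                 # pH value
    re.compile(r"\d+\s*mA\s*cm-?2"),                   # current density, e.g. "10 mA cm-2"
]
FILLER = {"could", "you", "please", "tell", "me", "what", "is", "the", "of", "a", "in", "at"}

def reformulate(query: str) -> str:
    """Drop filler words but keep condition terms verbatim."""
    protected: List[str] = [m.group(0) for p in CONDITION_PATTERNS for m in p.finditer(query)]
    kept = [w for w in query.replace("?", "").split() if w.lower() not in FILLER]
    compact = " ".join(kept)
    # Re-append any protected condition span that the word filter broke up.
    for span in protected:
        if span not in compact:
            compact += " " + span
    return compact

print(reformulate("Could you tell me the overpotential of NiFe LDH at 10 mA cm-2 in 1 M KOH?"))
# -> "overpotential NiFe LDH 10 mA cm-2 1 M KOH"
```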
To comprehensively assess system performance, we employed three complementary evaluation approaches. DOI identification measures retrieval accuracy by testing whether the system locates the correct source literature. RAGAS-based assessment evaluates the quality of generated answers through context relevance, faithfulness, and answer relevance metrics. Numerical property extraction verifies the ability to extract precise scientific values such as synthesis temperature, overpotential, and Tafel slope with appropriate units. Each approach has inherent limitations: DOI identification does not assess answer quality, RAGAS metrics cannot validate the scientific correctness of numerical values, and numerical property extraction covers a subset of experimental parameters. Together, however, these evaluations provide complementary evidence of retrieval accuracy, answer quality, and quantitative information extraction capabilities.
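For intuition, the three RAGAS-style metrics can be approximated as simple ratios over LLM-judged claims and sentences; the sketch below is a didactic simplification of the metric definitions, not the actual RAGAS implementation used for the scores reported in this work.

```python
from typing import Callable, List

def faithfulness(claims: List[str], supported: Callable[[str], bool]) -> float:
    """Fraction of claims in the generated answer that are supported by the
    retrieved context (in RAGAS, the supporting judgment is made by an LLM)."""
    return sum(supported(c) for c in claims) / len(claims) if claims else 0.0

def context_relevance(context_sentences: List[str], relevant: Callable[[str], bool]) -> float:
    """Fraction of retrieved context sentences judged relevant to the question."""
    return (sum(relevant(s) for s in context_sentences) / len(context_sentences)
            if context_sentences else 0.0)

def answer_relevance(similarities: List[float]) -> float:
    """Mean similarity between the original question and questions regenerated
    from the answer (similarities would come from an embedding model)."""
    return sum(similarities) / len(similarities) if similarities else 0.0
```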
While effective, several limitations suggest directions for future work. The entity extraction stage, referred to as RE-NER in this study, is implemented through relation-aware prompting but does not include an explicit relation extraction module. Extending the schema to encode links among entities could further stabilize retrieval for multi-step synthesis procedures. Additionally, the current database focuses primarily on experimental synthesis information, and theoretical data such as DFT calculations or computational screening results are not included. Since the paper collection targeted experimental literature, the current framework may have limitations in answering questions related to theoretical details such as functional choices, U values, solvation models, and adsorption energies. Incorporating theoretical fields could enhance the system's ability to answer a broader range of scientific questions, although careful design would be required to manage potential ambiguity or noise arising from the integration of heterogeneous data types. We plan to address theoretical data integration in future work. Furthermore, unit variations in scientific literature (e.g., mA cm−2 vs. A g−1, mV vs. V) are not standardized in the current pipeline. When users search for specific units, the system may not retrieve all relevant results due to these variations. Comprehensive unit normalization would require complex pre- and post-processing pipelines that account for various conventions and conversion factors, which we plan to address in future work.

Drawing from these current limitations and our overall development experience, we emphasize that establishing standardized normalization procedures is necessary when extracting data from materials science literature. To guide future research and database construction, we propose three key best practices for domain-specific data normalization. First, material names should be converted into standardized molecular formulas to resolve ambiguities arising from diverse naming conventions and abbreviations (e.g., standardizing terms like “NiFe LDH” and “NiFeOOH”). Second, consistent unit normalization must be implemented for quantitative metrics, as variations in reporting units (e.g., mA cm−2 vs. A g−1 for current density) can easily lead to context mismatches during the retrieval stage; a minimal sketch of such unit handling is given below. Finally, maintaining hierarchical relationships between materials and their experimental conditions is critical to preserve the relational coherence of the extracted data. Implementing these practices can significantly reduce noise in structured databases and provide a more reliable foundation for LLM-based scientific reasoning.

The structured database contains 2343 papers after the three-stage processing and filtering. This study focuses on water-splitting catalysis, but the same design can transfer to other materials domains by adapting the label space and entity types.41
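As referenced above, a minimal illustration of the unit-normalization best practice is given below; the conversion table is a small assumed subset, and areal (mA cm−2) and gravimetric (A g−1) current densities are deliberately kept as distinct quantities because interconverting them requires the catalyst mass loading.

```python
from typing import Tuple

# Assumed prefix conversions toward base units; this table is illustrative,
# not the pipeline's actual (future) normalization rules.
UNIT_SCALE = {
    "mV": ("V", 1e-3),
    "mA cm-2": ("A cm-2", 1e-3),
    "mA g-1": ("A g-1", 1e-3),
}

def normalize_quantity(value: float, unit: str) -> Tuple[float, str]:
    """Rescale a quantity to its base unit when a conversion factor is known;
    unknown units are returned unchanged."""
    if unit in UNIT_SCALE:
        base_unit, factor = UNIT_SCALE[unit]
        return value * factor, base_unit
    return value, unit

print(normalize_quantity(250.0, "mV"))       # -> (0.25, 'V')
print(normalize_quantity(10.0, "mA cm-2"))   # -> (0.01, 'A cm-2')
```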
Although water-splitting catalysis was used as a representative proof-of-concept domain, the approach is general and applicable to other experimental domains that demand contextual reasoning and integration of knowledge from multiple sources. This domain-aligned LLM paradigm provides a practical route toward trustworthy domain-specialized AI assistants that can support data-driven discovery, automated literature analysis, and hypothesis generation across diverse experimental materials research fields.
:1 ratio before generating the final answer. For Q/A answer generation and QR-RAG operations, we use gpt-4o-2024-11-20. For automatic verification of incorrect or insufficient answers, we use gpt-3.5-turbo-0125. Retrieval uses top-k = 5.
\[ \text{Context relevance}(q, c) = s \]
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d6dd00028b.
This journal is © The Royal Society of Chemistry 2026