Open Access Article
Hao-Tian Wangab, Xuefeng Baiab, Zhiling Zhengc, Xin Zhangab, Ruipeng Jinab, Hao-Tian Anab, Zheng-He Xieabd, Xiu-Liang Lv*ab and Jian-Rong Li*ab
aState Key Laboratory of Materials Low-Carbon Recycling, Beijing University of Technology, Beijing 100124, China. E-mail: jrli@bjut.edu.cn; lvxiuliang@bjut.edu.cn
bDepartment of Chemical Engineering, College of Materials Science and Engineering, Beijing University of Technology, Beijing 100124, China
cDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
dBeijing Energy Holding Co., Ltd, Beijing 100022, PR China
First published on 12th February 2026
The interdisciplinary nature of redox flow batteries (RFBs), spanning chemistry, materials science, and engineering, has led to a vast and fragmented body of research, hindering the efficient synthesis of knowledge. An intelligent question-answering system is therefore essential to organize this dispersed knowledge, enhance information retrieval, and lower the barrier to comprehensive understanding. In this study, we leveraged the natural language processing capabilities of large language models (LLMs) and the structured nature of knowledge graphs (KGs) to establish a chat model for the field of RFBs, named Chat-RFB. By analyzing 5353 articles related to flow batteries and deconstructing their text content, we learned contextual relationships and generated 164 232 nodes connected by 853 939 relationships. This process enhances the professional domain knowledge question-answering ability of LLMs. Given the limited research on evaluating model responsiveness in the flow battery field, we conducted model performance evaluations using both choice and non-choice questions. The results indicate that, by incorporating a professional knowledge base, Chat-RFB raised the level of professional domain knowledge. Choice question accuracy: Chat-RFB 94.9%, DeepSeek-v3 90.9%, GPT-4o 90.7%, Qwen-Max 90.4%, and Gemini-2.5-Flash 91.1%. Non-choice question accuracy: Chat-RFB 93.3%, DeepSeek-v3 73.3%, GPT-4o 68.9%, Qwen-Max 75.6%, and Gemini-2.5-Flash 86.7%.
Large Language Models (LLMs) exhibit strong language comprehension abilities, having been pre-trained on extensive corpora, which enables them to automatically extract,21–23 semantically analyze,24 and logically reason over25 literature content in response to users' natural language queries. Owing to their generalization, multitasking capabilities, and contextual understanding, LLMs hold promise for integration into agentic systems,26 where they can act as intelligent agents to assist with scientific research.27,28 However, LLMs face critical limitations in scientific applications. Their training on general-knowledge datasets limits their expertise in specialized domains.29,30 Furthermore, the finite context window of LLMs restricts the amount of information they can process at once, which can lead to significant information loss or "contextual forgetting" during the analysis of long documents. Conventional search processes are similarly limited, often retrieving only superficial information from article abstracts. To address these challenges, Retrieval-Augmented Generation (RAG)31 has emerged. RAG enables a system to precisely retrieve the most relevant information fragments from a knowledge base based on user input; the LLM then uses the retrieved information as context to generate more accurate, fact-based, and enriched responses. For example, ChemReactSeek32 is an artificial intelligence platform for heterogeneous hydrogenation reactions built using a text-vectorization-based RAG method. RAG frameworks that incorporate Knowledge Graphs (KGs) have emerged as a particularly promising solution.33–39 While any form of structured data can enhance information density for LLMs, KGs are uniquely suited because they can model the complex, multi-relational nature of scientific knowledge, a capability not fully matched by simpler structured formats.
KGs utilize graphical models to delineate entities, concepts, and their interrelationships, providing a clear logical structure that is essential for the deep reasoning and relationship traversal required in scientific question answering, thus facilitating the understanding of complex concepts.37,40–42 In addition, by condensing the key information extracted from the full text, KGs help surface important node information within a limited space and accelerate scientific discovery. In particular, the integration of LLMs with KGs has been explored in fields such as medicine,43 materials science,35,44 and chemistry.33,39 However, constructing an end-to-end framework that spans automated knowledge extraction, domain-specific knowledge graph construction, and RAG-integrated question answering, all supported by systematic evaluation, remains a cutting-edge challenge. Our work is dedicated to precisely this endeavor. By implementing and validating such a comprehensive framework within the field of redox flow batteries, we demonstrate its potential as a next-generation assistant for scientific research.
To address this gap, we developed Chat-RFB. This domain-specific intelligent assistant integrates an LLM with a structured knowledge base, enabling accurate and efficient retrieval of knowledge related to flow batteries. In this study, we employed an LLM to parse text and extract keywords from an extensive corpus of over 5000 relevant publications in a high-throughput manner. The procedure yielded a KG comprising 164 232 nodes and 853 939 relationships, with each article's Digital Object Identifier (DOI) serving as a unique identifier. We developed an automated test set to evaluate our system's capacity for expert-level analysis and summarization within the flow battery domain. Testing revealed that the performance of our system, Chat-RFB, surpasses that of the native model. Moreover, because the KG captures node-level information from the full text, important details that cannot be recovered from summaries can still be queried, which effectively broadens the system's usage scenarios. The workflow of Chat-RFB is illustrated in Fig. 1. Technical implementation and applications are detailed in the subsequent sections.
From the Web of Science database, we exported the DOI information for all retrieved papers. Leveraging these DOIs, the full text of the literature was downloaded locally using high-throughput methods. Specifically, we employed a customized Python script utilizing the requests and langchain_text_splitters libraries to automate the retrieval of PDF files from publisher websites. Given the model's context token input limitations, long texts were segmented into manageable chunks. To achieve this, we utilized a character-based text splitter (CharacterTextSplitter), configured to divide the text on space characters. Each chunk was set to a maximum size of 20 000 characters, with an overlap of 500 characters between adjacent chunks. This approach ensures that semantic continuity is maintained across segment boundaries while respecting the model's input limits (Fig. S2 and S3). DeepSeek-v3 was then employed for high-throughput extraction of node and relationship information from these segmented text paragraphs. For the convenience of subsequent retrieval, we used prompt engineering to constrain the LLM to attach label information to node outputs, which facilitates retrieval and increases the information content of nodes. The task and output format are defined in the prompt (Fig. S1), and a formatted JSON file is generated (Fig. S4 and S5). To quantitatively assess the quality of the information extraction, we established a rigorous manual evaluation process conducted by domain experts. We framed the task as a decision-making process: for each potential piece of knowledge in the source text, the model must decide whether to extract it as an accurate and relevant relationship. This framing allows for the application of a standard evaluation framework based on the concepts of true positives, false positives, and false negatives.
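The chunking step above can be sketched as a minimal reimplementation of the splitter logic. This is an illustrative sketch, not the production code: the actual pipeline uses CharacterTextSplitter from langchain_text_splitters, while this version only reproduces the stated parameters (20 000-character chunks, 500-character overlap).

```python
def chunk_text(text: str, chunk_size: int = 20_000, overlap: int = 500) -> list[str]:
    """Split text into overlapping fixed-size chunks, mimicking the described
    CharacterTextSplitter configuration (character-based, fixed overlap)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # step back by `overlap` so adjacent chunks share boundary context
        start = end - overlap
    return chunks
```

The overlap ensures that a sentence straddling a chunk boundary appears intact in at least one chunk before the segments are passed to the LLM for node and relationship extraction.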
The evaluation was performed by two researchers with expertise in the RFB field, who independently assessed a random sample of 500 extracted node-relationship triplets against their original source literature. Any disagreements were resolved through discussion to reach a consensus. The evaluation criteria were defined as follows:
• True positive (TP): an extracted relationship is considered a TP if it is both factually accurate according to the source text and represents a meaningful, key piece of information relevant to the RFB domain.
• False positive (FP): an extracted relationship is considered an FP if it is factually incorrect, a misinterpretation of the source text, or a hallucination not present in the literature.
• False negative (FN): an FN occurs when a key, explicit piece of information clearly stated in the source text is not extracted by the model.
This framework allows us to calculate standard metrics such as precision and recall to robustly evaluate the performance of our knowledge extraction pipeline.
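These metrics reduce to a short computation. The absolute counts below are hypothetical, inferred from the rates reported later in the Results (97.8% TP, 0.2% FP, 2% FN over a 500-triplet sample):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)  # fraction of extracted triplets that are correct
    recall = tp / (tp + fn)     # fraction of key facts that were extracted
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 500 sampled triplets: 489 TP, 1 FP, 10 FN (counts implied by the reported rates)
p, r, f1 = precision_recall_f1(489, 1, 10)
```

With these counts the F1 score evaluates to roughly 0.9889, consistent with the value reported in the Results.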
Fig. 2 Multi-round dialogue context stitching method. To overcome the contextual limitations of the model, the retrieved data used in earlier turns is removed from the conversation history.
(1) Generate Cypher queries: first, the LLM extracts keywords from the user's question to identify relevant nodes or relationship information. It then constructs Cypher query statements to access the Neo4j database and retrieve relevant information.
(2) Execute queries and retrieve data: these Cypher queries are executed on the Neo4j database, retrieving data pertinent to the user's queries from the KGs.
(3) Generate answers: the LLM uses the retrieved KG data as a prompt to comprehend and analyze user questions, developing precise and professional responses.
(4) Generate new conversation content: due to the LLM's context length limitation, it is not possible to save an unlimited number of data query results from previous interactions as part of the conversation history. When starting new conversation content, the system deletes previous search data while retaining earlier conversation content. Upon receiving a new question, steps 1–3 are repeated.
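Steps 1 and 4 above can be sketched as follows. This is a minimal illustration under stated assumptions: the query template, the `kg_data` role tag, and the helper names are our own inventions for exposition, not the system's exact implementation.

```python
def build_cypher(keyword: str) -> tuple[str, dict]:
    """Step 1: turn an extracted keyword into a parameterized Cypher query
    (illustrative template; parameters are passed separately to the driver)."""
    query = (
        "MATCH (n)-[r]-(m) "
        "WHERE toLower(n.name) CONTAINS toLower($kw) "
        "RETURN n.name, type(r), m.name LIMIT 25"
    )
    return query, {"kw": keyword}

def trim_history(history: list[dict]) -> list[dict]:
    """Step 4: drop earlier KG retrieval payloads while keeping the
    user/assistant turns, so the dialogue fits the context window."""
    return [msg for msg in history if msg["role"] != "kg_data"]
```

In a real deployment the query string and parameter dict would be handed to the Neo4j Python driver (e.g. `session.run(query, params)`), and `trim_history` would run before each new question is appended to the conversation.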
LLMs are accessed via the OpenAI extension library API, with specific version details provided in Table S1. Apart from the context length limitation, all models share the same parameters.
(1) Excellent (completely correct): the answer fully meets the question's requirements, is accurate, logically clear, and perfectly resolves the questioner's doubt.
(2) Good (partly correct): the answer is somewhat correct but may only cover some of the question's key points or have slight inaccuracies.
(3) Normal (no obvious errors, but the answer is too broad and lacks professionalism): the answer has no obvious errors but is too general and lacks the necessary depth and focus on the core of the question.
(4) Poor (obvious errors exist): the answer contains obvious misinformation that may mislead the questioner and fail to solve the problem.
For the convenience of evaluating the model's capability, we consider answers that do not contain serious errors (with a level higher than poor) to be qualified.
The resulting knowledge graph contains 164 232 nodes and 853 939 relationship links.
As shown in Fig. S2–S5, the LLM can accurately identify the core elements of an article, including research content, research methods, and theoretical concepts. Through customized prompt engineering, this information was successfully converted into a structured JSON format.47 This step is crucial for the construction of a KG, as it directly affects the quality and accuracy of nodes in the graph.
To further ensure quality, we manually reviewed 500 nodes and their relationships, as extracted by the LLM.48 We evaluated the effectiveness of the LLM in extracting text entities during this process (Fig. S6). The model achieved a true positive (TP) rate of 97.8% (accurate and comprehensive extractions), a false negative (FN) rate of 2% (cases where key information was missed), and a false positive (FP) rate of 0.2% (inaccurate extractions). From this comprehensive evaluation, the F1 score was as high as 0.9889. Given the prompt engineering, the LLM performed well in extracting text information from flow battery articles, and the pipeline was subsequently used in the large-scale automated extraction process. We then annotated the information in bulk and organized it into the Neo4j database, ultimately obtaining an LLM intelligent system that integrates KGs. For more detailed information, refer to the source code.
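Loading the extracted JSON into Neo4j can be sketched as follows. The triplet field names and labels here are hypothetical, assumed from the described output format; the placeholder DOI is illustrative, and parameters are passed separately to avoid query injection.

```python
def triplet_to_cypher(t: dict) -> tuple[str, dict]:
    """Build a parameterized MERGE statement for one extracted triplet,
    tagging the relationship with its source article's DOI."""
    query = (
        f"MERGE (h:{t['head_label']} {{name: $head}}) "
        f"MERGE (x:{t['tail_label']} {{name: $tail}}) "
        f"MERGE (h)-[rel:{t['relation']}]->(x) "
        "SET rel.doi = $doi"
    )
    return query, {"head": t["head"], "tail": t["tail"], "doi": t["doi"]}

# Hypothetical triplet shaped like the JSON output described above
example = {
    "head": "SH-ZIT", "head_label": "Material",
    "relation": "IMPROVES",
    "tail": "cycling stability", "tail_label": "Property",
    "doi": "10.xxxx/placeholder",  # placeholder, not a real DOI
}
query, params = triplet_to_cypher(example)
```

Using `MERGE` rather than `CREATE` deduplicates nodes across articles, so repeated mentions of the same material collapse into a single node while each relationship retains its own DOI provenance.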
Soft–hard zwitterionic trappers (SH-ZITs)20 are a class of novel additive materials for developing efficient, high-performance battery technology owing to their high solubility, stability, and electrochemical performance. Guided by effective prompt engineering (Fig. S10), the LLM identified the question content and used a simple Cypher statement to query information directly related to SH-ZITs in the KG. Fig. 3 is a schematic diagram of the relationship nodes centered on SH-ZITs, which condenses the content of the literature. The KG not only provides researchers with a comprehensive, multidimensional material-information database, but also offers strong data support for materials research and development. Researchers can access the chat system through natural language dialogue to obtain cutting-edge information in the field of flow batteries. This structured data representation enables researchers to intuitively grasp the relationships between materials and their potential value in various application scenarios.
The integration of KGs and LLMs to improve their responsiveness has been implemented in various fields.33,49,50 We used DeepSeek-v3 as the base model and the RAG method to construct KGs based on literature in the field of flow batteries. We generated Chat-RFB, which improved the model's information acquisition ability in the field of flow batteries and enhanced its ability to answer professional questions. As shown in the comparison between the left and right sides of Fig. 4, we compared the DeepSeek-v3 model enhanced by KGs with the native LLM and the commonly used LLMs: GPT-4o, Qwen-Max and Gemini 2.5 Flash. For specialized queries such as “What is SH-ZIT in flow battery?”, general-purpose LLMs often produce ambiguous or incorrect responses due to knowledge limitations or hallucinations. In contrast, Chat-RFB, which leverages Cypher queries on structured KGs, retrieves accurate definitions and source references directly from the literature. This structured approach significantly reduces the risk of hallucinations, ensuring that responses are factually grounded and traceable to verified sources.
It is worth noting that, compared to the web-search-enhanced models DeepSeek-v3 (Web Search) and GPT-4o (Web Search), Chat-RFB demonstrates higher information accuracy. The specific Q&A process is shown in Fig. S13–S19. When asked about experimental details of SH-ZIT, such as the Gaussian simulation calculation method used in the literature, web search cannot provide detailed experimental data because it retrieves only surface-level information. In contrast, Chat-RFB can obtain the parameter data for DFT calculations through real-time database queries and provide accurate, effective answers. This is very meaningful for researchers conducting experimental comparisons under the same conditions. This example highlights how KG-enhanced LLMs improve scientific accuracy, making them more reliable for domain-specific research applications. The process showcases how the KG's explicit modeling of relationships, from a material (‘SH-ZIT’) to its source literature (‘DOI’), and from that literature to its specific ‘calculation methods’, enables the system to answer complex, multi-step queries. Such queries are often intractable for models relying solely on unstructured text or less relational data structures. Chat-RFB can now provide accurate information sources, effectively addressing the limitations of LLMs in citation and source tracing. As users continue to query the model, it can further understand the context, conduct deep searches, and provide highly relevant literature indexes. The nearly 4% improvement in choice questions suggests that Chat-RFB helped refine factual accuracy, but its impact was limited since many answers were already within the LLMs' pretrained knowledge.
In contrast, the 16–22% gain in non-choice questions highlights the key role of KGs in reducing hallucinations and improving response completeness. Unlike choice tasks, non-choice questions require retrieval, synthesis, and articulation of specialized knowledge, tasks at which purely pretrained models often struggle. By leveraging structured retrieval, Chat-RFB ensures that responses are grounded in verified literature, enhancing accuracy. This indicates that KG-enhanced LLMs excel in complex reasoning tasks, while their impact on fact-based recall is more modest.
Subsequently, we used 450 choice questions to quantitatively analyze the performance improvement of the Chat-RFB model. As shown in Fig. 5a, the current general-purpose models perform similarly, and with the assistance of KG data, Chat-RFB achieves an accuracy of 94.9% (Table S3). Moreover, because LLMs are commonly used in natural language dialogue, we prepared not only choice questions with fixed answers but also a test set of 45 non-choice questions, reviewed by domain experts. Fig. 5b presents the performance comparison of general LLMs and the KG-optimized Chat-RFB on non-choice questions. With the guidance of KG data, the hallucination of LLMs was effectively suppressed, and the model qualification rate reached 93.3%. Beyond overall qualification rates, a closer look at fully correct (“excellent”) responses further highlights Chat-RFB's advantage. Among the 45 non-choice questions, it achieved 27 fully correct answers (60.0%, Table S2), significantly outperforming the baseline models, which each scored below 30% in this category. This suggests that Chat-RFB not only improves general accuracy but also enhances the precision and completeness of responses. The substantial gap indicates that KG integration helps the model generate more reliable, domain-specific answers, reinforcing its effectiveness in handling complex scientific inquiries.
Chat-RFB is cleanly decoupled into two core modules: a persistent knowledge base and a replaceable language model interaction layer. This design separation makes swapping underlying LLMs a standardized engineering task rather than a disruptive system overhaul. To address the risk of specific API versions being deprecated or becoming inaccessible in the future, we have planned a clear migration path:
Migrating to other commercial LLM APIs: we handle interactions with LLMs through a dynamic API invocation module. This module essentially acts as a wrapper layer calling different Python libraries (e.g., OpenAI, anthropic, google.generativeai). We designed a Priority and Fallback mechanism: the system first attempts to invoke the preferred model (currently DeepSeek-v3). If the API call fails (e.g., due to network errors, API deprecation, or access permission changes), the module automatically catches the exception and seamlessly switches to fallback models (such as GPT-4, Gemini Pro, etc.) in a predefined priority order, ensuring service continuity.
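The priority-and-fallback mechanism can be sketched as follows. The model names and the callable interface are illustrative assumptions for exposition; the stand-in backends simulate an outage rather than calling real APIs.

```python
def call_with_fallback(prompt: str, backends: list[tuple[str, callable]]):
    """Try each backend in priority order; on any exception, fall through
    to the next one so the question-answering service keeps responding."""
    errors = {}
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # network error, API deprecation, auth change...
            errors[name] = exc
    raise RuntimeError(f"all model backends failed: {list(errors)}")

# Illustrative stand-ins for real API clients
def deepseek(prompt):
    raise ConnectionError("API unreachable")  # simulated outage of the preferred model

def gpt4(prompt):
    return f"answer to: {prompt}"  # fallback backend responds

used, reply = call_with_fallback(
    "What is SH-ZIT?", [("deepseek-v3", deepseek), ("gpt-4", gpt4)]
)
```

Because the knowledge graph is a persistent asset independent of any API, only this thin invocation layer needs to change when a provider deprecates an endpoint.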
Furthermore, it is worth emphasizing that during the knowledge extraction phase, LLMs are primarily used for one-time, offline batch processing. The outcome—the constructed knowledge graph—is an independent and persistent asset. Real-time API dependency is primarily manifested in the question-answering generation phase, which is precisely the core issue addressed by our modular and dynamic invocation mechanism.
However, because the KG currently covers only relationships within the flow battery field, in order to effectively capture the model's key content, and has not yet incorporated knowledge from other domains, the system remains at the level of summarizing known knowledge and lacks innovative thinking on related topics.
The constructed knowledge graph comprises 164 232 nodes and 853 939 relationships. By leveraging KGs, Chat-RFB demonstrates significantly improved performance, achieving 94.9% accuracy in specialized question-answering tasks while reducing hallucinations compared to general LLMs. To quantitatively validate these results, we also designed a novel evaluation method combining choice and non-choice questions to assess the model's understanding of complex scientific problems. Functionally, Chat-RFB enhances literature retrieval, knowledge structuring, and automated reasoning, proving to be a valuable tool for energy storage research that, in comparison with online-search-augmented LLMs, shows a stronger ability to find details in professional fields. The combination of KGs and LLMs represents a crucial step forward for artificial intelligence in scientific research. The future of Chat-RFB involves expanding the knowledge base, integrating cross-domain expertise, ensuring real-time updates, and applying the system to other energy storage technologies. It is anticipated that such an integrated system will offer automated and intelligent support for scientific research, production, and customized applications, strengthening its role as an AI-driven research assistant and accelerating progress in sustainable energy storage.
Supplementary information: supplementary material 1: the LLM information and prompt word information used in this study, as well as examples of KG invocation. Supplementary material 2: the test set table used in this study includes choice and non-choice questions, with each row containing questions, options (choice question), answers, model answers, and evaluation results. See DOI: https://doi.org/10.1039/d5dd00494b.
This journal is © The Royal Society of Chemistry 2026