Zichun Zhou,ab Han Zhang,ab Chi Song,a Chen Ming*ab and Yi-Yang Sun*ab
aState Key Laboratory of High Performance Ceramics, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai, 201899, China. E-mail: chenming@mail.sic.ac.cn; yysun@mail.sic.ac.cn
bUniversity of Chinese Academy of Sciences, Beijing, 100049, China
First published on 2nd October 2025
Large language models have been extensively employed in many aspects of scientific research, yet their performance is often limited by gaps in highly specialized knowledge. To bridge this divide, in this perspective we take phosphor materials for white LED applications as a model system and construct a domain-specific knowledge base that couples retrieval-augmented generation with a numerical-querying model context protocol. By automatically extracting and structuring data from more than 5400 publications, including chemical compositions, crystallographic parameters, excitation-emission wavelengths, and synthesis conditions, we build an artificial-intelligence agent that delivers both broad semantic search and exact parameter lookup, with each answer accompanied by verifiable references. This hybrid approach mitigates hallucinations and improves recall and precision in expert-level question-answering. Finally, we outline how linking this curated corpus to lightweight machine-learning models and even automated experimental synthesis facilities can close the loop from target specification to experimental validation, offering a blueprint for accelerated materials discovery.
The continued evolution of lighting technologies demands the development of new phosphors with advanced features, such as a broader color gamut, high quantum efficiency and excellent thermal stability. Recently, violet-light-excited phosphors have garnered significant attention for surpassing conventional blue-light-excited systems, as they hold promise for improving color rendering7 and reducing the eye-health risks associated with a high content of blue light.8 Traditionally, the search for novel phosphors has been guided by empirical guidelines based on crystal field theory and existing experimental results.9,10 More recently, the field has been increasingly embracing data-driven discovery, leveraging computational tools and machine learning to accelerate the identification and optimization of next-generation phosphors.11–16
However, the design and development of phosphors face many challenges at the level of computational simulation. For example, the 4f–5d electronic transitions of the rare-earth ions are influenced by complex physical processes, such as crystal field splitting, electron–phonon coupling and the Jahn–Teller effect.17,18 Consequently, the positions of their energy levels are sensitive to the local crystal environment and hard to fully capture by empirical rules. In this sense, density functional theory (DFT) based first-principles calculations have become the workhorse method in this field, but still face the challenge of treating the strong correlation effect of 4f electrons of the lanthanide ions.13,14,19 Recently, machine learning methods have been adopted to predict properties of materials.20–22
With the rapid development of large language models (LLMs), their applications have broken through the scope of traditional text processing. LLMs now demonstrate potential for constructing domain-specific intelligent systems and have attracted increasing attention in interdisciplinary areas such as information mining, knowledge reasoning and scientific discovery workflows.23–25 Compared with traditional models, LLMs possess multimodal comprehension and language generation capabilities, which provide new possibilities for building domain-specific intelligent systems based on the literature and databases.
However, the direct application of LLMs to precision-driven scientific domains, such as rare-earth-doped luminescent materials, is hindered by several key bottlenecks. First, at the data level, these models face two primary challenges: the lack of high-quality, specialized datasets and the temporal cutoff inherent in their training. The latter means that they lack knowledge of the latest scientific discoveries and experimental data that emerged after their training was completed, preventing timely updates on the state of the art. Second, at the algorithmic level, the general hallucination problem of LLMs evolves into a more critical challenge in scientific applications: a lack of grounding in physical and chemical laws. Without an understanding of the underlying physical principles, a model may propose synthesis routes that violate the laws of thermodynamics or suggest physically unstable material compositions.26 Furthermore, their precision in quantitative prediction is also severely lacking: while LLMs excel at qualitative descriptions, they perform poorly when predicting key performance parameters such as spectral peak positions and quantum yields. In this perspective, we discuss strategies to address these issues and construct a basic framework, as shown in Fig. 1. This framework aims to implement a specialized intelligent agent based on LLMs for the design of rare-earth-doped luminescent materials.
(1) In terms of data dependency, RAG relies on external knowledge bases, which consist of purposely prepared documents.26 The collection of literature in PDF format from a specialized field could be directly used as the knowledge base for RAG. For better performance, however, structured literature files as described in Section 3 could be used. In contrast, fine-tuning typically requires a substantial dataset of task-specific labeled data (e.g., question–answer pairs), similar to training an LLM.29–31
(2) In terms of computational cost, RAG does not require adjustment of the LLM parameters. Its main expenses lie in building a vector database via embedding models and performing retrieval during inference; the detailed implementation will be introduced in Section 4. RAG demands only modest hardware resources and incurs lower initial costs. In contrast, fine-tuning offers lower inference costs, but it entails much higher initial training costs than RAG. Moreover, as foundational LLMs are frequently updated, each new version often necessitates repeated fine-tuning, increasing resource consumption and maintenance complexity.32
(3) In terms of the knowledge updating mechanism, RAG leverages external knowledge bases built from domain-specific corpora to provide up-to-date and context-relevant information during inference. By contrast, fine-tuning integrates new corpora directly into the parameters (or weights) of the LLMs, allowing the model to internalize and generate knowledge from that vertical field without external retrieval during inference. In short, RAG updates the knowledge of the LLMs through dynamic retrieval, while fine-tuning does so through static parameter updates.26,27
(4) In terms of model performance, by acquiring information from external knowledge bases, RAG reduces the incidence of hallucinations of the LLMs through the retrieval process. It is worth mentioning that RAG generates responses that are traceable to the original literature, which makes it particularly suitable for scientific Q&A.33 In comparison, by integrating the new corpora into the parameters of LLMs, fine-tuning not only improves the model performance by reducing hallucinations, but also extends the generative capability of the LLMs to the specialized field.34,35
Firstly, when processing long, information-dense scientific papers, LLMs often face the problem of context loss. A paper's core arguments, key data, and experimental details—the information needed for a database—are often buried in the middle of the text. During retrieval, the model may excessively focus on the summary content at the beginning and end, thereby overlooking the core evidence that determines the study's validity and reliability. This leads to the extraction of incomplete data and the generation of one-sided or inaccurate insights. Secondly, the inherent rigor and complexity of scientific literature pose a huge obstacle to information extraction. These documents not only contain precise terminology and complex logical relationships, but also rely heavily on non-textual, structured data such as tables, figures, chemical structures, and mathematical equations to present key results. Current LLMs, which are primarily text-based, struggle to directly and accurately parse this multimodal information. This can easily lead to misinterpretation, distortion of data, or even groundless hallucinations, severely compromising the reliability of any database built upon it.
We take Eu2+-doped phosphors as a representative case to illustrate these obstacles. Over 50 years of research on Eu2+-doped phosphors has produced a wealth of experimental data. However, these results are scattered across more than 400 academic journals, as illustrated in Fig. 2, creating significant barriers to systematic integration. The core challenge lies in the extreme heterogeneity of this literature: record formats, terminology, and measurement methods vary widely, leading to severe information fragmentation. This "knowledge silo" phenomenon hinders the development of a comprehensive understanding of the field. More critically, key performance parameters, such as excitation/emission wavelengths, quantum efficiencies, and thermal quenching temperatures, are rarely presented in a structured format. Instead, they are typically embedded within unstructured text, figure captions, footnotes, or even supplementary information, which severely impedes automated extraction and large-scale analysis. To compound the issue, the reported properties for the same material often vary between publications, further undermining the overall consistency and credibility of the data.
Therefore, building an efficient and reliable database from scientific literature cannot be achieved by simply feeding raw documents to a model. A more viable path is to implement a dedicated information extraction and knowledge structuring stage beforehand. By using a data mining approach to transform relationships and core data from text and tables into a structured knowledge base, we can effectively overcome the aforementioned drawbacks and ensure the accuracy, completeness, and reliability of the data foundation for any subsequent RAG system or analysis.
In the past, scientific data mining primarily relied on two approaches: manual annotation and rule-based natural language processing (NLP) systems. Manual annotation is inefficient and prone to subjective bias, making it unsuitable for meeting the growing demand to process high-throughput scientific literature. Rule-based systems, such as ChemDataExtractor,36 OSCAR437 and ChemTagger,38 possess basic term recognition capabilities. However, they struggle with complex scientific texts that require the interpretation of implicit information, cross-sentence relationships and contextual reasoning. Moreover, these systems depend heavily on domain experts for their construction and maintenance, resulting in high costs and limited portability across domains.
Due to the limitations of traditional methods, generative approaches based on LLMs have emerged as a promising direction for scientific information extraction in recent years.39–41 Our methodology is built upon this foundation, with a process that begins with the structured preprocessing of the literature. First, we employ optical character recognition (OCR) and layout analysis tools to batch-convert the original PDF documents into Markdown format. This step is crucial as it preserves the document structure, including headings, paragraphs, tables, and lists, providing a high-quality text source for the subsequent precise information extraction. Next, we proceed to the core knowledge extraction phase, in which an LLM, guided by purpose-designed prompts, analyzes the Markdown text. For the phosphor domain, our prompts are designed to automatically extract several key categories of information:
(1) Material compositions: for example, the chemical formula of the host material (e.g., Y3Al5O12, CaAlSiN3), the activator ions (e.g., Ce3+, Eu2+) and their doping concentrations, as well as any potential co-dopants or sensitizer ions.
(2) Synthesis methods: identifying the specific preparation process, such as the high-temperature solid-state reaction method, co-precipitation, or the sol–gel method, and extracting key process parameters like sintering temperature, holding time, and the use of a reducing or oxidizing atmosphere.
(3) Performance parameters: precisely capturing core optical and thermal performance data, including the peak wavelengths of the excitation and emission spectra (λex, λem), internal and external quantum efficiency (IQE/EQE), color coordinates (CIE), and thermal quenching behavior (e.g., thermal stability at 150 °C).
Finally, these extracted discrete information elements are systematically organized into standardized structured knowledge units (SKUs). Each SKU can be considered a digital profile for a specific phosphor sample, clearly documenting the material's entire identity–synthesis–performance information chain in a key-value format. These standardized SKUs serve as the cornerstone for building our phosphor knowledge database, enabling efficient support for complex downstream queries and Q&A applications. The overall process is illustrated in Fig. 3.
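To make the key-value idea concrete, the following is a minimal, hypothetical example of one such SKU in Python; the field names and values are illustrative stand-ins, not the actual schema used in this work:

```python
# A hypothetical structured knowledge unit (SKU) for one phosphor sample.
# Field names and values are illustrative, not the schema used in this work.
sku = {
    "host": "CaAlSiN3",
    "activator": "Eu2+",
    "doping_concentration": 0.01,  # mole fraction
    "synthesis": {
        "method": "high-temperature solid-state reaction",
        "sintering_temperature_C": 1600,
        "holding_time_h": 4,
        "atmosphere": "reducing (N2/H2)",
    },
    "performance": {
        "lambda_ex_nm": 450,
        "lambda_em_nm": 650,
        "IQE_percent": 85,
    },
    "source": {"doi": "10.xxxx/placeholder", "year": 2020},
}

# Keys that every record must carry to document the full
# identity-synthesis-performance information chain.
REQUIRED_KEYS = {"host", "activator", "synthesis", "performance", "source"}

def validate_sku(unit: dict) -> bool:
    """Return True if the extracted record covers the full information chain."""
    return REQUIRED_KEYS.issubset(unit)

print(validate_sku(sku))  # True
```

A completeness check of this kind can be applied to every extracted record before it enters the knowledge base, so that incomplete extractions are flagged rather than silently ingested.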
This method not only effectively reduces the interference of irrelevant information with the generative model, but also enhances the readability, controllability and embedding quality of the data. The advantages of structured processing are mainly reflected in two aspects: (1) improving the relevance and precision of information retrieval: similarity calculation based on structured semantic units significantly improves the retrieval recall rate and matching effect; (2) enhancing contextual support: compared with traditional text input, structured data provides clearer context for LLMs, improving the accuracy and soundness of the generated content, which is especially suitable for multi-round Q&A and cross-document integration tasks.
Furthermore, a hybrid system with RAG and a model context protocol (MCP) is constructed, which combines the high recall capability of vectorized semantic search with the high-precision matching capability of queries on the structured knowledge base. This enables a layered information retrieval process that transitions from fuzzy matching to precise extraction.
Based on the structured knowledge base, we built a vector database and indexing system. To improve semantic matching, we adopted Alibaba's open-source embedding model Qwen3-Embedding-8B,42 currently the state-of-the-art open-source embedding model according to the HuggingFace MTEB leaderboard.43 During querying, user questions are vectorized using the same model and matched against the vector database, as shown in Fig. 4.
To improve retrieval accuracy, we used a hybrid scoring mechanism that combines keyword similarity and vector cosine similarity through weighted fusion. This approach balances semantic understanding with precise keyword alignment, reducing false positives caused by overgeneralization. A similarity threshold is also applied to filter out irrelevant results, ensuring that the top-k retrieved documents are semantically relevant. The system shows promising recall and efficiency across multiple test cases, suggesting the potential of the RAG framework for scientific Q&A applications.
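The weighted-fusion scoring described above can be sketched as follows. This is an illustrative implementation only: the actual fusion weights, keyword scorer and threshold are not specified in this work, so the Jaccard keyword score and alpha = 0.7 below are assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_similarity(query_terms, doc_terms):
    """Jaccard overlap as a simple stand-in for the keyword score."""
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / len(q | d) if q | d else 0.0

def hybrid_score(q_vec, d_vec, q_terms, d_terms, alpha=0.7):
    """Weighted fusion: alpha weights the semantic (vector) component."""
    return alpha * cosine_similarity(q_vec, d_vec) + (1 - alpha) * keyword_similarity(q_terms, d_terms)

def retrieve(query, docs, k=5, threshold=0.3):
    """Return the top-k documents whose fused score clears the similarity threshold."""
    scored = [(hybrid_score(query["vec"], d["vec"], query["terms"], d["terms"]), d)
              for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for s, d in scored[:k] if s >= threshold]
```

The threshold filter runs after the top-k cut, so an overgeneralized semantic match with weak keyword support is dropped rather than returned as a false positive.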
In the answer generation phase, prompts are constructed from the retrieved document blocks to supplement the user queries. Prompt engineering techniques can be used to guide the foundational LLM to generate professional answers, and customized outputs can be required. For example, the model can be assigned the role of an expert in the field of rare-earth doped phosphors, with an adjustable format and degree of scientific rigor for its answers. After weighing cost against performance, we selected Deepseek-R1,44 given its relatively strong reasoning capabilities and support for the chain-of-thought mechanism, which help produce more coherent and insightful responses. An example of the actual Q&A output is illustrated in Fig. 5.
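The prompt-construction step might look like the following sketch; the template wording and field names are hypothetical, not the prompts actually deployed in the system:

```python
def build_prompt(question, retrieved_chunks, role="expert in rare-earth doped phosphors"):
    """Assemble a grounded prompt from retrieved document blocks (illustrative template).

    Each chunk is a dict with 'citation' and 'text' keys; numbering the sources
    lets the model cite them, keeping answers traceable to the literature.
    """
    context = "\n\n".join(
        f"[Source {i + 1}: {c['citation']}]\n{c['text']}"
        for i, c in enumerate(retrieved_chunks)
    )
    return (
        f"You are an {role}. Answer strictly from the sources below, "
        "cite them as [Source n], and state explicitly if the answer is not present.\n\n"
        f"=== Retrieved literature ===\n{context}\n\n"
        f"=== Question ===\n{question}"
    )
```

The assembled string would then be sent to the generative model; role and formatting requirements can be adjusted per query.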
To validate the effectiveness and reliability of our RAG system, we designed a multi-faceted evaluation framework targeting two critical capabilities: processing of novel information and precision of knowledge updates.
(1) To evaluate our RAG system's ability to process novel information, we constructed a specialized test corpus using content published after the baseline LLM's knowledge cutoff. This corpus consists of eight recent phosphor-related papers from 2025, sourced from journals such as Advanced Optical Materials. Against this corpus, we crafted 40 questions, meticulously stratified into three types to assess distinct capabilities: precise numerical extraction, recitation of experimental methods, and summarization or inferential tasks. This corpus was ingested into our RAG system, and all 40 questions were posed to both our system and a standalone Deepseek-R1 baseline.
A panel of domain experts then conducted a blind review of all outputs. Each response was scored on three core metrics, each on a three-point (0–2) scale:
(1) Accuracy: whether the core information in the answer is correct (2 = completely correct; 1 = partially correct; 0 = incorrect).
(2) Faithfulness: whether the answer is fully grounded in the provided literature (2 = entirely based on the source text; 1 = partially based, with some extrapolation; 0 = fabrication or contradiction).
(3) Completeness: whether the answer comprehensively addresses all aspects of the question (2 = complete; 1 = partial; 0 = missing key information).
The results, listed in Table 1, demonstrate a marked performance advantage for our system: our RAG model achieved an average score of 1.825 in both accuracy and faithfulness, significantly outperforming the baseline model's accuracy of 0.625. The fact that both systems provided complete answers indicates that the baseline model understood the questions.
Model | Accuracy | Faithfulness | Completeness
---|---|---|---
Baseline model | 0.625 | — | 2
RAG system | 1.825 | 1.825 | 2
(2) To evaluate the dynamic update capability of our system's knowledge base, we conducted an assessment experiment. The methodology involved augmenting the system's vector knowledge base with multiple synthetic knowledge entries to test its capacity for persistent knowledge integration. Each entry, representing a distinct fictitious fact, was injected as a standalone document. The evaluation was performed by querying the system with two sets of ten questions each: a relevant set directly related to the injected knowledge, and an irrelevant set on unrelated topics. We measured system performance using two core metrics: update success rate and knowledge stability rate.
The experimental results show that the system can absorb new knowledge, achieving an update success rate of 90%. The failure occurred when the system was asked about a recent technology; it presented both the old and new answers simultaneously, indicating a lack of definitive decision-making capability when handling potentially conflicting or outdated information. On the other hand, the system's knowledge stability was excellent. It was not influenced by the new information in any of the tests with irrelevant questions, achieving a knowledge stability rate of 100%.
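The two rates reduce to simple counts over the per-query outcomes. A minimal sketch, with outcome lists chosen purely to reproduce the reported 90% and 100% figures:

```python
def update_success_rate(outcomes):
    """Fraction of relevant queries answered using only the newly injected knowledge.

    Each outcome is 1 (new knowledge returned cleanly) or 0 (failure, e.g. old
    and new answers presented simultaneously).
    """
    return sum(outcomes) / len(outcomes)

def knowledge_stability_rate(outcomes):
    """Fraction of irrelevant queries left unaffected by the injected knowledge."""
    return sum(outcomes) / len(outcomes)

# Illustrative outcomes matching the reported results: one of ten relevant
# queries surfaced both the old and new answers; all irrelevant queries were stable.
relevant = [1] * 9 + [0]
irrelevant = [1] * 10
print(update_success_rate(relevant))        # 0.9
print(knowledge_stability_rate(irrelevant))  # 1.0
```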
In summary, the system possesses the ability to integrate new knowledge, and the introduction of this knowledge does not contaminate the pre-existing knowledge corpus. However, the experiment revealed the system's shortcomings in managing knowledge version conflicts and timeliness issues. To address this limitation, we plan to implement a more sophisticated arbitration mechanism in our future work. This will involve incorporating metadata such as publication dates and impact factors for knowledge sources and performing weighted calculations, thereby enabling the system to automatically identify and select the most authoritative or current information.
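Such an arbitration mechanism could, for instance, combine a recency decay with a normalized authority term. The weights, half-life and normalization below are illustrative assumptions for a possible design, not a settled implementation:

```python
from datetime import date

def arbitration_score(entry, today=date(2025, 10, 2),
                      w_recency=0.6, w_authority=0.4, half_life_years=5.0):
    """Score a knowledge entry by recency and source authority.

    All weights and the 5-year half-life are assumptions; 'impact_factor' is
    crudely normalized by capping at 10.
    """
    age_years = (today - entry["published"]).days / 365.25
    recency = 0.5 ** (age_years / half_life_years)  # exponential decay
    authority = min(entry.get("impact_factor", 0.0) / 10.0, 1.0)
    return w_recency * recency + w_authority * authority

def resolve_conflict(entries):
    """Among conflicting entries, pick the most authoritative/current one."""
    return max(entries, key=arbitration_score)
```

Given two conflicting entries, the resolver would then return a single answer instead of presenting old and new information side by side.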
Although a structured knowledge base is effective in improving retrieval accuracy and system efficiency, there are still limitations in handling exact numerical matches, as the knowledge base mainly relies on semantic similarity search. To balance accuracy and flexibility, we constructed a service system based on MCP and kept the complete PDF database as a supplementary resource. Through this hybrid search strategy over the structured knowledge base and the original documents, the system not only supports precise queries but can also cope with complex scientific Q&A scenarios that require divergent thinking or contextual reasoning.
(1) Precise query service: this service is engineered to overcome the numerical inaccuracies of traditional RAG, especially for querying specific data in fields like rare-earth doped phosphors (e.g., excitation/emission wavelengths). We selected MongoDB, a document-based NoSQL database, for its flexible schema and high scalability, which support real-time data updates. Its native JSON-like format is perfectly compatible with our structured semantic units, simplifying data parsing and manipulation. We encapsulated the database within an MCP server using the mcp-mongo-server module. This architecture enables highly accurate, database-level queries based on specific numerical ranges. Unlike conventional vector search, our approach transcends the top-k limitation, returning a complete set of all results that satisfy the query conditions.
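The range queries this enables correspond to standard MongoDB filter documents (the form passed to pymongo's `collection.find`). Since no live MongoDB instance is assumed here, the sketch below pairs an illustrative filter with a minimal in-memory evaluator of the same `$gte`/`$lte` semantics; the field names and example records are hypothetical:

```python
# An illustrative MongoDB filter for "emission peak between 520 and 560 nm";
# with pymongo this would be passed as collection.find(query).
query = {"performance.lambda_em_nm": {"$gte": 520, "$lte": 560}}

def get_path(doc, dotted):
    """Resolve a dotted field path the way MongoDB does."""
    for key in dotted.split("."):
        if not isinstance(doc, dict):
            return None
        doc = doc.get(key)
    return doc

def matches(doc, flt):
    """Minimal in-memory evaluator for $gte/$lte filters (stand-in for a live server)."""
    for path, cond in flt.items():
        value = get_path(doc, path)
        if not isinstance(value, (int, float)):
            return False
        if "$gte" in cond and value < cond["$gte"]:
            return False
        if "$lte" in cond and value > cond["$lte"]:
            return False
    return True

# Hypothetical records; unlike top-k vector search, every matching record is returned.
skus = [
    {"host": "Ba2SiO4", "performance": {"lambda_em_nm": 505}},
    {"host": "Lu3Al5O12", "performance": {"lambda_em_nm": 530}},
]
green = [s for s in skus if matches(s, query)]
print([s["host"] for s in green])  # ['Lu3Al5O12']
```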
(2) Multimodal visualization service: this service primarily provides visualization for crystal structures, implemented by integrating the JSMol tool. We have designed a dual-call process: when a user queries a crystal structure using a chemical formula, the model first calls the query server to precisely match the formula to its corresponding ICSD (inorganic crystal structure database) number. Subsequently, it calls the resource server to retrieve the CIF (crystallographic information file) for that number and completes the 3D visualization rendering. This not only significantly enhances the model's cross-modal understanding capabilities (from text to 3D images) but also establishes a standardized interface for integrating more modalities in the future, such as spectral diagrams and electron microscopy images.
Both servers are accessed through the MCP client, realized by the Cline plugin in VS Code; the overall system architecture is shown in Fig. 6. During an actual invocation, the LLM judges the task type according to the system prompts and selects the corresponding MCP server to initiate the request. An example of the actual Q&A output is illustrated in Fig. 7.
The operational flow of the system is as follows: (1) the user inputs the target performance requirements for the phosphors. (2) The system, combining LLM, RAG and MCP, queries the knowledge base for materials meeting these requirements; if none are found, it recommends potential candidate materials. (3) Lightweight machine learning models are employed to carry out performance predictions on the candidate materials, serving as a correction to the LLM-RAG-MCP system. (4) The system predicts possible synthesis pathways and integrates them with an experimental protocol and the attached equipment to enable intelligent material synthesis. If the synthesized material meets the target performance, the process concludes. Otherwise, the experimental results are fed back into the agent for further optimization, enabling a closed loop encompassing material prediction, design and execution of experimental synthesis, as well as a feedback mechanism.
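The four-step flow can be sketched as a loop. Every stage function below is a hypothetical stand-in, injected in place of the corresponding MCP service, not an implementation of it:

```python
def discovery_loop(target, services, max_iterations=3):
    """Sketch of the closed loop: retrieve, predict, synthesize, validate, feed back.

    'services' maps stage names to callables standing in for the MCP-database,
    MCP-machine-learning and MCP-experiment components.
    """
    for _ in range(max_iterations):
        # Step 2: query the knowledge base (LLM + RAG + MCP) for candidates.
        candidates = services["query"](target)
        if not candidates:
            return None
        # Step 3: lightweight ML screening as a correction to the LLM-RAG-MCP system.
        best = max(candidates, key=services["predict"])
        # Step 4: automated synthesis and characterization of the top candidate.
        result = services["synthesize"](best)
        if services["meets_target"](result, target):
            return result
        # Otherwise feed the experimental result back into the agent and retry.
        target = services["feedback"](target, result)
    return None
```

With real services attached, the loop terminates either with a validated material or after a bounded number of prediction-synthesis-feedback cycles.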
The implementation above depends on several key infrastructures and resources, all seamlessly integrated via MCPs: (1) a database of phosphor literature, which supports semantic search and knowledge extraction, enabling an understanding of existing research findings; (2) an experimental protocol library, which incorporates standardized process templates to support LLMs in automatically generating experimental procedures; (3) an automated experimental platform, which integrates transport robots with intelligent laboratory equipment to enable end-to-end automation from sample transfer to experimental execution and result collection; (4) a machine learning model library, which brings together both proprietary and open-source models. These models are designed to perform rapid screening and preliminary performance prediction of candidate materials in the early stages of discovery.
Based on the above resources, there is a clear division of functions within the agent: the MCP-database is responsible for extracting information related to the experimental objectives from the literature and recommending potential candidate materials; the MCP-machine-learning is responsible for predicting the key performance indicators of the candidate materials by invoking the models and completing the preliminary screening; the MCP-experiment automatically generates synthesis schemes for the candidate materials and translates them into commands that can be recognized by the experimental equipment; and finally, the robotic automated experimental hardware carries out the synthesis and characterization processes. Although the current work is still focused on the phosphor system, database and knowledge integration, the architecture shows versatility and scalability. In the future, once extended to a wider range of materials, the system is expected to reshape the materials research and development process: starting from the target properties, the AI agent will generate the material structure, predict the properties, plan the synthesis pathway and execute the experiments automatically, thus truly facilitating the realization of a "robotic scientist" and accelerating the progress towards unmanned laboratories and autonomous materials discovery. Some groups have already made efforts in this direction.24,45–48
This journal is © the Owner Societies 2025