Open Access Article
Xuefeng Bai ab, Zhiling Zheng c, Xin Zhang *ab, Hao-Tian Wang ab, Rui Yang ab and Jian-Rong Li *ab
aDepartment of Chemical Engineering, College of Materials Science & Engineering, Beijing University of Technology, Beijing 100124, P. R. China. E-mail: jrli@bjut.edu.cn; zhang.xin@bjut.edu.cn
bState Key Laboratory of Materials Low-Carbon Recycling, Beijing University of Technology, Beijing 100124, China
cDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
First published on 9th January 2026
Large Language Models (LLMs) have the potential to transform chemical research. Nevertheless, their general-purpose design constrains scientific understanding and reasoning within specialized fields like chemistry. In this study, we introduce MOFReasoner, a domain model designed to enhance scientific reasoning, using Metal–Organic Framework (MOF) adsorption as a case study. By employing knowledge distillation from teacher models and Chain-of-Thought (CoT) reasoning extracted from a corpus of over 8242 research articles and 500 reviews, we developed a domain-specific chemical reasoning dataset. We then fine-tuned the LLMs on this domain-specific chemical reasoning dataset together with general chemistry datasets and general reasoning datasets. The model's performance was evaluated across four tasks: experimental studies, chemical mechanisms, application scenarios, and industrialization challenges. MOFReasoner outperformed existing general-purpose models, such as GPT-4.5 and DeepSeek-R1. Furthermore, the model achieves prediction accuracy comparable to DFT, enabling material recommendations. This work underscores the potential of integrating domain-specific knowledge, CoT reasoning, and knowledge distillation in creating LLMs that support scientific inquiry and decision-making within the discipline of chemistry.
Despite the enhancement of expertise of LLMs through fine-tuning and RAG, their ability to tackle complex problems, especially those requiring multi-step chemical reasoning, remains insufficient. This limitation constrains their applications in areas such as materials design and chemical reasoning Q&A.23,24 Therefore, LLMs need to possess scientific reasoning abilities akin to those of scientists. Once they acquire such thinking skills, they can derive accurate conclusions through more rigorous logical deduction. These accurate conclusions will further enhance the performance of LLMs in areas such as materials design, performance prediction, and multi-objective optimization, enabling them to achieve the experimental paradigm of autonomous AI research in the future.
Chain-of-Thought (CoT) reasoning enhances LLMs by enabling structured, step-by-step logical inference, allowing them to answer chemical questions with improved accuracy and interpretability.25,26 This is particularly beneficial for tasks requiring multi-step reasoning, such as reaction prediction and experimental design, where breaking down intricate processes leads to more reliable and scientifically sound conclusions. However, the reasoning patterns of LLMs trained on general knowledge still differ significantly from those used in scientific research, as scientific reasoning often involves a process of making numerous hypotheses followed by verification. Therefore, it is crucial to further train LLMs in specialized domain knowledge and scientific reasoning. Building such a domain-specific reasoning model requires fine-tuning based on domain-specific CoT data. Knowledge distillation provides an efficient way to transfer domain expertise from larger parameter models or structured knowledge sources to small parameter models.27–29 By integrating CoT reasoning with knowledge distillation, LLMs can acquire not only more domain-specific knowledge but also domain-specific thinking. This approach enables them to achieve structured, step-by-step reasoning, leading to more reliable inference and more efficient knowledge utilization. Recently, ChemMatch and the ScholarChemQA dataset30 have highlighted the potential of lightweight, domain-specific models for chemical QA; in contrast, our work emphasizes equipping LLMs with multi-step scientific reasoning through literature-derived CoT data and knowledge distillation. It should be noted that while CoT reasoning improves interpretability and task performance, recent studies suggest that such outputs may reflect structured pattern generation rather than genuine scientific understanding.
As a highly designable class of three-dimensional materials, MOFs, which are widely used in adsorption separation,31 catalysis,32 and other fields,33,34 greatly benefit from AI assistance due to their vast potential for synthesizing a diverse range of materials. Herein, taking the field of MOF adsorption as an example, we extracted domain-specific reasoning pathways from scientific papers and refined them with the aid of large-parameter language models to construct a domain-specific CoT database and trained MOFReasoner (as shown in Fig. 1). The model can be found at https://huggingface.co/baixuefeng/ChemReasoner-7B. Specifically, through a hard-label knowledge distillation approach inspired by recent large-model compression studies,35 we conducted high-throughput analysis with large-parameter, long-text teacher LLMs, guiding them to extract and structure chain-of-thought reasoning from the literature, and organized the results into a domain-specific reasoning dataset. Additionally, we used these teacher models to transform existing chemical datasets into chemical reasoning datasets. By integrating these datasets with general CoT datasets, we constructed a reasoning model for chemistry named MOFReasoner. MOFReasoner, with its enhanced structured reasoning capabilities and effective integration of chemical knowledge, significantly outperforms ChatGPT and DeepSeek on a dataset consisting of four types of tasks in Q&A testing. Moreover, MOFReasoner can be further coupled with existing knowledge bases and knowledge graphs,36–38 and through its robust reasoning capability, it recommends materials that are consistent with DFT calculation results.
Fig. 2 Knowledge distillation from teacher models on the article: (a) distillation of research article knowledge using DeepSeek-V3; (b) distillation of review article knowledge using Qwen-Turbo.
In the case of review articles, which consist of experts' summaries of existing research and contain profound scientific insights, we employed the long-text model Qwen-Turbo to distill and summarize the scientific perspectives. We then transformed these insights into question–answer pairs (as shown in Fig. S12–S16). Subsequently, each question–answer pair was matched with the original review content as context and then presented to the LLM, which provided a detailed CoT process from multiple dimensions (as shown in Fig. 2b and S17–S21). The current pipeline processes only textual information; visual data such as adsorption isotherms, PXRD patterns, and microscopy images were not directly included, which may lead to partial underrepresentation of information conveyed exclusively through figures.
In addition to constructing a domain-specific dataset, we also utilized a general chemistry dataset and a general reasoning dataset to enhance the model's chemical knowledge and reasoning capabilities. The camel-ai chemistry dataset,39 which includes 20 000 chemistry questions across 25 topics, serves as an excellent foundation dataset for general chemistry knowledge. However, since this dataset includes only questions and answers without detailed problem-solving procedures, we applied DeepSeek-R1 (671B), a reasoning LLM, to generate the missing reasoning processes and thereby better equip the LLMs with comprehensive chemistry knowledge and CoT. As shown in Fig. S22, the LLMs delivered exhaustive reasoning processes through logical inference. The CoT dataset utilized STILL,40 a slow-thinking reasoning dataset, which did not require additional processing, as it is originally presented in the CoT format within the SFT structure (Fig. S23–S25).
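To make the data format concrete, the following sketch packs a question, a distilled reasoning trace, and a final answer into one supervised fine-tuning record. The Alpaca-style `instruction`/`input`/`output` field names and the `<think>` delimiters are assumptions for illustration; the exact template used here is documented only in the SI figures. The function name `to_sft_record` and the sample texts are hypothetical.

```python
import json


def to_sft_record(question: str, reasoning: str, answer: str, context: str = "") -> dict:
    """Pack one distilled sample into an Alpaca-style SFT record (assumed layout)."""
    # The reasoning trace is wrapped in <think> tags and placed before the answer,
    # so the student model learns to reason first and conclude afterwards.
    output = f"<think>\n{reasoning.strip()}\n</think>\n\n{answer.strip()}"
    return {
        "instruction": question.strip(),
        "input": context.strip(),  # optional source passage used as context
        "output": output,
    }


# Hypothetical sample, written only to show the structure of a record.
record = to_sft_record(
    question="Why can open metal sites increase CO2 uptake in a MOF?",
    reasoning=(
        "Open metal sites are coordinatively unsaturated and act as Lewis acids. "
        "The oxygen atoms of CO2 carry lone pairs and can donate electron density to them."
    ),
    answer="Open metal sites provide strong, localized binding sites for CO2.",
)
print(json.dumps(record, indent=2))
```

Keeping the reasoning and the answer in a single `output` string means a dataset that is already in CoT format, such as STILL, can be mixed in without a separate conversion step.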
Subsequently, we conducted a systematic validation of the dataset to ensure the quality and reliability of the training data. For datasets with clearly defined standard answers, we employed DeepSeek-V3 for answer verification to assess the reasoning accuracy and response quality of the LLMs (Fig. S26). Specifically, we compared the standard answers provided in the dataset with the responses generated by the LLMs, evaluating their consistency. If the generated answers matched the standard ones, we considered the reasoning process to be reliable. For literature-extracted datasets without predefined standard answers, we adopted a hybrid approach combining LLM-based filtering with human verification. Initially, LLMs were used to screen the data, after which the original text from the research papers and the reasoning process of LLMs were simultaneously provided to a validation model. This new model then assessed the logical soundness of the reasoning. In cases where the responses were ambiguous or controversial, domain experts were consulted for further judgment. As shown in Fig. 3, the validation model, guided by prompts and review content, effectively identified errors in long-text model responses and provided original text excerpts as supporting evidence.
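The answer-matching step for datasets with standard answers can be sketched as a simple filter. In this work the comparison is performed by DeepSeek-V3; the sketch below replaces that verifier with a pluggable `judge` callable whose default is a normalized string comparison, which is a deliberate simplification. All names (`filter_samples`, `exact_match_judge`, the record keys) and the sample data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_judge(standard: str, generated: str) -> bool:
    """Stand-in for an LLM verifier: accept only if the normalized answers are identical."""
    return normalize(standard) == normalize(generated)


def filter_samples(
    samples: List[Sample],
    judge: Callable[[str, str], bool] = exact_match_judge,
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (kept, rejected) according to the judge's verdict."""
    kept, rejected = [], []
    for sample in samples:
        ok = judge(sample["standard_answer"], sample["generated_answer"])
        (kept if ok else rejected).append(sample)
    return kept, rejected


# Hypothetical samples: the second one disagrees with its standard answer.
samples = [
    {"standard_answer": "18.02 g/mol", "generated_answer": "18.02  g/mol"},
    {"standard_answer": "18.02 g/mol", "generated_answer": "20.01 g/mol"},
]
kept, rejected = filter_samples(samples)
print(f"kept {len(kept)} of {len(samples)} samples")
```

Passing the verifier in as a callable keeps the filtering logic independent of the model behind it, so an LLM-based judge could be swapped in without changing the rest of the pipeline.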
Fig. 3 Construction of domain-specific datasets via validation models and manual evaluation of large model responses.
Finally, as shown in Fig. S27, we performed knowledge distillation by training the student model to emulate the reasoning behaviors of the teacher model (DeepSeek-R1) on general chemistry datasets. This procedure achieved only a ∼50% success rate, reflecting the difficulty of answering challenging out-of-domain chemistry questions where relevant knowledge is often absent from the pretraining corpus. In contrast, when distillation was conducted using research papers and reviews, the model could rely on contextual information provided in the documents, leading to an accuracy exceeding 90%. This demonstrates the importance of context-augmented reasoning: rather than recalling memorized facts, the model synthesizes information from scientific literature into structured reasoning traces. The final data distribution is shown in Fig. S28, with a total of 35.8 K data points utilized for LLM training.
In this work, we fine-tuned a small-parameter reasoning LLM, DeepSeek-R1-Distill-Qwen-7B, using the LLaMA-Factory framework. Specifically, we employed supervised fine-tuning to adapt the model to domain-specific tasks and utilized low-rank adaptation to enhance efficiency by reducing trainable parameters while maintaining model performance. This approach enabled efficient adaptation of the LLM with reduced computational cost and memory footprint. As shown in Fig. S29, after a single epoch of training comprising 717 steps, the training loss was reduced to 0.8036.
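The parameter saving from low-rank adaptation can be illustrated with a short calculation. For a frozen weight matrix of shape d × k, LoRA trains two factors of shapes d × r and r × k, so the trainable parameter count drops from d·k to r·(d + k). The dimensions and rank below are illustrative assumptions, not the settings used in this work.

```python
from typing import Tuple


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k          # updating the whole matrix
    lora = r * (d + k)    # updating only the low-rank factors B (d x r) and A (r x k)
    return full, lora


# Illustrative sizes only: a square 4096 x 4096 projection adapted with rank 8.
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

Because the count grows linearly in r, the rank is the main knob for trading adaptation capacity against memory footprint.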
When presented with scientific questions, the trained MOFReasoner is capable of reasoning logically like a scientist and providing well-founded answers. As shown in Fig. 4a, several typical reasoning pathways utilized by MOFReasoner are demonstrated, including understanding the background, application of knowledge, analysis integration, reasoning expansion, solution evaluation, conclusion formation, and open exploration. In addition to knowledge-based Q&A tasks, MOFReasoner can also generate ideas when prompted. In such cases, it follows a scientific reasoning chain that involves steps like extracting key points, reviewing historical studies, identifying core problems, proposing central ideas, and providing verification strategies and hypothesis-testing procedures. While these reasoning processes are somewhat similar to the CoT patterns found in the training dataset, MOFReasoner adapts its reasoning pathways depending on the nature of the question, indicating that MOFReasoner has effectively learned scientific reasoning through supervised fine-tuning. It is important to note that these reasoning pathways are not manually predefined templates, but rather patterns learned from diverse chain-of-thought examples distilled from research articles, review papers and general CoT datasets. Different question types naturally elicit different combinations of these learned patterns, so the pathways shown in Fig. 4a represent a post hoc summary of recurrent reasoning behaviors rather than fixed decision routes. As illustrated in the expected reasoning path shown in Fig. S30 and further demonstrated in Fig. S31 and S32, compared with DeepSeek R1, MOFReasoner exhibits a more disciplined scientific reasoning style, characterized by systematic contextualization, theory-grounded analysis, and coherent integration of evidence.
Fig. 4 (a) Examples of MOFReasoner's reasoning process; (b) four types of tasks for large language model evaluation.
To further validate that MOFReasoner has not only learned to reason but also acquired domain knowledge for answering specialized questions, we designed a benchmark consisting of four task categories: experimental studies of MOFs, chemical mechanisms of adsorption, applications of MOF-based adsorbents, and industrialization-related issues (as shown in Fig. 4b and Tables S5–S7). Each question in these tasks was broken down into multiple scoring points. The complete text of all evaluation questions and the detailed scoring points associated with each question are provided in the SI Section S3 to ensure full transparency and reproducibility. Domain experts evaluated the responses based on four criteria: a correct answer (+1), a correct but imprecise answer (+0.5), a wrong or controversial answer (−0.5), and a serious error answer (−1). Key missing information in the model's response was marked as “missing.” Since the correct content was already rewarded, no additional penalty was applied for missing points. All models were assessed using exactly the same expert-curated questions and scoring scheme.
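The scoring scheme above amounts to a weighted sum over the expert labels. The sketch below is a minimal illustration of that arithmetic, using the label counts reported in Table 1 for three of the models; the function and dictionary names are hypothetical.

```python
# Weights follow the four criteria in the text; "missing" points carry no penalty.
WEIGHTS = {
    "correct": 1.0,
    "inaccurate": 0.5,   # correct but imprecise
    "wrong": -0.5,       # wrong or controversial
    "serious": -1.0,     # serious error
    "missing": 0.0,
}


def total_score(counts: dict) -> float:
    """Sum the expert-assigned labels weighted by the scoring scheme."""
    return sum(WEIGHTS[label] * n for label, n in counts.items())


# Label counts as reported in Table 1.
table1 = {
    "MOFReasoner": {"correct": 25, "inaccurate": 2, "wrong": 1, "serious": 0, "missing": 10},
    "DeepSeek-R1-671B": {"correct": 25, "inaccurate": 14, "wrong": 9, "serious": 9, "missing": 10},
    "DeepSeek-R1-Distill-Qwen-7B": {"correct": 15, "inaccurate": 13, "wrong": 13, "serious": 23, "missing": 20},
}

for model, counts in table1.items():
    print(f"{model}: {total_score(counts)}")
```

The calculation shows why two models with the same number of correct points can end up far apart: MOFReasoner and DeepSeek-R1-671B both have 25 correct points, but the latter loses ground through wrong and seriously erroneous statements.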
As shown in Section S3 and Fig. S33–S104, when comparing different LLMs, we found that the fine-tuned MOFReasoner consistently provided precise answers, addressing the core of each question, without producing severe errors or misleading information. For instance, when asked “How are the dynamic and static adsorption performances of MOFs usually evaluated?”, the model correctly distinguished that dynamic adsorption tests employ breakthrough experiments, while static adsorption involves measuring adsorption isotherms. However, possibly due to the imbalance between reasoning chains and final answers in the training dataset (with reasoning tokens significantly outnumbering answer tokens), and the fact that research papers often focus narrowly on single points, MOFReasoner's responses tend to be concise. After thorough reasoning, it retains only the most credible conclusions. For example, for the question “How to determine the adsorption sites in MOF adsorbents?”, MOFReasoner conservatively reported DFT calculations and GCMC simulations as methods, while omitting single-crystal X-ray diffraction that was considered during reasoning. As summarized in Table 1, MOFReasoner achieved the highest score of 25.5, significantly outperforming its base model DeepSeek-R1-Distill-Qwen-7B, the reasoning model DeepSeek-R1-671B, and even the widely recognized GPT-4.5 and o1 models. Notably, in our benchmark, we observed that GPT-4.5 and o1 occasionally generated literature-style references that were inconsistent with the underlying scientific content or could not be verified (Fig. S74, S83 and S101). Additional control experiments indicate that this performance improvement is not solely due to increased exposure to domain-specific terminology. When trained using only final answers, the resulting model showed limited ability to integrate multiple physicochemical factors and often failed to articulate coherent structure–property relationships relevant to MOF adsorption. 
Furthermore, comparisons between different model initializations suggest that starting from a reasoning-aligned model facilitates the learning of chemically meaningful reasoning patterns, which are more critical for adsorption-related analysis than increasing the model size alone (Table S8 and Fig. S105–S120).
| Model | Correct | Inaccurate | Wrong or controversial | Serious error | Missing | Total score |
|---|---|---|---|---|---|---|
| MOFReasoner | 25 | 2 | 1 | 0 | 10 | 25.5 |
| DeepSeek-R1-Distill-Qwen-7B | 15 | 13 | 13 | 23 | 20 | −8 |
| DeepSeek-R1-Distill-Llama-8B | 13 | 8 | 8 | 18 | 22 | −5 |
| Qwen-Max | 26 | 16 | 10 | 12 | 9 | 17 |
| Qwen-Plus | 20 | 11 | 7 | 8 | 15 | 14 |
| QwQ-32B | 24 | 11 | 9 | 13 | 11 | 12 |
| DeepSeek-R1-671B | 25 | 14 | 9 | 9 | 10 | 18.5 |
| o1-preview | 26 | 17 | 9 | 16 | 9 | 14 |
| GPT-4.5-preview | 26 | 9 | 11 | 9 | 8 | 16 |
The capability of reasoning large models should not be limited to Q&A tasks but should extend to providing meaningful scientific assistance. To further illustrate this potential, we tested MOFReasoner with a rarely mentioned guest molecule in the dataset (benzothiophene) and tasked the model with identifying metal clusters that may exhibit strong binding affinity. As shown in Fig. 5a, MOFReasoner reasoned through factors such as coordination strength and charge density and paid particular attention to the sulfur atom in benzothiophene. During the reasoning process, MOFReasoner comprehensively considered factors such as the Lewis acidity of the metal centers, the size and charge density of the metal ions, electronic structure, coordination environment, geometric configuration, and adsorption enthalpy. We observed that MOFReasoner struggled significantly to distinguish between Zn and Co (Section S4, Table S9) before ultimately ranking the metal ions as Zn²⁺ > Co²⁺ > Cu²⁺. In contrast, both GPT-4.5 and o1 produced the ranking Cu²⁺ > Co²⁺ > Zn²⁺ (Fig. 5b). This case also reveals limitations in current reasoning behaviors. As shown in the benzothiophene adsorption example and its expected reasoning path (Fig. S121), MOFReasoner shows difficulty in consistently weighting multiple competing physicochemical factors, while the reasoning trace of DeepSeek R1 (Fig. S122) does not explicitly incorporate coordination geometry or framework-level constraints. Through subsequent DFT calculations (Fig. 5c–e), we found that although none of the models initially selected the optimal Co paddle-wheel structure, the Zn and Co paddle-wheel configurations recommended by MOFReasoner exhibited substantially stronger binding affinities than the Cu paddle-wheel structure suggested by GPT-4.5 and o1.
Specifically, the Co paddle-wheel structure outperformed Zn by 14.21 kJ mol⁻¹ and Cu by 25.96 kJ mol⁻¹, indicating that the Co metal node forms a stronger interaction with benzothiophene and therefore provides a more favorable adsorption configuration. These results indicate that, while MOFReasoner's reasoning still deviates from the actual optimal choice, its inference process can provide useful qualitative guidance and serve as a proof-of-concept example for assisting scientific reasoning tasks.
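The comparison between the model rankings and the DFT result can be made explicit with a small pairwise check. The sketch below assumes that the two reported differences are binding-energy gaps relative to the Co paddle-wheel (larger value means weaker binding) and counts how many metal pairs each ranking orders the same way as DFT. The pairwise-agreement measure and all names are illustrative and are not part of the evaluation reported here.

```python
from itertools import combinations

# Binding-energy gap relative to the Co paddle-wheel, in kJ/mol (from the text).
# Assumption: a larger gap means weaker binding of benzothiophene.
relative_energy = {"Co": 0.0, "Zn": 14.21, "Cu": 25.96}


def pairwise_agreement(ranking: list, ref_energy: dict) -> float:
    """Fraction of metal pairs that the ranking orders the same way as the reference."""
    # combinations() preserves list order, so each pair is (ranked higher, ranked lower).
    pairs = list(combinations(ranking, 2))
    agree = sum(ref_energy[a] < ref_energy[b] for a, b in pairs)
    return agree / len(pairs)


rankings = {
    "MOFReasoner": ["Zn", "Co", "Cu"],
    "GPT-4.5 / o1": ["Cu", "Co", "Zn"],
}

for model, ranking in rankings.items():
    print(f"{model}: {pairwise_agreement(ranking, relative_energy):.2f}")

# Gap between the Zn and Cu paddle-wheels implied by the two reported differences.
zn_vs_cu = round(relative_energy["Cu"] - relative_energy["Zn"], 2)
print(f"Zn binds more strongly than Cu by {zn_vs_cu} kJ/mol")
```

Under these assumptions MOFReasoner orders two of the three metal pairs consistently with DFT, whereas the Cu-first ranking orders one, which matches the qualitative conclusion drawn in the text.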
Supplementary information (SI): detailed methodology for dataset construction and knowledge distillation, chain-of-thought extraction procedures, model training and evaluation details, as well as supplementary tables and figures supporting the main text. See DOI: https://doi.org/10.1039/d5dd00429b.
This journal is © The Royal Society of Chemistry 2026