A dual-mode large language model assistant for on-surface reactions via fine-tuning and retrieval-augmented generation

Juan Xiang; Qi Huang; Xinyi Zhang; Tairan Yang; Zhiwen Zhu; Chanyu Li; Liangliang Cai; Qiang Sun

doi:10.1039/D6SC01168C

Juan Xiang,^a Qi Huang,^a Xinyi Zhang,^a Tairan Yang,^a Zhiwen Zhu,^a Chanyu Li,^b Liangliang Cai^a and Qiang Sun*^ab

Author affiliations

* Corresponding authors

^a Materials Genome Institute, Shanghai University, 200444 Shanghai, China
E-mail: qiangsun@shu.edu.cn

^b Qianweichang College, Shanghai University, 200444 Shanghai, China

Abstract

Surface reactions underpin catalysis, nanomaterials, energy conversion, and molecular-scale fabrication, yet the field suffers from fragmented knowledge dispersed across unstructured literature, hindering systematic analysis and data-driven discovery. Existing chemical databases and language models inadequately capture the domain-specific semantics and experimental parameters unique to on-surface reactions. Here, we present an integrated framework that transforms the dispersed surface-chemistry literature into structured, machine-readable knowledge and leverages it to develop a domain-specialized large language model (LLM) assistant for on-surface reactions. We curated and semantically screened hundreds of thousands of publications to construct a surface-chemistry corpus, from which we extracted 44 predefined reaction attributes across more than 44 000 studies of surface reactions. These structured records were used to build both a high-quality reaction database and a domain-specific question–answering dataset. On this basis, we developed a dual-mode LLM system that combines a parameter-efficient fine-tuned reasoning model with a dual-source retrieval-augmented generation (RAG) framework, enabling both deep inference and verifiable retrieval of experimental parameters. Evaluations demonstrate that the fine-tuned LLM outperforms existing chemistry-oriented language models on surface-chemistry question–answering, achieving a BERT-F1 score exceeding 0.8. Incorporation of the RAG framework further improves factual accuracy, completeness, and reasoning consistency by grounding responses in the retrieved literature and structured reaction data. Latent-space analyses reveal that domain-specific fine-tuning reorganizes internal representations toward task-oriented coherence. This work establishes a scalable pathway for converting fragmented surface-chemistry knowledge into an intelligent platform, paving the way toward data-driven prediction, experimental planning and automated reasoning in on-surface reactions.
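
The abstract does not state how the BERT-based F1 score is computed. As an illustration only, the sketch below assumes the standard BERTScore definition with greedy token matching and without IDF weighting or baseline rescaling: precision averages, over the answer tokens, the best cosine similarity to any reference token; recall does the same from the reference side; F1 is their harmonic mean. The function names and the two-dimensional toy vectors are hypothetical stand-ins for contextual token embeddings and are not taken from the article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(candidate_vecs, reference_vecs):
    """Greedy-matching F1 over token embeddings (assumed BERTScore form)."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy embeddings: two tokens per text, two dimensions each.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 1.0]]

score = bertscore_f1(candidate, reference)
print(f"F1 = {score:.4f}")
```

In practice the vectors would come from a pretrained encoder, and the reported value depends on the encoder, the layer used and any rescaling, none of which are given in the abstract.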

Article information

DOI: https://doi.org/10.1039/D6SC01168C
Article type: Edge Article
Submitted: 10 Feb 2026
Accepted: 17 Apr 2026
First published: 20 Apr 2026
This article is Open Access

All publication charges for this article have been paid for by the Royal Society of Chemistry

Chem. Sci., 2026, Advance Article

J. Xiang, Q. Huang, X. Zhang, T. Yang, Z. Zhu, C. Li, L. Cai and Q. Sun, Chem. Sci., 2026, Advance Article, DOI: 10.1039/D6SC01168C

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.

Chemical Science
