Large language models in materials science: assessing RAG evaluation frameworks through graphene synthesis
Abstract
Retrieval-Augmented Generation (RAG) systems increasingly support scientific research, yet evaluating their performance in specialized domains remains challenging due to the technical complexity and precision requirements of scientific knowledge. This study presents the first systematic analysis of automated evaluation frameworks for scientific RAG systems, using graphene synthesis in materials science as a representative case study. We develop a comprehensive evaluation protocol comparing four assessment approaches: RAGAS (an automated RAG evaluation framework), BERTScore, LLM-as-a-Judge, and expert human evaluation across 20 domain-specific questions. Our analysis of automated evaluators reveals that BERTScore lacks the interpretability and score sensitivity required to distinguish meaningful performance differences, while LLM-as-a-Judge fails to capture retrieval augmentation benefits. In contrast, RAGAS successfully captures relative performance improvements from retrieval augmentation, identifying gains in RAG-augmented systems (a 0.52-point improvement for Gemini and a 1.03-point improvement for Qwen on a 10-point scale) and demonstrating particular sensitivity to retrieval benefits in smaller, open-source models. However, RAGAS still exhibits fundamental limitations in absolute score interpretation for scientific content. These findings establish methodological guidelines for scientific RAG evaluation and highlight critical considerations for researchers deploying AI systems in specialized domains.
