Issue 13, 2026, Issue in Progress

Large language models in materials science: assessing RAG evaluation frameworks through graphene synthesis

Abstract

Retrieval-Augmented Generation (RAG) systems increasingly support scientific research, yet evaluating their performance in specialized domains remains challenging due to the technical complexity and precision requirements of scientific knowledge. This study presents the first systematic analysis of automated evaluation frameworks for scientific RAG systems, using graphene synthesis in materials science as a representative case study. We develop a comprehensive evaluation protocol comparing four assessment approaches: RAGAS (an automated RAG evaluation framework), BERTScore, LLM-as-a-Judge, and expert human evaluation across 20 domain-specific questions. Our analysis of automated evaluators reveals that BERTScore lacks the interpretability and score sensitivity required to distinguish meaningful performance differences, while LLM-as-a-Judge fails to capture retrieval augmentation benefits. In contrast, RAGAS successfully captured relative performance improvements from retrieval augmentation, identifying performance gains in RAG-augmented systems (0.52-point improvement for Gemini, 1.03-point for Qwen on a 10-point scale), and demonstrating particular sensitivity to retrieval benefits in smaller, open-source models. However, RAGAS still exhibits fundamental limitations in absolute score interpretation for scientific content. These findings establish methodological guidelines for scientific RAG evaluation and highlight critical considerations for researchers deploying AI systems in specialized domains.
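The abstract reports RAG improvements as mean score deltas on a 10-point scale (baseline model vs. RAG-augmented model, averaged over the question set). The sketch below illustrates that delta computation only; all per-question scores in it are invented placeholders, not the study's data, and the model names are used purely as labels.

```python
# Illustrative sketch of the score-delta comparison described in the abstract.
# Per-question scores here are hypothetical placeholders, NOT the paper's data.

def mean(scores):
    """Average a list of per-question evaluator scores."""
    return sum(scores) / len(scores)

def rag_improvement(baseline_scores, rag_scores):
    """Mean gain from retrieval augmentation on a 10-point scale."""
    return round(mean(rag_scores) - mean(baseline_scores), 2)

# Hypothetical per-question scores for two models, with and without retrieval
baseline = {"gemini": [6.8, 7.2, 6.5, 7.1], "qwen": [5.4, 5.9, 5.2, 5.7]}
with_rag = {"gemini": [7.4, 7.7, 7.0, 7.7], "qwen": [6.6, 6.9, 6.2, 6.7]}

for model in baseline:
    delta = rag_improvement(baseline[model], with_rag[model])
    print(f"{model}: RAG improvement = {delta:+.2f} points")
```

In this toy setup the smaller model shows the larger delta, mirroring the paper's qualitative finding that retrieval benefits are most pronounced for smaller, open-source models.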

Graphical abstract: Large language models in materials science: assessing RAG evaluation frameworks through graphene synthesis

Supplementary files


Article information

Article type
Paper
Submitted
16 Dec 2025
Accepted
04 Feb 2026
First published
27 Feb 2026
This article is Open Access
Creative Commons BY license

RSC Adv., 2026, 16, 11306-11313

Z. H. Cho, M. Osvaldo, S. Doloi, M. Das, J. C. Goh, B. S. Tan, J. Wang, Y. Li, X. Xiao, A. Joshi and L. W. T. Ng, RSC Adv., 2026, 16, 11306 DOI: 10.1039/D5RA09726F

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
