# Large language models in materials science: assessing RAG evaluation frameworks through graphene synthesis

**Authors:** Zen Han Cho, Matthew Osvaldo, Sayan Doloi, Maloy Das, Jun Ci Goh, Bo Sheng Tan, Jiali Wang, Yujia Li, Xingchi Xiao, Amrita Joshi, Leonard Wei Tat Ng

PMC · DOI: 10.1039/d5ra09726f · RSC Advances · 2026-02-27

## TL;DR

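This paper benchmarks three automated evaluators (RAGAS, BERTScore, and LLM-as-a-Judge) against expert human evaluation for Retrieval-Augmented Generation (RAG) systems used in scientific research, with graphene synthesis in materials science as the case study.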

## Contribution

The study introduces a systematic evaluation protocol for scientific RAG systems, built on 20 domain-specific questions, and uses it to compare how well different automated evaluation approaches reflect the benefit of retrieval augmentation.

## Key findings

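- BERTScore lacks the interpretability and score sensitivity needed to distinguish meaningful performance differences in scientific RAG evaluation.
- RAGAS successfully captures performance improvements from retrieval augmentation in scientific RAG systems (0.52 points for Gemini and 1.03 points for Qwen on a 10-point scale), but its absolute scores remain hard to interpret for scientific content.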
- LLM-as-a-Judge fails to capture retrieval benefits in scientific contexts.
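For context on the BERTScore finding, the metric's core step is a greedy matching of token embeddings by cosine similarity. The following is a minimal sketch of that step only, assuming hand-written toy vectors in place of real contextual embeddings. The function names `cosine` and `greedy_match_scores` are hypothetical, and the optional IDF weighting and baseline rescaling of the published metric are omitted. It is not the configuration used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy embedding matching.

    Precision: each candidate token is matched to its most similar reference token.
    Recall: each reference token is matched to its most similar candidate token.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-D "embeddings" (assumption: real embeddings come from a pretrained model).
candidate = [[1.0, 0.0], [0.0, 1.0]]  # two candidate tokens
reference = [[1.0, 0.0]]              # one reference token
p, r, f1 = greedy_match_scores(candidate, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because every token is scored by its single best match, the result is one similarity-based number per answer, with no indication of which scientific claims were right or wrong.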

## Abstract

Retrieval-Augmented Generation (RAG) systems increasingly support scientific research, yet evaluating their performance in specialized domains remains challenging due to the technical complexity and precision requirements of scientific knowledge. This study presents the first systematic analysis of automated evaluation frameworks for scientific RAG systems, using graphene synthesis in materials science as a representative case study. We develop a comprehensive evaluation protocol comparing four assessment approaches: RAGAS (an automated RAG evaluation framework), BERTScore, LLM-as-a-Judge, and expert human evaluation across 20 domain-specific questions. Our analysis of automated evaluators reveals that BERTScore lacks the interpretability and score sensitivity required to distinguish meaningful performance differences, while LLM-as-a-Judge fails to capture retrieval augmentation benefits. In contrast, RAGAS successfully captures relative performance improvements from retrieval augmentation, identifying performance gains in RAG-augmented systems (0.52-point improvement for Gemini, 1.03-point for Qwen on a 10-point scale) and demonstrating particular sensitivity to retrieval benefits in smaller, open-source models. However, it still exhibits fundamental limitations in absolute score interpretation for scientific content. These findings establish methodological guidelines for scientific RAG evaluation and highlight critical considerations for researchers deploying AI systems in specialized domains.

Automated evaluation of RAG systems is increasingly used in scientific applications, yet its reliability remains unclear. Using graphene synthesis as a case study, this work systematically benchmarks three automated evaluators.
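The retrieval gains quoted in the abstract are differences between scores with and without retrieval on a 10-point scale. A minimal sketch of one way such a gain could be computed is shown here, assuming it is the mean paired difference over the same set of questions. That aggregation, the function name `retrieval_gain`, and the score values are illustrative assumptions, not details or data from the paper.

```python
from statistics import mean


def retrieval_gain(baseline_scores, rag_scores):
    """Mean paired difference (RAG minus baseline) over the same questions."""
    if len(baseline_scores) != len(rag_scores):
        raise ValueError("both score lists must cover the same questions")
    if not baseline_scores:
        raise ValueError("at least one question is required")
    return mean(r - b for b, r in zip(baseline_scores, rag_scores))


# Placeholder evaluator scores on a 10-point scale, NOT values from the paper.
baseline = [6.0, 7.0, 5.0, 8.0]   # model answering without retrieval
with_rag = [7.0, 7.5, 6.0, 8.5]   # same questions, with retrieved context
gain = retrieval_gain(baseline, with_rag)
print(f"mean retrieval gain: {gain:.2f} points")
```

Pairing scores by question keeps the comparison tied to identical prompts, so the gain reflects the effect of retrieval and not differences in question difficulty.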

## Full-text entities

- **Chemicals:** PMMA (MESH:D019904), graphene (MESH:D006108), graphene oxide (MESH:C000628730)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12947896/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12947896/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12947896/full.md

---
Source: https://tomesphere.com/paper/PMC12947896