Can Smaller Large Language Models Evaluate Research Quality?

Mike Thelwall

arXiv:2508.07196·cs.DL·August 12, 2025

Open Access

TL;DR

This study evaluates whether a smaller, downloadable LLM can assess research quality. Its scores correlate positively with expert scores, though less strongly than those of larger models, indicating that smaller models can be useful for research evaluation tasks.

Contribution

It demonstrates that a smaller LLM can effectively estimate research quality scores, challenging the notion that only large models possess this capability.

Findings

01

Gemma-3-27b-it correlates positively with expert scores across fields.

02

Its correlations have 83.8% of the strength of the ChatGPT 4o correlations and 94.7% of the strength of the ChatGPT 4o-mini correlations.

03

Unlike the two larger LLMs, Gemma-3-27b-it's correlations do not increase substantially when scores are averaged across five repetitions.

Abstract

Although both Google Gemini (1.5 Flash) and ChatGPT (4o and 4o-mini) give research quality evaluation scores that correlate positively with expert scores in nearly all fields, and more strongly than citations in most, it is not known whether this is true for smaller Large Language Models (LLMs). In response, this article assesses Google's Gemma-3-27b-it, a downloadable LLM (60Gb). The results for 104,187 articles show that Gemma-3-27b-it scores correlate positively with an expert research quality score proxy for all 34 Units of Assessment (broad fields) from the UK Research Excellence Framework 2021. The Gemma-3-27b-it correlations have 83.8% of the strength of ChatGPT 4o and 94.7% of the strength of ChatGPT 4o-mini correlations. Differently from the two larger LLMs, the Gemma-3-27b-it correlations do not increase substantially when the scores are averaged across five repetitions, its…
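The comparison described in the abstract can be illustrated with a minimal sketch. It assumes a Spearman rank correlation, which the abstract does not specify, and it uses made-up toy scores rather than any data from the paper. The function names (`average_ranks`, `spearman`, `mean_over_repetitions`, `relative_strength`) are hypothetical helpers introduced only for this illustration, not the author's code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_over_repetitions(runs):
    """runs holds one score list per repetition, all in the same article order."""
    return [mean(article_scores) for article_scores in zip(*runs)]


def relative_strength(rho_model, rho_reference):
    """Express one correlation as a percentage of a reference correlation."""
    return 100.0 * rho_model / rho_reference


# Toy data, invented for illustration only (not from the paper).
expert = [1, 2, 2, 3, 4, 4]
runs = [
    [2, 2, 3, 3, 3, 4],
    [1, 3, 2, 3, 4, 4],
    [2, 2, 2, 4, 3, 4],
]

rho_single = spearman(runs[0], expert)
rho_averaged = spearman(mean_over_repetitions(runs), expert)
print("single run:", round(rho_single, 3))
print("averaged over repetitions:", round(rho_averaged, 3))
print("single as % of averaged:", round(relative_strength(rho_single, rho_averaged), 1))
```

Comparing `rho_single` with `rho_averaged` mirrors the question of whether averaging repeated scores helps, and `relative_strength` mirrors how a figure such as 83.8% could be derived from two correlations, under the assumptions stated above.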

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Radiomics and Machine Learning in Medical Imaging