Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
Mike Thelwall, Ehsan Mohammadi

TL;DR
This study evaluates whether medium-sized, small, and reasoning language models can rate research articles for quality. Medium-sized models perform similarly to ChatGPT 4o-mini and Gemini 2.0 Flash, 1 billion parameters is often too few and 4 billion sometimes too few, score averaging helps, and reasoning models offer no clear benefit.
Contribution
It demonstrates that smaller LLMs over 4 billion parameters can effectively assess research quality, and highlights score averaging as a simple, effective strategy, expanding the potential for offline research evaluation tools.
Findings
Medium-sized LLMs perform similarly to ChatGPT 4o-mini and Gemini 2.0 Flash in research quality assessment.
Score averaging from multiple queries improves rating reliability.
Reasoning models did not show a clear advantage over non-reasoning models.
Abstract
Previous research has shown that journal article quality ratings from the cloud-based Large Language Model (LLM) families ChatGPT and Gemini and the medium-sized open-weights LLM Gemma3 27b correlate moderately with expert research quality scores. This article assesses whether other medium-sized LLMs, smaller LLMs, and reasoning models have similar abilities. This is tested with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1 on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. Few-shot and score averaging approaches are also evaluated. The results suggest that medium-sized LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Reasoning models did not have a clear advantage. Moreover, averaging scores from multiple…
