Prompt perturbation and fraction facilitation sometimes strengthen Large Language Model scores
Mike Thelwall

TL;DR
This study investigates how prompt design strategies, including perturbations and fractional scoring, can improve Large Language Models' ability to evaluate research quality, revealing model-specific sensitivities and effective averaging techniques.
Contribution
It demonstrates that prompt variations, averaging, and fractional scoring can enhance LLM scoring accuracy, providing practical strategies for prompt engineering in evaluation tasks.
Findings
Prompt variations improve scoring consistency
Averaging scores from similar prompts enhances reliability
Allowing fractional scores reveals model certainty levels
Abstract
Large Language Models (LLMs) can be tasked with scoring texts according to pre-defined criteria and on a defined scale, but there is no recognised optimal prompting strategy for this. This article focuses on the task of LLMs scoring journal articles for research quality on a four-point scale, testing how user prompt design can enhance this ability. Based primarily on 1.7 million Gemma3 27b queries for 2780 health and life science articles with 58 similar prompts, the results show that improvements can be obtained by (a) testing semantically equivalent prompt variations, (b) averaging scores from semantically equivalent prompts, (c) specifying that fractional scores are allowed, and possibly also (d) not drawing attention to the input being partial. Whilst (a) and (d) suggest that models can be sensitive to how a task is phrased, (b) and (c) suggest that strategies to leverage more of…
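Strategies (b) and (c) from the abstract can be illustrated with a minimal sketch. It is not the author's code: the function names `parse_score` and `average_score`, the regular expression, and the example responses are all hypothetical, and the sketch assumes each LLM response contains its score as the first number between 1 and 4, possibly fractional.

```python
import re
from statistics import mean
from typing import Iterable, Optional

# Hypothetical pattern: a number from 1 to 4, optionally with a decimal part,
# that is not embedded in a longer number such as "10" or "0.3".
SCORE_PATTERN = re.compile(r"(?<![\d.])([1-4](?:\.\d+)?)(?!\d)")


def parse_score(response: str) -> Optional[float]:
    """Return the first number within [1, 4] in a response, or None if absent."""
    for match in SCORE_PATTERN.finditer(response):
        value = float(match.group(1))
        if 1.0 <= value <= 4.0:
            return value
    return None


def average_score(responses: Iterable[str]) -> Optional[float]:
    """Average the parsed scores for one article across prompt variants.

    Responses without a usable score are skipped rather than counted as zero.
    """
    scores = [s for s in map(parse_score, responses) if s is not None]
    return mean(scores) if scores else None


# Made-up responses to three semantically equivalent prompts, plus one refusal.
example_responses = [
    "Score: 3",
    "I would rate this 2.5 out of 4.",
    "3.5*",
    "No score given",
]
print(average_score(example_responses))
```

Skipping unparseable responses is one design choice among several; a stricter pipeline might instead re-query the model or log the failure for inspection.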
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods · Topic Modeling
