Implicit and Explicit Research Quality Score Probabilities from ChatGPT
Mike Thelwall, Yunhan Yang

TL;DR
This study evaluates how ChatGPT's internal probability estimates can be used to assess research article quality, finding that token probability-based scoring offers a cost-effective and accurate ranking method aligned with human quality judgments.
Contribution
It introduces and tests novel strategies using ChatGPT's token probabilities for research quality assessment, demonstrating improved accuracy and cost-effectiveness over explicit likelihood requests.
Findings
Weighted average scores calculated from token probabilities correlate better with human judgments than scores obtained from explicit likelihood requests.
Explicit likelihood requests decrease scoring accuracy.
Token probabilities provide a reliable, cheaper ranking method.
Abstract
The large language model (LLM) ChatGPT's quality scores for journal articles correlate more strongly with human judgements than some citation-based indicators in most fields. Averaging multiple ChatGPT scores improves the results, apparently leveraging its internal probability model. To leverage these probabilities, this article tests two novel strategies: requesting percentage likelihoods for scores and extracting the probabilities of alternative tokens in the responses. The probability estimates were then used to calculate weighted average scores. Both strategies were evaluated with five iterations of ChatGPT 4o-mini on 96,800 articles submitted to the UK Research Excellence Framework (REF) 2021, using departmental average REF2021 quality scores as a proxy for article quality. The data was analysed separately for each of the 34 field-based REF Units of Assessment. For the first…
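The weighted average score described in the abstract can be illustrated with a minimal sketch. It assumes the candidate scores are the REF quality levels 1 to 4 and that the model's alternative tokens arrive as a plain dictionary mapping token text to log probability. The function name `weighted_score`, the dictionary layout, and the example numbers are hypothetical, chosen for illustration, and are not taken from the paper.

```python
import math


def weighted_score(top_logprobs, valid_scores=("1", "2", "3", "4")):
    """Return the probability-weighted average of the candidate score tokens.

    top_logprobs: dict mapping token text to its log probability
    (hypothetical layout; a real API response would need unpacking first).
    """
    probs = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token in valid_scores:
            # Convert the log probability back to a probability and merge
            # variants of the same score token (e.g. " 3" and "3").
            probs[token] = probs.get(token, 0.0) + math.exp(logprob)

    total = sum(probs.values())
    if total == 0:
        raise ValueError("no score tokens among the alternatives")

    # Renormalise over the score tokens only, then take the expected score.
    return sum(int(token) * p for token, p in probs.items()) / total


# Illustrative values only: the model puts 60% on "3", 30% on "4", 10% on "2".
example = {"3": math.log(0.6), "4": math.log(0.3), "2": math.log(0.1)}
print(round(weighted_score(example), 2))
```

Renormalising over the score tokens is a design choice in this sketch: it discards probability mass assigned to non-score tokens so that the result stays within the score range.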
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · Meta-analysis and systematic reviews
