Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

Paulo Cavalin; Cassia Sanctos; Marcelo Grave; Claudio Pinhanez; Yago Primerano

arXiv:2511.21860 · cs.CL · December 1, 2025

Open Access

TL;DR

This paper introduces the Consistency-Rebalanced Accuracy (CoRA) metric, which makes LLM scores on multiple-choice benchmarks more reliable: it measures response consistency on synthetically generated questions with altered answer choices and scales down the scores of models that answer inconsistently.

Contribution

The paper proposes CoRA, a metric that adjusts an LLM's multiple-choice score according to its response consistency, making benchmark evaluations more reliable.

Findings

01

LLMs can have high MCQA scores but low response consistency.

02

CoRA effectively scales down scores of inconsistent models.

03

Response consistency correlates with true model reliability.

Abstract

In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, which improves the reliability of Large Language Model (LLM) scores computed on multiple-choice (MC) benchmarks. Our metric examines the response consistency of LLMs, taking advantage of synthetically generated questions with altered answer choices. Using two intermediate scores, namely Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the LLM's level of consistency. We present evaluations on different benchmarks using diverse LLMs. They demonstrate not only that LLMs can show low response consistency even when they achieve high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.
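The abstract does not give the formulas for BMCA, CI, or CoRA, so the following is only a rough sketch of the general idea of checking response consistency under altered answer choices, not the paper's method. It assumes the alteration is a simple cyclic rotation of the choice order, and the names `rotated_variants`, `consistency_report`, and `answer_fn` are hypothetical. `answer_fn` stands in for an LLM call that returns the text of the chosen answer.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (question text, answer choices, gold answer text); a hypothetical item format
Item = Tuple[str, Sequence[str], str]


def rotated_variants(choices: Sequence[str]) -> List[List[str]]:
    """Return every non-trivial cyclic rotation of the answer choices.

    Assumption: rotation is used here only as a simple, deterministic
    stand-in for the paper's synthetic answer-choice alterations.
    """
    base = list(choices)
    return [base[k:] + base[:k] for k in range(1, len(base))]


def consistency_report(
    items: Sequence[Item],
    answer_fn: Callable[[str, List[str]], str],
) -> Dict[str, float]:
    """Compare plain accuracy with accuracy that must survive all rotations."""
    correct_original = 0
    correct_all_variants = 0
    for question, choices, gold in items:
        # Plain MCQA accuracy: one pass over the original choice order.
        if answer_fn(question, list(choices)) == gold:
            correct_original += 1
        # Stricter score: the model must pick the gold answer in every variant.
        if all(answer_fn(question, v) == gold for v in rotated_variants(choices)):
            correct_all_variants += 1
    n = len(items)
    return {
        "accuracy": correct_original / n,
        "consistent_accuracy": correct_all_variants / n,
    }
```

In this sketch, a model that always picks the first listed option scores perfectly on items whose gold answer happens to be listed first, yet its `consistent_accuracy` drops to zero once the choices are rotated. That gap between a high plain score and low consistency is the kind of discrepancy the abstract says CoRA is meant to penalize.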

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Natural Language Processing Techniques