CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations

Paulo Cavalin; Cassia Sanctos; Marcelo Grave; Claudio Pinhanez; Yago Primerano

arXiv:2512.23711·cs.CL·January 1, 2026

Open Access · 3 Reviews

TL;DR

The paper introduces CAT, a framework that visualizes and quantifies the relationship between accuracy and response consistency in LLMs under input variations, enhancing evaluation methods.

Contribution

It proposes CAR curves and the CORE index to analyze the trade-off between accuracy and consistency, providing a nuanced evaluation approach for LLMs.

Findings

01

Demonstrates CAT across various LLMs and benchmarks.

02

Shows how accuracy varies with consistency requirements.

03

Provides tools for extended evaluation beyond multiple-choice tasks.

Abstract

We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores; more recently, consistency has come to be considered an essential property for deploying LLMs in high-stakes, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their interdependency also needs to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the…
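To make the idea of a CAR curve concrete, here is a minimal Python sketch. It assumes one plausible reading of a "consistency requirement": a question counts as correct at threshold c only if the model gives the right answer in at least a fraction c of its input variations. The function names `mca` and `car_curve`, the toy data, and the threshold grid are illustrative assumptions, not the authors' code, and the exact definition used in the paper is cut off in the abstract above.

```python
from collections import Counter

def mca(responses, gold, c):
    """Assumed definition: fraction of questions whose correct answer
    appears in at least a fraction c of the input variations."""
    hits = 0
    for answers, correct in zip(responses, gold):
        share = Counter(answers)[correct] / len(answers)
        if share >= c:
            hits += 1
    return hits / len(gold)

def car_curve(responses, gold, thresholds):
    """Accuracy as a function of the consistency threshold."""
    return [(c, mca(responses, gold, c)) for c in thresholds]

# Toy data (hypothetical): 3 questions, 4 input variations each.
responses = [
    ["A", "A", "A", "A"],  # correct in 4/4 variations
    ["B", "B", "C", "B"],  # correct in 3/4 variations
    ["D", "A", "A", "C"],  # correct in 1/4 variations
]
gold = ["A", "B", "D"]

for c, acc in car_curve(responses, gold, [0.25, 0.5, 0.75, 1.0]):
    print(f"c={c:.2f}  accuracy={acc:.3f}")
```

In this toy case accuracy drops from 1.0 at c = 0.25 to one third at c = 1.0, which is the kind of decay such a curve is meant to expose.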

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

(1) Timely focus on consistency and robustness. The paper addresses an important and underexplored dimension of LLM evaluation: the stability of model predictions under semantically equivalent perturbations. This focus is especially timely given growing concerns about data leakage, contamination, and benchmark overfitting. Controlled perturbation and consistency analysis can help reveal whether a model has genuinely learned the underlying task or simply memorized patterns from training data. A

Weaknesses

(1) Clarity and organization of exposition (Figures 1–2; Section 3). The paper is difficult to follow on an initial read, largely because key figures and metrics are introduced out of order. Figures 1 and 2, showing synthetic CAR curves and the proposed CORE metric, appear in the introduction before any of the underlying metrics (MCQA+, MV, MCA, CORE) are defined. This sequencing makes the early figures challenging to interpret without the accompanying metric definitions, which could be introd

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper tackles an important and underexplored dimension of LLM evaluation, i.e., the trade-off between accuracy and consistency, which is critical for model trustworthiness and real-world deployment.
- The definitions of MCA, CAR, and CORE are mathematically well specified and easy to reproduce. The CAR curves provide an intuitive visualization analogous to calibration or precision–recall plots.
- The study includes diverse models (general-purpose and medical-domain) and benchmarks, illustra

Weaknesses

1. While the framework is well-executed, its methodological novelty is limited. MCA essentially generalizes the majority voting (MV) metric with a tunable threshold; CAR curves are simply MCA(c) plotted over varying thresholds, which is a standard evaluation pattern; CORE combines the area (AUC) and a DTW-based shape similarity, which feels like an engineering aggregation rather than a fundamentally new evaluation principle.
2. The use of DTW to measure curve similarity appears unnecessarily com
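For readers unfamiliar with the two ingredients the reviewer names, the following Python sketch computes a trapezoidal area under a curve and a classic dynamic time warping (DTW) distance between a curve and a reference curve. This is only an illustration of the building blocks: how the paper actually combines them into CORE is not stated in this excerpt, so no combined score is computed here. The function names, the sample curve, and the choice of a flat "ideal" reference are all assumptions.

```python
def trapezoid_auc(xs, ys):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
        for i in range(len(xs) - 1)
    )

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]

# Hypothetical curve: accuracy at consistency thresholds 0.0, 0.5, 1.0.
thresholds = [0.0, 0.5, 1.0]
curve = [1.0, 0.5, 0.0]
ideal = [1.0, 1.0, 1.0]  # assumed reference: accuracy never drops

print(trapezoid_auc(thresholds, curve))  # area under the sample curve
print(dtw_distance(curve, ideal))        # shape distance to the reference
```

When both curves are sampled on the same threshold grid, as here, a pointwise distance would capture much the same information as DTW, which is consistent with the reviewer's concern about added complexity.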

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The CAR curves seem to capture an interesting dimension regarding stochastic LLM evaluation.

Weaknesses

1. The writing is not clear enough.
2. There are redundant parts, e.g., eq. 6 and 7, the evaluation prompt.
3. Overall, the contribution is suited for a full paper.
4. There is a lack of a bigger story behind the results. No clear problem or gap was stated that the paper solves.
5. A bit of odd notation, use an indicator function. The CAR curve is continuous, yet it is presented as discrete.
6. The experimental setup is unclear. Why consider llama-1?
7. It is not clear what information regular m

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification