Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs
Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

TL;DR
This paper introduces an evaluation protocol for reasoning LLMs that decomposes token efficiency into completion rate, conditional correctness, and generated length, showing what a model's tokens actually buy rather than reporting a single accuracy score.
Contribution
The paper proposes a trace-optional evaluation method that decomposes token efficiency into completion rate, conditional correctness given completion, and generated length. The method applies even to closed models and comes with a detailed analysis framework.
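One way to read this decomposition is as an arithmetic identity. The sketch below assumes that token efficiency means correct answers per generated token; this summary does not state the paper's exact definition, so treat the formula as an illustration rather than the authors' method. Under that assumption, efficiency equals completion rate times conditional correctness divided by mean generated length. The names `Run` and `decompose` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """One model attempt on one instance (hypothetical record layout)."""
    completed: bool  # a final answer was produced (e.g. not truncated)
    correct: bool    # the final answer was right; ignored if not completed
    tokens: int      # number of generated tokens


def decompose(runs: List[Run]) -> Dict[str, float]:
    """Split correct-answers-per-token into three observable factors.

    Assumed definition (not taken from the paper):
        efficiency = n_correct / total_tokens
                   = completion_rate * cond_correctness / mean_length
    """
    n = len(runs)
    total_tokens = sum(r.tokens for r in runs)
    if n == 0 or total_tokens == 0:
        raise ValueError("need at least one run with a positive token count")

    n_completed = sum(1 for r in runs if r.completed)
    n_correct = sum(1 for r in runs if r.completed and r.correct)

    completion_rate = n_completed / n
    # Correctness is conditioned on completion; define it as 0 if nothing completed.
    cond_correctness = n_correct / n_completed if n_completed else 0.0
    mean_length = total_tokens / n

    return {
        "completion_rate": completion_rate,
        "cond_correctness": cond_correctness,
        "mean_length": mean_length,
        "efficiency": n_correct / total_tokens,
    }


example_runs = [
    Run(completed=True, correct=True, tokens=100),
    Run(completed=True, correct=False, tokens=200),   # logic failure
    Run(completed=False, correct=False, tokens=500),  # truncation
    Run(completed=True, correct=True, tokens=200),
]
parts = decompose(example_runs)
```

Because the three factors multiply back to the same efficiency value, two models with equal efficiency can still be told apart: one may lose through truncation (low completion rate), another through wrong answers (low conditional correctness), and a third through verbosity (high mean length).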
Findings
Efficiency rankings are more stable than accuracy rankings across benchmarks.
Decomposition separates different failure modes like logic, truncation, and verbosity.
Evaluation artifacts and templates are released for transparent assessment.
Abstract
As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
