OckBench: Measuring the Efficiency of LLM Reasoning
Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

TL;DR
OckBench is a new benchmark that evaluates large language models on both reasoning accuracy and token efficiency, revealing significant redundancy and cost implications in current models.
Contribution
It introduces the first standardized benchmark measuring both accuracy and token efficiency in reasoning and coding tasks.
Findings
Token efficiency varies up to 5.0× among models with similar accuracy.
Current models exhibit significant redundancy, inflating costs and latency.
Benchmark results highlight the need for optimizing token usage in LLMs.
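One way to read the "up to 5.0×" finding is as the largest ratio of output token counts between any two models whose accuracies are close. The sketch below illustrates that reading only. This summary does not give the paper's exact metric, so the function name `max_token_ratio`, the `accuracy_tolerance` threshold, and all numbers in `results` are hypothetical placeholders, not values from OckBench.

```python
from itertools import combinations


def max_token_ratio(results, accuracy_tolerance=0.02):
    """Return the largest ratio of mean output tokens between any two
    models whose accuracies differ by at most `accuracy_tolerance`.

    `results` is a list of (model_name, accuracy, mean_output_tokens).
    Returns 1.0 when no pair of models has comparable accuracy.
    """
    best = 1.0
    for (_, acc_a, tok_a), (_, acc_b, tok_b) in combinations(results, 2):
        # Only compare models that solve the task about equally well.
        if abs(acc_a - acc_b) <= accuracy_tolerance:
            ratio = max(tok_a, tok_b) / min(tok_a, tok_b)
            best = max(best, ratio)
    return best


# Hypothetical placeholder numbers, NOT results from the paper.
results = [
    ("model_a", 0.80, 1000.0),
    ("model_b", 0.81, 5000.0),
    ("model_c", 0.60, 400.0),
]

print(max_token_ratio(results))
```

Here only `model_a` and `model_b` fall within the accuracy tolerance, so the ratio is computed from that pair alone. `model_c` uses fewer tokens but is excluded because its accuracy is not comparable.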
Abstract
Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: the efficiency of token usage. In practice, token efficiency is highly variable. Models solving the same problem with similar accuracy can exhibit up to a 5.0× difference in token length, leading to a massive gap in model reasoning ability. Such variance exposes significant redundancy, highlighting the critical need for a standardized benchmark to quantify the gap in token efficiency. Thus, we introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Materials Science
