EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models

Junquan Huang; Haotian Wu; Yubo Gao; Yibo Yan; Junyan Zhang; Yonghua Hei; Song Dai; Jie Zhang; Puay Siew Tan; Xuming Hu

arXiv:2511.10201·cs.CL·November 14, 2025

TL;DR

EffiReason-Bench introduces a comprehensive benchmark for evaluating efficient reasoning in large language models, enabling standardized, cross-paradigm assessment of various methods across multiple datasets and model scales.

Contribution

The paper presents a unified benchmark with verified reasoning annotations and a new E3-Score metric for fair, stable comparison of efficient reasoning methods in LLMs.

Findings

1. No single method dominates across all tasks and models.
2. Optimal reasoning strategies vary with model size and task complexity.
3. The E3-Score provides a reliable evaluation metric for efficiency and accuracy trade-offs.

Abstract

Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired…

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI)