Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar

TL;DR
This paper introduces reusability and verifiability as new metrics to evaluate the quality of Chain-of-Thought reasoning in multi-agent LLM systems, revealing insights beyond traditional accuracy measures.
Contribution
It proposes a Thinker-Executor framework and novel evaluation metrics, highlighting limitations of current accuracy-based assessments for reasoning quality.
Findings
Reusability and verifiability do not correlate with accuracy.
Specialized reasoning models do not outperform general-purpose LLMs in these metrics.
Current leaderboards may overlook reasoning quality aspects.
Abstract
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning with each other in the form of Chain-of-Thought (CoT). Current CoT evaluation focuses narrowly on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current…
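The verifiability measure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Executor` type, the `normalize` helper, the exact-match comparison, and the pooling over all (problem, Executor) pairs are assumptions made for illustration, and the paper's precise definition may differ. Reusability is omitted because the abstract does not specify how "ease of reuse" is quantified.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: an Executor maps (question, cot) to an answer string.
Executor = Callable[[str, str], str]
# Hypothetical record: (question, Thinker's CoT, Thinker's final answer).
Record = Tuple[str, str, str]


def normalize(answer: str) -> str:
    """Assumed answer normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def verifiability(records: Sequence[Record], executors: Sequence[Executor]) -> float:
    """Fraction of (problem, Executor) pairs in which the Executor,
    given the Thinker's CoT, reproduces the Thinker's answer."""
    matches = 0
    total = 0
    for question, cot, thinker_answer in records:
        for executor in executors:
            total += 1
            if normalize(executor(question, cot)) == normalize(thinker_answer):
                matches += 1
    return matches / total if total else 0.0


# Toy stand-ins for LLM Executors (illustrative only).
def follower(question: str, cot: str) -> str:
    # Follows the CoT: returns its final token as the answer.
    return cot.split()[-1]


def stubborn(question: str, cot: str) -> str:
    # Ignores the CoT entirely and always answers "0".
    return "0"


records = [
    ("2+2?", "add two and two to get 4", "4"),
    ("3+3?", "add three and three to get 6", "6"),
]
score = verifiability(records, [follower, stubborn])
```

In this toy setup the committee has two Executors rather than ten, and `score` pools all pairs into a single rate. Note that the comparison is against the Thinker's answer, not the ground truth, so a CoT can be highly verifiable even when the Thinker is wrong, which is consistent with the reported lack of correlation with accuracy.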
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Decision-Making and Behavioral Economics · Ethics and Social Impacts of AI
