TL;DR
This paper introduces CASPO, a framework that improves reasoning reliability in large language models by aligning token confidence with logical correctness, enabling more efficient and trustworthy reasoning.
Contribution
CASPO calibrates token-level confidence against step-wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model, and improves reasoning reliability and inference efficiency across ten benchmarks.
Findings
CASPO consistently improves reasoning reliability and inference efficiency across ten benchmarks and multiple model families.
It scales to Qwen3-8B-Base and outperforms tree-search baselines.
Code and a confidence-annotated dataset are publicly released.
Abstract
Large reasoning models often reach correct answers through flawed intermediate steps, creating a gap between final accuracy and reasoning reliability. Existing alignment strategies address this with external verifiers or massive sampling, limiting scalability. In this work, we introduce CASPO (Confidence-Aware Step-wise Preference Optimization), a framework that aligns token-level confidence with step-wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model. During inference, we propose Confidence-aware Thought (CaT), which leverages this calibrated confidence to dynamically prune uncertain reasoning branches with negligible O(V) latency. Experiments across ten benchmarks and multiple model families show that CASPO consistently improves reasoning reliability and inference efficiency. CASPO scales to Qwen3-8B-Base and surpasses…
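To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. It rests on several assumptions that this summary does not confirm: token confidence is taken to be the maximum softmax probability (one pass over the V vocabulary logits, hence O(V)), step confidence is the mean over a step's tokens, and pruning uses a fixed threshold. The function names, the threshold value, and the toy logits are all hypothetical. The `dpo_loss` function is the standard Direct Preference Optimization objective for a single preference pair; how CASPO builds its step-wise pairs is not described here.

```python
import math
from typing import Dict, List, Sequence


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def token_confidence(logits: Sequence[float]) -> float:
    """ASSUMPTION: confidence = max softmax probability, one O(V) pass."""
    top = max(logits)
    denom = sum(math.exp(x - top) for x in logits)
    return 1.0 / denom  # exp(top - top) / denom


def step_confidence(step_logits: Sequence[Sequence[float]]) -> float:
    """ASSUMPTION: a step's confidence is the mean of its token confidences."""
    return sum(token_confidence(l) for l in step_logits) / len(step_logits)


def prune_branches(branches: Dict[str, List[List[float]]],
                   threshold: float = 0.6) -> List[str]:
    """Keep only branches whose step confidence reaches the (hypothetical) threshold."""
    return [name for name, step in branches.items()
            if step_confidence(step) >= threshold]


# Toy example: branch "a" has peaked distributions, branch "b" is near-uniform.
toy_branches = {
    "a": [[5.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
    "b": [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]],
}
kept = prune_branches(toy_branches, threshold=0.6)
```

In this toy setup only the peaked branch survives pruning. A real system would read logits from the model's forward pass rather than from hand-written lists, and the threshold would need tuning per model and task.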
