Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs
Maria-Florina Balcan, Avrim Blum, Kiriaki Fragkia, Zhiyuan Li, Dravyansh Sharma

TL;DR
This paper introduces an online learning framework for chain-of-thought verifiers in large language models, addressing the trade-offs between soundness and completeness to improve reasoning accuracy and safety.
Contribution
The paper develops theoretical mistake bounds and algorithms for learning verifiers that balance soundness and completeness errors, with the aim of improving both reasoning verification and generator performance.
Findings
Proposes a new online learning framework for verifiers.
Provides optimal algorithms for mistake trade-offs.
Demonstrates improved reasoning accuracy with learned verifiers.
Abstract
Large Language Models (LLMs) with chain-of-thought generation have demonstrated great potential for solving complex reasoning and planning tasks. However, the output of current LLMs is not fully reliable and needs careful verification. Even if LLMs get more accurate over time, learned verifiers can help increase trust, enforce safety constraints, and ensure alignment with personal preferences. A major challenge in learning verifiers, however, especially when their output will be used by the generator to improve its reasoning, is that the feedback loop between generator and verifier may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric role of soundness errors (failure…
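As a rough illustration of the online protocol the abstract describes, the sketch below runs a verifier learner over a stream of (problem, reasoning steps) pairs and counts the two error types separately. It is not the paper's algorithm. It assumes the usual reading of the terms (a soundness mistake is accepting an incorrect solution, a completeness mistake is rejecting a correct one), a finite hypothesis class that contains the true verifier, and a simple conservative rule that accepts only when every surviving hypothesis accepts. All names (`run_conservative_learner`, `hypotheses`, `stream`) and the toy step-count verifiers are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# A trace is (problem, reasoning steps); a verifier returns True to accept it.
Trace = Tuple[str, Tuple[str, ...]]
Verifier = Callable[[Trace], bool]


def run_conservative_learner(
    hypotheses: List[Verifier],
    stream: Iterable[Tuple[Trace, bool]],
) -> Tuple[int, int]:
    """Return (soundness_mistakes, completeness_mistakes) over the stream.

    Assumption: some hypothesis labels every trace correctly (realizable case).
    """
    version_space = list(hypotheses)
    soundness_mistakes = 0      # accepted an incorrect solution
    completeness_mistakes = 0   # rejected a correct solution
    for trace, is_correct in stream:
        # Accept only if every hypothesis still consistent with the past accepts.
        prediction = all(h(trace) for h in version_space)
        if prediction and not is_correct:
            soundness_mistakes += 1
        elif not prediction and is_correct:
            completeness_mistakes += 1
        # Drop hypotheses that disagree with the revealed label.
        version_space = [h for h in version_space if h(trace) == is_correct]
    return soundness_mistakes, completeness_mistakes


# Toy hypothesis class: "accept if the trace has at most k steps", k = 1..5.
hypotheses: List[Verifier] = [
    (lambda trace, k=k: len(trace[1]) <= k) for k in range(1, 6)
]

# Toy stream labelled by the (hypothetical) true verifier with k = 3.
step_counts = [5, 2, 3, 4, 1, 3]
stream = [(("toy problem", ("step",) * n), n <= 3) for n in step_counts]

print(run_conservative_learner(hypotheses, stream))
```

Under the realizability assumption this rule never accepts an incorrect solution, because the true verifier always stays in the version space, and each completeness mistake removes at least one hypothesis, so there are at most `len(hypotheses) - 1` of them. It sits at one extreme of the soundness and completeness trade-off; a less conservative acceptance rule would exchange some soundness mistakes for fewer completeness mistakes.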
