Counterfactual Self-Questioning for Stable Policy Optimization in Language Models
Mandar Parab

TL;DR
The paper introduces Counterfactual Self-Questioning, a novel framework enabling language models to self-improve by generating and evaluating their own critiques, leading to more stable training and better reasoning accuracy without external critics.
Contribution
It presents a new self-questioning approach that allows models to internally generate and assess counterfactual critiques, improving policy optimization and training stability.
Findings
Improves accuracy on mathematical reasoning benchmarks.
Enhances training stability for smaller models.
Enables scalable self-improvement without external critics.
Abstract
Recent work on language model self-improvement shows that models can refine their own reasoning through reflection, verification, debate, or self-generated rewards. However, most existing approaches rely on external critics, learned reward models, or ensemble sampling, which increases complexity and training instability. We propose Counterfactual Self-Questioning, a framework in which a single language model generates and evaluates counterfactual critiques of its own reasoning. The method produces an initial reasoning trace, formulates targeted questions that challenge potential failure points, and generates alternative reasoning trajectories that expose incorrect assumptions or invalid steps. These counterfactual trajectories provide structured relative feedback that can be directly used for policy optimization without auxiliary models. Experiments on multiple mathematical reasoning…
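The loop described in the abstract (initial trace, targeted questions, counterfactual trajectories, relative feedback) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. The names `build_preference_pairs`, `generate`, `ask_questions`, and `self_score` are hypothetical, and the choice to turn the feedback into (preferred, rejected) pairs by comparing self-assigned scores is an assumption; the abstract does not specify how the feedback is computed or which optimization objective consumes it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    text: str
    score: float  # self-assigned score; the scoring rule is an assumption


def build_preference_pairs(
    prompt: str,
    generate: Callable[[str], str],                  # hypothetical: model sampling call
    ask_questions: Callable[[str, str], List[str]],  # hypothetical: self-questioning call
    self_score: Callable[[str, str], float],         # hypothetical: self-evaluation call
) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) pairs from one trace and its counterfactuals."""
    # Step 1: initial reasoning trace.
    initial_text = generate(prompt)
    base = Trajectory(initial_text, self_score(prompt, initial_text))

    pairs: List[Tuple[str, str]] = []
    # Step 2: targeted questions that challenge the trace.
    for question in ask_questions(prompt, initial_text):
        # Step 3: alternative trajectory conditioned on the challenge.
        alt_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{initial_text}\n\nChallenge: {question}"
        )
        alt_text = generate(alt_prompt)
        alt = Trajectory(alt_text, self_score(prompt, alt_text))
        # Step 4: relative feedback; ties carry no preference signal and are skipped.
        if alt.score > base.score:
            pairs.append((alt.text, base.text))
        elif alt.score < base.score:
            pairs.append((base.text, alt.text))
    return pairs


# Toy stand-ins for the single model's three roles (illustrative only).
def toy_generate(p: str) -> str:
    return "answer: 5" if "Challenge" in p else "answer: 4"


def toy_questions(prompt: str, trace: str) -> List[str]:
    return ["Is the addition in step 1 correct?"]


def toy_score(prompt: str, trace: str) -> float:
    return 1.0 if trace.endswith("5") else 0.0


pairs = build_preference_pairs("What is 2 + 3?", toy_generate, toy_questions, toy_score)
```

In this toy run the counterfactual trajectory outscores the initial trace, so it becomes the preferred side of the single pair. In a real setup all three callables would be backed by the same language model, and the resulting pairs would feed whatever preference-based objective is chosen.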
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics
