Counterfactual Self-Questioning for Stable Policy Optimization in Language Models

Mandar Parab

arXiv:2601.00885·cs.AI·January 6, 2026

Open Access

TL;DR

The paper introduces Counterfactual Self-Questioning, a novel framework enabling language models to self-improve by generating and evaluating their own critiques, leading to more stable training and better reasoning accuracy without external critics.

Contribution

The paper presents a self-questioning approach in which a single model generates and assesses counterfactual critiques of its own reasoning, then uses the resulting relative feedback for policy optimization, improving training stability.

Findings

01 Improves accuracy on mathematical reasoning benchmarks.

02 Enhances training stability for smaller models.

03 Enables scalable self-improvement without external critics.

Abstract

Recent work on language model self-improvement shows that models can refine their own reasoning through reflection, verification, debate, or self-generated rewards. However, most existing approaches rely on external critics, learned reward models, or ensemble sampling, which increases complexity and training instability. We propose Counterfactual Self-Questioning, a framework in which a single language model generates and evaluates counterfactual critiques of its own reasoning. The method produces an initial reasoning trace, formulates targeted questions that challenge potential failure points, and generates alternative reasoning trajectories that expose incorrect assumptions or invalid steps. These counterfactual trajectories provide structured relative feedback that can be directly used for policy optimization without auxiliary models. Experiments on multiple mathematical reasoning…
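The loop described in the abstract (initial trace, targeted questions, counterfactual trajectories, relative feedback) can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the `generate` and `score` callables, the prompt wording, the `PreferencePair` type, and the rule of preferring whichever trajectory the model scores higher are all assumptions introduced here. The abstract does not say how the relative feedback is turned into a training signal, so the sketch stops at collecting preference pairs.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: the single language model, as prompt -> completion.
Generate = Callable[[str], str]
# Hypothetical self-evaluation by the same model: (problem, trajectory) -> score.
Score = Callable[[str, str], float]


@dataclass
class PreferencePair:
    """One unit of relative feedback: a preferred and a rejected trajectory."""
    problem: str
    preferred: str
    rejected: str


def self_question_round(
    problem: str,
    generate: Generate,
    score: Score,
    num_questions: int = 2,
) -> List[PreferencePair]:
    """Collect preference pairs from one round of counterfactual self-questioning."""
    # Step 1: produce an initial reasoning trace.
    trace = generate(f"Solve step by step:\n{problem}")

    pairs: List[PreferencePair] = []
    for i in range(num_questions):
        # Step 2: formulate a question that challenges a potential failure point.
        question = generate(
            f"Problem:\n{problem}\nReasoning:\n{trace}\n"
            f"Ask question #{i + 1} that challenges a possible failure point."
        )
        # Step 3: generate an alternative (counterfactual) trajectory.
        alternative = generate(
            f"Problem:\n{problem}\nReasoning:\n{trace}\nQuestion:\n{question}\n"
            "Rewrite the reasoning, assuming the questioned step is wrong."
        )
        # Step 4: compare the two trajectories (assumed rule: higher score wins).
        s_trace = score(problem, trace)
        s_alt = score(problem, alternative)
        if s_trace == s_alt:
            continue  # a tie carries no relative signal, so skip it
        if s_trace > s_alt:
            pairs.append(PreferencePair(problem, trace, alternative))
        else:
            pairs.append(PreferencePair(problem, alternative, trace))
    return pairs
```

Pairs of this shape could then be passed to any preference-based policy optimization objective; which objective the paper actually uses is not stated in the text above.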

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics