Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Germán Kruszewski; Pierre Erbacher; Jos Rozen; Marc Dymetman

arXiv:2512.05962·cs.LG·March 9, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a filtering-based training method for large language models that balances reasoning accuracy and diversity by controlling the divergence used during training, outperforming reinforcement learning approaches.

Contribution

It proposes a novel approach that starts from an explicit target distribution and uses the $\alpha$-divergence to balance precision and diversity in reasoning tasks.

Findings

01

Achieves state-of-the-art coverage-precision trade-off on theorem proving benchmark.

02

Outperforms prior methods in coverage while maintaining high precision.

03

Demonstrates the effectiveness of filtering and divergence control in LLM training.

Abstract

Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in such a way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $\alpha$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and…
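The filtering step and the role of $\alpha$ can be illustrated on a toy discrete answer space. The sketch below is not the authors' implementation: the answer names, the probabilities, and the helper functions `filtered_target`, `kl`, and `alpha_divergence` are hypothetical, and the $\alpha$-divergence uses one common convention, $D_\alpha(p\|q) = \frac{1 - \sum_y p(y)^\alpha q(y)^{1-\alpha}}{\alpha(1-\alpha)}$, which may differ from the parameterization used in the paper. Under this assumed convention, $\alpha \to 1$ approaches the forward KL (mass-covering) and $\alpha \to 0$ approaches the reverse KL (mode-seeking).

```python
import math


def filtered_target(base_probs, is_correct):
    """Zero out incorrect answers and renormalize, so that correct answers
    keep their relative probabilities under the base model."""
    kept = {y: p for y, p in base_probs.items() if is_correct[y]}
    z = sum(kept.values())
    if z == 0:
        raise ValueError("base model puts no mass on any correct answer")
    return {y: kept.get(y, 0.0) / z for y in base_probs}


def kl(p, q):
    """KL(p || q), with the convention 0 * log(0 / q) = 0."""
    total = 0.0
    for y, py in p.items():
        if py > 0:
            if q[y] == 0:
                return math.inf
            total += py * math.log(py / q[y])
    return total


def alpha_divergence(p, q, alpha):
    """Alpha-divergence for 0 < alpha < 1 (assumed convention, see lead-in)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only handles 0 < alpha < 1")
    s = sum(p[y] ** alpha * q[y] ** (1.0 - alpha) for y in p)
    return (1.0 - s) / (alpha * (1.0 - alpha))


# Hypothetical base model over three candidate answers (illustrative numbers).
base = {"proof_a": 0.6, "proof_b": 0.2, "wrong": 0.2}
correct = {"proof_a": True, "proof_b": True, "wrong": False}
target = filtered_target(base, correct)

# Two hypothetical policies: one collapsed onto a single proof, one spread out.
collapsed = {"proof_a": 0.99, "proof_b": 0.005, "wrong": 0.005}
spread = dict(base)

print("target:", target)
print("forward KL(target || spread):", kl(target, spread))
for a in (0.1, 0.9):
    print(
        f"alpha={a}: collapsed={alpha_divergence(target, collapsed, a):.3f}, "
        f"spread={alpha_divergence(target, spread, a):.3f}"
    )
```

In this toy setting the collapsed policy is closer to the target at small $\alpha$, while the spread policy is closer at large $\alpha$, which is the kind of precision-diversity interpolation the abstract describes.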

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Paper is well-written and easy to follow, with enough background explanation
- A simple approach with meaningful performance gain
- Extensive ablations

Weaknesses

- Novelty-wise, at the end of the day, the methodology is precisely f-DPG of Go et al. (2023), with f-divergence replaced with Amari's $\alpha$-divergence. Also, the target distribution $p_c$ is taken from Khalifa et al. (2021). But, I also concur that the simplicity of the proposed method overshadows the "lack" of novelty.
- The writing can be made a bit clearer. As this is a mixture of prior works, a clearer separation of the authors' contributions and prior works in Section 3 would be helpful

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The theoretical analysis is rigorous, clearly attributing the diversity loss in RL-tuned LLMs to the mode-seeking behavior induced by the Reverse KL divergence.
2. The paper is well-structured and addresses a meaningful research question, highlighting the significance of the accuracy–diversity trade-off in RL-aligned LLMs.

Weaknesses

The main weakness lies in the experimental evaluation.
1. The authors focus on RL algorithms in LLMs, yet RL-specific experiments are limited. It is essential to include baselines covering algorithms such as PPO and REINFORCE.
2. Additionally, evaluations are confined to Lean. It would be valuable to incorporate 1–2 additional tasks (e.g., code generation or natural language mathematical reasoning) to assess the framework’s general applicability beyond formal theorem proving.
3. Why doesn't L

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper formalizes the filtered target distribution, $p_c$, and proves that maximizing the usual RLVR objective equals minimizing the reverse KL to a softened filter. This clarifies why RLVR becomes mode seeking and why diversity collapses in practice. The argument is simple but convincing.
2. The method unifies rejection sampling fine-tuning at one end and RLVR-style training at the other end, and allows smooth interpolation between mass covering and mode-seeking regimes. This aligns with pri

Weaknesses

1. Evaluation is quite narrow. All main experiments are with a single base model family and a 7B scale, which is my biggest concern.
2. The GRPO baselines use β=0 by default and add a single high KL setting. However, recent works show that reward shaping like rewarding the unlikely or optimizing pass@k can rescue coverage. A stronger baseline suite with these variants tuned as carefully as Amari DPG would make the frontier result harder to question.

[1] Rewarding the Unlikely: Lifting GRPO Beyon

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)