SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets
Kshitij Mishra, Nils Lukas, and Salem Lahlou

TL;DR
SD-E$^2$ is a reinforcement learning framework that enhances small language models' reasoning by explicitly optimizing semantic diversity in generated solutions, leading to improved accuracy and strategy variety under limited compute budgets.
Contribution
The paper introduces SD-E$^2$, a novel semantic diversity exploration method that improves reasoning in small language models by explicitly rewarding semantic novelty during training.
Findings
SD-E$^2$ outperforms baseline models on GSM8K, MedMCQA, and AIME benchmarks.
Models trained with SD-E$^2$ discover more semantically distinct strategies.
Semantic diversity reward improves reasoning accuracy and efficiency.
Abstract
Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective objective that stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points,…
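To make the two ingredients of the diversity reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes each sampled response has already been split into strategies and embedded by some frozen sentence-embedding model. The similarity threshold, the equal mixing of coverage and dissimilarity, the greedy coverage count, and the weights in `combined_rewards` are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def diversity_reward(embeddings, sim_threshold=0.8):
    """Illustrative diversity score for the strategies in one response.

    ASSUMPTION: coverage is a greedy count of strategies whose similarity to
    every previously kept strategy is below `sim_threshold`, and the two
    terms are mixed 50/50. Neither choice comes from the paper.
    """
    n = len(embeddings)
    if n < 2:
        return 0.0

    # (i) Coverage: fraction of strategies that are semantically distinct.
    representatives = []
    for e in embeddings:
        if all(cosine(e, r) < sim_threshold for r in representatives):
            representatives.append(e)
    coverage = len(representatives) / n

    # (ii) Average pairwise dissimilarity (1 - cosine similarity).
    dissimilarity = mean(
        1.0 - cosine(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return 0.5 * coverage + 0.5 * dissimilarity


def zscore(values, eps=1e-8):
    """Standardize a batch of scalar rewards to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_rewards(correct, efficiency, diversity, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of z-scored reward components (weights are hypothetical)."""
    zc, ze, zd = zscore(correct), zscore(efficiency), zscore(diversity)
    wc, we, wd = weights
    return [wc * c + we * e + wd * d for c, e, d in zip(zc, ze, zd)]
```

Z-scoring each component within a batch puts correctness, efficiency, and diversity on a comparable scale before they are summed, which is one plausible reading of why the abstract says the normalization stabilizes training.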
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education
