SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets

Kshitij Mishra, Nils Lukas, and Salem Lahlou

arXiv:2601.17982 · cs.CL · January 27, 2026

TL;DR

SD-E$^2$ is a reinforcement learning framework that enhances small language models' reasoning by explicitly optimizing semantic diversity in generated solutions, leading to improved accuracy and strategy variety under limited compute budgets.

Contribution

The paper introduces SD-E$^2$, a novel semantic diversity exploration method that improves reasoning in small language models by explicitly rewarding semantic novelty during training.

Findings

01. SD-E$^2$ outperforms baseline models on GSM8K, MedMCQA, and AIME benchmarks.

02. Models trained with SD-E$^2$ discover more semantically distinct strategies.

03. Semantic diversity reward improves reasoning accuracy and efficiency.

Abstract

Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective objective that stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points,…
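The reward structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the similarity threshold of 0.8, the equal mixing of coverage and dissimilarity, and the objective weights are all hypothetical choices made for illustration, and the embeddings are assumed to be precomputed vectors from some frozen sentence-embedding model. The sketch scores one sample's set of strategy embeddings for coverage and average pairwise cosine dissimilarity, then z-score-normalizes each objective across a batch before summing.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_reward(embeddings, threshold=0.8):
    """Hypothetical diversity score for one sample's strategy embeddings.

    Mixes (i) coverage, the fraction of strategies that are semantically
    distinct from all previously kept ones, with (ii) the mean pairwise
    cosine dissimilarity. The threshold and the 50/50 mix are assumptions.
    """
    n = len(embeddings)
    if n < 2:
        return 0.0
    # (i) Coverage: greedily keep a strategy only if it is not too
    # similar to any representative kept so far.
    representatives = []
    for e in embeddings:
        if all(cosine(e, r) < threshold for r in representatives):
            representatives.append(e)
    coverage = len(representatives) / n
    # (ii) Average pairwise dissimilarity in embedding space.
    dissimilarities = [
        1.0 - cosine(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return 0.5 * coverage + 0.5 * mean(dissimilarities)


def zscores(values):
    """Z-score-normalize a batch of scalars; all zeros if there is no spread."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def combined_rewards(correctness, efficiency, diversity, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of per-objective z-scores across a batch (weights assumed)."""
    w_c, w_e, w_d = weights
    z_c, z_e, z_d = zscores(correctness), zscores(efficiency), zscores(diversity)
    return [w_c * c + w_e * e + w_d * d for c, e, d in zip(z_c, z_e, z_d)]


# Usage: two samples, each with two toy 2-D "strategy embeddings".
batch_embeddings = [
    [[1.0, 0.0], [0.0, 1.0]],  # two orthogonal strategies
    [[1.0, 0.0], [1.0, 0.0]],  # the same strategy twice
]
diversity = [diversity_reward(e) for e in batch_embeddings]
rewards = combined_rewards([1.0, 1.0], [0.6, 0.4], diversity)
```

Normalizing each objective separately keeps any one term from dominating the sum merely because of its scale, which is one plausible reading of why the abstract says the normalization stabilizes training.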

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education