SuS: Strategy-aware Surprise for Intrinsic Exploration
Mark Kashirskiy, Ilya Makarov

TL;DR
SuS introduces a novel intrinsic motivation framework for reinforcement learning that combines strategy stability and surprise signals to enhance exploration, especially in mathematical reasoning tasks with large language models.
Contribution
The paper presents Strategy-aware Surprise (SuS), an approach that integrates strategy stability and surprise signals to improve exploration in RL. It is validated on mathematical reasoning tasks with large language models, where it reports improvements of 17.4% in Pass@1 and 26.4% in Pass@5.
Findings
Achieves 17.4% improvement in Pass@1
Achieves 26.4% improvement in Pass@5
Maintains higher strategy diversity during training
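For readers unfamiliar with the Pass@1 and Pass@5 metrics named above, the sketch below shows the widely used unbiased Pass@k estimator: given n sampled solutions per problem, of which c are correct, it estimates the probability that at least one of k samples is correct. This summary does not say which estimator the authors used, so treat this as an assumption; the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of solutions sampled for one problem
    c: number of those solutions that are correct
    k: sample budget being evaluated (for example 1 or 5)

    Assumption: this is the standard estimator, not necessarily
    the exact evaluation protocol of the SuS paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the call.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score is normally the mean of this value over all problems. Whether a reported "improvement" is relative or absolute depends on the paper's own convention, which this summary does not state.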
Abstract
We propose Strategy-aware Surprise (SuS), a novel intrinsic motivation framework that uses pre-post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity-driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent's current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation,…
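The abstract describes a reward that combines a stability signal and a surprise signal through weighting coefficients, but the excerpt here gives no formulas. The following is a minimal sketch under stated assumptions, not the authors' implementation: stability is taken to be the cosine similarity between strategy embeddings before and after a step, surprise is taken to be the mean squared mismatch between a predicted and an observed outcome vector, and the weights are passed in as plain floats rather than learned. All function and parameter names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def strategy_stability(pre_strategy, post_strategy):
    # ASSUMPTION: stability = similarity of the strategy embedding
    # before and after a step. The paper's definition may differ.
    return cosine_similarity(pre_strategy, post_strategy)


def strategy_surprise(predicted_outcome, observed_outcome):
    # ASSUMPTION: surprise = mean squared pre-post prediction mismatch.
    n = len(predicted_outcome)
    return sum((p - o) ** 2
               for p, o in zip(predicted_outcome, observed_outcome)) / n


def combined_reward(extrinsic, stability, surprise, w_ss, w_sus):
    # ASSUMPTION: a linear combination; in the paper the weighting
    # coefficients are learned, here they are fixed inputs.
    return extrinsic + w_ss * stability + w_sus * surprise


if __name__ == "__main__":
    # Hypothetical embeddings and outcomes, for illustration only.
    ss = strategy_stability([1.0, 0.0], [0.8, 0.6])
    sus = strategy_surprise([0.2, 0.9], [0.0, 1.0])
    print(combined_reward(extrinsic=1.0, stability=ss, surprise=sus,
                          w_ss=0.5, w_sus=0.25))
```

Keeping the two signals as separate functions mirrors the ablation described in the abstract: setting `w_ss` or `w_sus` to zero removes one component while leaving the rest of the reward untouched.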
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Advanced Bandit Algorithms Research
