SuS: Strategy-aware Surprise for Intrinsic Exploration

Mark Kashirskiy; Ilya Makarov

arXiv:2601.10349·cs.LG·January 16, 2026

TL;DR

SuS introduces a novel intrinsic motivation framework for reinforcement learning that combines strategy stability and surprise signals to enhance exploration, especially in mathematical reasoning tasks with large language models.

Contribution

The paper presents Strategy-aware Surprise (SuS), an approach that integrates strategy stability and strategy surprise signals to improve exploration in RL. It is validated on mathematical reasoning tasks with large language models, where it improves both Pass@1 and Pass@5.

Findings

01. Achieves 17.4% improvement in Pass@1

02. Achieves 26.4% improvement in Pass@5

03. Maintains higher strategy diversity during training
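
For readers unfamiliar with the metric, Pass@k is commonly estimated from n sampled solutions per problem, c of which are correct, as 1 - C(n-c, k) / C(n, k). The sketch below implements that widely used unbiased estimator. Whether the paper computes Pass@1 and Pass@5 exactly this way is an assumption, and the function name `pass_at_k` is chosen here for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of sampled solutions
    c: number of those samples that are correct
    k: budget of attempts being scored (1 <= k <= n)

    Assumption: this is the common unbiased estimator,
    not necessarily the exact procedure used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset
    # contains at least one correct solution.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset has no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for illustration, not results from the paper.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score is then the mean of this value over all problems.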

Abstract

We propose Strategy-aware Surprise (SuS), a novel intrinsic motivation framework that uses pre-post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity-driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent's current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation,…
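
The abstract does not give the formulas for the two signals, so the following is only a minimal sketch of how a two-component intrinsic reward of this shape could be assembled. Everything in it is an assumption made for illustration: stability is taken as the cosine similarity between strategy embeddings at consecutive steps, surprise as the mean squared error between a predicted and an observed outcome vector, and the weights `w_ss` and `w_sus` are plain floats standing in for the learned coefficients the abstract mentions.

```python
import math
from typing import Sequence


def strategy_stability(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Cosine similarity between two strategy embeddings (assumed form)."""
    if len(prev) != len(curr):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(prev, curr))
    norm_prev = math.sqrt(sum(a * a for a in prev))
    norm_curr = math.sqrt(sum(b * b for b in curr))
    if norm_prev == 0.0 or norm_curr == 0.0:
        return 0.0  # treat a zero embedding as carrying no stability signal
    return dot / (norm_prev * norm_curr)


def strategy_surprise(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean squared pre/post prediction mismatch (assumed form)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def combined_reward(
    extrinsic: float,
    stability: float,
    surprise: float,
    w_ss: float,
    w_sus: float,
) -> float:
    """Extrinsic reward plus a weighted sum of the two intrinsic signals.

    w_ss and w_sus are hypothetical stand-ins for learned weights.
    """
    return extrinsic + w_ss * stability + w_sus * surprise


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    ss = strategy_stability([1.0, 0.0], [1.0, 0.0])
    sus = strategy_surprise([0.0, 0.0], [1.0, 1.0])
    print(combined_reward(extrinsic=1.0, stability=ss, surprise=sus, w_ss=0.5, w_sus=0.25))
```

In a training loop the two weights would be parameters updated alongside the policy; they are held fixed here to keep the sketch self-contained.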

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Advanced Bandit Algorithms Research