Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

Oluseyi Olukola; Nick Rahimi

arXiv:2604.04237·cs.AI·April 7, 2026

Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

Oluseyi Olukola, Nick Rahimi

PDF

TL;DR

This paper formalizes pedagogical safety in educational reinforcement learning, introduces a safety framework and severity index, and evaluates methods to reduce reward hacking in AI tutoring systems through simulations.

Contribution

It proposes a four-layer pedagogical safety model and the RHSI metric, and demonstrates how different safety strategies impact reward hacking in simulated educational RL environments.

Findings

01

Engagement-optimized agents over-selected high-engagement actions without learning gains.

02

Multi-objective reward formulations reduced reward hacking but did not eliminate it.

03

Constrained architectures with prerequisite enforcement significantly lowered reward hacking.

Abstract

Reinforcement learning (RL) is increasingly used to personalize instruction in intelligent tutoring systems, yet the field lacks a formal framework for defining and evaluating pedagogical safety. We introduce a four-layer model of pedagogical safety for educational RL comprising structural, progress, behavioral, and alignment safety and propose the Reward Hacking Severity Index (RHSI) to quantify misalignment between proxy rewards and genuine learning. We evaluate the framework in a controlled simulation of an AI tutoring environment with 120 sessions across four conditions and three learner profiles, totaling 18{,}000 interactions. Results show that an engagement-optimized agent systematically over-selected a high-engagement action with no direct mastery gain, producing strong measured performance but limited learning progress. A multi-objective reward formulation reduced this…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.