MC-CPO: Mastery-Conditioned Constrained Policy Optimization
Oluseyi Olukola, Nick Rahimi

TL;DR
This paper introduces MC-CPO, a reinforcement learning algorithm that incorporates pedagogical safety constraints to reduce reward hacking in adaptive tutoring systems, with the aim of safer and more effective learning outcomes.
Contribution
The paper presents a novel mastery-conditioned constrained policy optimization algorithm that embeds pedagogical structure into the feasible action space for safer reinforcement learning in education.
Findings
MC-CPO satisfies safety constraints within tolerance across experiments.
It reduces safety costs compared to baseline methods.
It significantly lowers the Reward Hacking Severity Index (RHSI).
Abstract
Engagement-optimized adaptive tutoring systems may prioritize short-term behavioral signals over sustained learning outcomes, creating structural incentives for reward hacking in reinforcement learning policies. We formalize this challenge as a constrained Markov decision process (CMDP) with mastery-conditioned feasibility, in which pedagogical safety constraints dynamically restrict admissible actions according to learner mastery and prerequisite structure. We introduce Mastery-Conditioned Constrained Policy Optimization (MC-CPO), a two-timescale primal-dual algorithm that integrates structural action masking with constrained policy optimization. In the tabular regime, we establish feasibility preservation and convergence to stationary feasible points under standard stochastic approximation conditions and derive a safety gap result showing that optimization within the…
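The combination of structural action masking and a primal-dual update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the mastery threshold of 0.6, the prerequisite encoding, and the step sizes are all hypothetical, chosen only to show the general shape of the technique: infeasible actions receive zero probability, the policy ascends a Lagrangian reward, and a non-negative multiplier rises when the observed safety cost exceeds a budget.

```python
import numpy as np

def feasible_mask(mastery, prereqs, threshold=0.6):
    """Hypothetical feasibility rule: an action is admissible only if
    every prerequisite skill has mastery >= threshold."""
    # prereqs[a] lists the skill indices required before action a
    return np.array(
        [all(mastery[p] >= threshold for p in reqs) for reqs in prereqs],
        dtype=bool,
    )

def masked_softmax(logits, mask):
    """Softmax restricted to feasible actions; masked actions get probability 0.
    Assumes at least one action is feasible."""
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked[mask].max()  # numerical stability
    probs = np.exp(masked)
    return probs / probs.sum()

def primal_step(logits, mask, action, reward, cost, lam, lr_primal):
    """One policy-gradient step on the Lagrangian reward r - lam * c."""
    probs = masked_softmax(logits, mask)
    grad_logp = -probs              # gradient of log pi(action) w.r.t. logits
    grad_logp[action] += 1.0
    return logits + lr_primal * (reward - lam * cost) * grad_logp

def dual_step(lam, avg_cost, budget, lr_dual):
    """Projected ascent on the multiplier; lam stays non-negative."""
    return max(0.0, lam + lr_dual * (avg_cost - budget))

# Usage: skill 0 is mastered, skill 1 is not, so action 2 is masked out.
mastery = np.array([0.9, 0.2])
prereqs = [[], [0], [1]]
mask = feasible_mask(mastery, prereqs)
probs = masked_softmax(np.zeros(3), mask)
```

In a two-timescale scheme of this kind, the multiplier step size (`lr_dual`) is typically taken smaller than the policy step size (`lr_primal`) so that the policy adapts faster than the multiplier; that ordering is an assumption here, since the truncated abstract does not state which update runs on which timescale.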
