On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation
Alexander Galozy

TL;DR
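This paper studies reinforcement learning policies that condition their actions on internal information, such as memory or inferred latent context, and asks whether that conditioning survives policy transformations. It shows that such behaviour is not preserved under convex aggregation and that behavioural distance contracts under convex combination.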
Contribution
It formalises behavioural dependency and behavioural distance, proves non-closure under convex aggregation and contraction under convex combination, and gives a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance.
Findings
The set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation.
Behavioural distance contracts under convex combination.
Gradient ascent on a skewed mixture objective can decrease behavioural distance under a sufficient local condition.
Abstract
Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context. We formalise such information-conditioned interaction patterns as behavioural dependency: variation in action selection with respect to internal information under fixed observations. This induces a probe-relative notion of -behavioural equivalence and a within-policy behavioural distance that quantifies probe sensitivity. We establish three structural results. First, the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation. Second, behavioural distance contracts under convex combination. Third, we prove a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance when a dominant-mode gradient aligns with the…
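The first two structural results can be illustrated with a small numerical sketch. The abstract does not give the paper's formal definitions, so everything below is an assumption made for illustration: a tabular policy stored as an array `policy[o, z, a]` (observation `o`, internal state `z`, action `a`), convex aggregation taken as mixing action probabilities, and behavioural distance taken as the largest total-variation gap between action distributions at a fixed observation across internal states. The names `behavioural_distance` and `mix` are hypothetical.

```python
import numpy as np


def behavioural_distance(policy):
    """Assumed stand-in for within-policy behavioural distance.

    policy has shape (n_obs, n_internal, n_actions) and each policy[o, z]
    is a probability vector over actions. The value returned is the largest
    total-variation distance between policy[o, z] and policy[o, z'] over all
    observations o and pairs of internal states (z, z').
    """
    # Pairwise differences between internal states, for every observation.
    diffs = policy[:, :, None, :] - policy[:, None, :, :]
    tv = 0.5 * np.abs(diffs).sum(axis=-1)
    return float(tv.max())


def mix(p1, p2, lam):
    """Convex combination of two policies at the action-probability level."""
    return lam * p1 + (1.0 - lam) * p2


# One observation, two internal states, two actions.
# pi_a prefers action 0 when z = 0; pi_b has the opposite preference.
pi_a = np.array([[[0.8, 0.2], [0.2, 0.8]]])
pi_b = np.array([[[0.2, 0.8], [0.8, 0.2]]])

d_a = behavioural_distance(pi_a)
d_b = behavioural_distance(pi_b)

# Equal-weight aggregation: both internal states now give the same
# action distribution, so the dependency on z disappears.
d_mix = behavioural_distance(mix(pi_a, pi_b, 0.5))

# Under this assumed distance, the mixture never exceeds the weighted
# average of the endpoint distances.
for lam in np.linspace(0.0, 1.0, 11):
    bound = lam * d_a + (1.0 - lam) * d_b
    assert behavioural_distance(mix(pi_a, pi_b, lam)) <= bound + 1e-12

print(d_a, d_b, d_mix)
```

In this toy case both endpoint policies depend on the internal state, yet their equal-weight mixture does not, which is one way the non-closure result can arise. The inequality checked in the loop follows from the convexity of total variation, so it holds for this assumed distance regardless of the paper's exact definition.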
Taxonomy
Topics: Reinforcement Learning in Robotics · Game Theory and Applications · Opinion Dynamics and Social Influence
