Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning
Enze Pan

TL;DR
Tape is a benchmark designed to evaluate reinforcement learning algorithms' ability to generalize across latent rule-shifts in dynamics, providing a controlled environment to diagnose robustness and brittleness.
Contribution
This paper introduces Tape, a novel controlled benchmark isolating latent rule-shift in dynamics for evaluating RL generalization and robustness.
Findings
RL algorithms show a consistent drop in performance from in-distribution (ID) to out-of-distribution (OOD) rule settings.
Fragility to latent-law changes exists even in simple deterministic 1D environments.
Tape enables detailed diagnostics of policy robustness and adaptation to rule shifts.
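To make the idea of a latent rule-shift concrete, here is a minimal sketch. It assumes a binary elementary cellular automaton on a ring, indexed by Wolfram rule number; the summary above does not specify Tape's actual state space, update rule, or interface, so `eca_step` and the rule numbers 90 and 30 are purely illustrative. The point is that the observation format (a list of bits) stays the same while the hidden transition law changes.

```python
def eca_step(cells, rule):
    """Apply one synchronous update of an elementary cellular automaton.

    cells: list of 0/1 values on a ring (periodic boundary, an assumption).
    rule:  Wolfram rule number in [0, 255]; this is the latent "law".
    """
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        idx = (left << 2) | (center << 1) | right  # neighborhood as a 3-bit index
        out.append((rule >> idx) & 1)              # look up that bit of the rule number
    return out


# Same observation, two different hidden rules (illustrative ID vs. OOD choice).
start = [0, 0, 0, 1, 0, 0, 0]
id_next = eca_step(start, 90)   # hypothetical "training" rule
ood_next = eca_step(start, 30)  # hypothetical "shifted" rule

print(id_next)   # [0, 0, 1, 0, 1, 0, 0]
print(ood_next)  # [0, 0, 1, 1, 1, 0, 0]
```

A policy that has implicitly memorized the first transition law sees an identical input but faces a different next state under the second, which is the kind of shift the findings describe.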
Abstract
Out-of-distribution generalization in reinforcement learning is hard to diagnose when benchmark shifts mix dynamics, observations, goals, and rewards. We address this with Tape, a controlled benchmark that isolates latent rule-shift in dynamics while keeping the observation-action interface fixed. The protocol combines deterministic splits, 20-seed replication, bootstrap uncertainty reporting, and continuous metrics for sparse-success regimes. Across baseline families, we find a consistent ID-to-OOD drop and strong heterogeneity across stable/periodic/chaotic rules. Importantly, this fragility appears even in an intentionally simple 1D deterministic setting, suggesting that many current RL algorithms remain brittle to latent-law changes under minimal confounds. To calibrate strict success, we report a protocol-matched true-dynamics random-shooting reference (p_oracle ≈ 0.187)…
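The bootstrap uncertainty reporting over 20 seeds mentioned in the abstract could look roughly like the sketch below. It is an assumption-laden illustration, not the paper's protocol: it uses a percentile bootstrap over the per-seed ID-minus-OOD gap, pairs scores by seed, and fills in randomly generated placeholder scores that are not results from the paper. The name `bootstrap_ci` and its parameters are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample seeds with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lo, hi


# Placeholder per-seed scores (synthetic, NOT taken from the paper).
placeholder_rng = random.Random(123)
id_scores = [placeholder_rng.uniform(0.6, 0.9) for _ in range(20)]
ood_scores = [placeholder_rng.uniform(0.2, 0.6) for _ in range(20)]

# Pair ID and OOD scores by seed (an assumption) and bootstrap the gap.
gaps = [i - o for i, o in zip(id_scores, ood_scores)]
mean_gap, lo, hi = bootstrap_ci(gaps)
print(f"ID-OOD gap: {mean_gap:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Resampling whole seeds, rather than individual episodes, keeps the interval honest about run-to-run variance, which is usually the dominant source of noise in RL comparisons.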
