Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models

Pranaya Jajoo; Harshit Sikchi; Siddhant Agarwal; Amy Zhang; Scott Niekum; Martha White

arXiv:2603.15857 · cs.AI · March 18, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces RLDP, a simple regularization method for latent state prediction that maintains feature diversity and outperforms complex methods in zero-shot reinforcement learning, especially in low-coverage scenarios.

Contribution

RLDP demonstrates that a straightforward orthogonality regularization can match or surpass complex representation learning methods for zero-shot RL.
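
As a rough illustration of what an orthogonality regularizer can look like, the sketch below penalizes the squared distance between a batch's feature second-moment matrix and the identity. This particular form, and the names `gram` and `orthogonality_penalty`, are assumptions made for illustration; the summary does not state the paper's exact objective.

```python
# Hypothetical sketch: the penalty form and all names are assumptions,
# not taken from the paper.

def gram(features):
    """Second-moment matrix (1/n) * sum_i phi_i phi_i^T of a feature batch."""
    n, d = len(features), len(features[0])
    return [[sum(f[i] * f[j] for f in features) / n for j in range(d)]
            for i in range(d)]

def orthogonality_penalty(features):
    """Squared Frobenius distance between the second-moment matrix and I."""
    g = gram(features)
    d = len(g)
    return sum((g[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(d) for j in range(d))

# Collapsed batch: both feature dimensions carry the same signal.
collapsed = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
# Diverse batch: the two dimensions are uncorrelated with unit second moment.
diverse = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

print(orthogonality_penalty(collapsed))  # 2.0
print(orthogonality_penalty(diverse))    # 0.0
```

In this toy batch the redundant features are penalized and the uncorrelated ones are not, so adding such a term to a training loss would discourage feature dimensions from becoming similar.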

Findings

01 · RLDP matches or exceeds state-of-the-art in zero-shot RL.

02 · RLDP performs well in low-coverage data scenarios.

03 · Complex objectives are not necessary for effective latent state prediction.

Abstract

Behavioral Foundation Models (BFMs) produce agents with the capability to adapt to any unknown reward or task. These methods, however, are only able to produce near-optimal policies for the reward functions that are in the span of some pre-existing state features, making the choice of state features crucial to the expressivity of the BFM. As a result, BFMs are trained using a variety of complex objectives and require sufficient dataset coverage to train task-useful spanning features. In this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity, and subsequently reducing span. We propose an approach, Regularized Latent…
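
To make the span limitation concrete, here is a toy sketch that assumes the task vector is obtained by least-squares regression of sampled rewards onto state features, a common choice in successor-feature-style methods. The 2-D features, the reward values, and the helper names `fit_task_vector` and `residual` are hypothetical and are not taken from the paper.

```python
# Hypothetical toy example: features, rewards, and helper names are
# illustrative assumptions, not the paper's procedure.

def fit_task_vector(features, rewards):
    """Least-squares z minimizing sum_i (phi_i . z - r_i)^2 for 2-D features,
    solved in closed form via the 2x2 normal equations."""
    a = sum(f[0] * f[0] for f in features)
    b = sum(f[0] * f[1] for f in features)
    c = sum(f[1] * f[1] for f in features)
    u = sum(f[0] * r for f, r in zip(features, rewards))
    v = sum(f[1] * r for f, r in zip(features, rewards))
    det = a * c - b * b  # zero when the two feature dimensions are collinear
    return [(c * u - b * v) / det, (a * v - b * u) / det]

def residual(features, rewards, z):
    """Sum of squared errors between phi_i . z and the true rewards."""
    return sum((f[0] * z[0] + f[1] * z[1] - r) ** 2
               for f, r in zip(features, rewards))

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # phi(s) for three states

in_span = [2.0, -1.0, 1.0]   # r(s) = 2*phi_0(s) - 1*phi_1(s), exactly linear
out_of_span = [1.0, 1.0, 0.0]  # no z reproduces these three rewards

z_in = fit_task_vector(features, in_span)
z_out = fit_task_vector(features, out_of_span)

print(z_in, residual(features, in_span, z_in))          # exact fit
print(z_out, residual(features, out_of_span, z_out))    # positive residual
```

The first reward is recovered exactly, while the second leaves a positive residual no matter which task vector is chosen. If the two feature dimensions became collinear, `det` would be zero and even fewer rewards would be representable, which is the sense in which higher feature similarity reduces span.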

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- Their proposed method is well-motivated both theoretically and empirically, and the authors present strong experimental evidence for their framework relative to other works in its area.
- The experiments in their paper suggest that their proposed method can produce competitive results in the zero-shot RL setting, all the while being both simple to understand and implement compared to existing methods. Thus, this work is an important entry point for new researchers in the field of BFMs for zer

Weaknesses

- It would be nice to further contextualize RLDP with the existing representation learning methods, by either providing the exact objectives of the baselines used in the paper (Laplace, FB, HILP, PSM) or being more explicit about the different assumptions used. For instance, the authors also suggest that their method is simpler in implementation and lacks some assumptions made by prior methods such as a prior class of policies. Showing the objectives used in the other works would clarify these di

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Comprehensive experimental settings, including both online and offline environments.
2. Theoretical guarantees that connect the latent training objective to successor-measure consistency.
3. Rich intuition and ablation studies that help interpret how RLDP improves representation quality.

Weaknesses

There are no major flaws, but several aspects could be improved for clarity and consistency.
1. Structure and exposition could be better organized:
   - The paper does not clearly specify which parameters are optimized for each loss. The training process is only briefly introduced around line 295, and it’s unclear which components receive gradients. A concise summary of the full optimization pipeline (e.g., which modules update under each loss) would make the method section more

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The paper makes a simple yet effective case for introducing latent dynamics into zero-shot offline RL. RLDP achieves performance on par with or exceeding that of more complex methods, demonstrating that simple latent dynamics prediction can yield strong generalization. Policy-independent formulation avoids instability from Bellman backups, leading to more reliable learning in challenging or low-coverage environments.

Weaknesses

Evaluation is performed on simulated continuous control benchmarks. Real-world data or transfer is not evaluated. Empirical results show comparable but not universally superior performance, suggesting that the method’s advantages depend on task characteristics and data diversity.

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning