Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning
Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine

TL;DR
This paper introduces Posterior Behavioral Cloning (PostBC), a novel pretraining method that models the posterior distribution of demonstrator behavior, leading to more effective RL finetuning in robotic tasks.
Contribution
The paper proposes modeling the posterior distribution of demonstrations instead of exact imitation, ensuring better coverage and improved RL finetuning performance.
Findings
PostBC guarantees coverage over demonstrator actions.
PostBC outperforms standard behavioral cloning in RL finetuning.
PostBC is effective on robotic control benchmarks and real-world tasks.
Abstract
Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator --…
Peer Reviews
Decision: Submitted to ICLR 2026
- The idea of leveraging a posterior distribution over actions for better RL finetuning efficiency is conceptually novel and practically meaningful.
- Clearly identifies and addresses a core limitation of standard BC in the context of RL finetuning.
- Provides a well-developed theoretical framework with sound reasoning about action coverage and RL-finetuning potential.
- If I understand correctly, the authors perturb the training actions with Gaussian noise, train multiple policies on these perturbed datasets, and compute the covariance of their predicted actions to approximate a posterior distribution. This approach seems somewhat ad hoc. If the base models have sufficient capacity to memorize the data, the estimated covariance would simply reflect the injected noise, effectively collapsing back to the σ-BC baseline. In classical bootstrap ensemble methods, o
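The procedure this reviewer describes (perturb the training actions with Gaussian noise, fit several policies on the perturbed data, and take the covariance of their predicted actions) can be sketched as follows. This is a minimal illustration of the reviewer's reading, not the authors' implementation: the scalar states, the linear policy, the default hyperparameters, and the names `fit_linear`, `train_ensemble`, and `predict_mean_cov` are all assumptions made for the example.

```python
import random


def fit_linear(states, targets):
    """Least-squares fit of target = w * state + b for scalar states (stand-in policy)."""
    n = len(states)
    mean_s = sum(states) / n
    mean_t = sum(targets) / n
    var_s = sum((s - mean_s) ** 2 for s in states)
    cov_st = sum((s - mean_s) * (t - mean_t) for s, t in zip(states, targets))
    w = cov_st / var_s if var_s > 0 else 0.0
    return w, mean_t - w * mean_s


def train_ensemble(states, actions, num_members=8, noise_std=0.1, seed=0):
    """Fit one linear policy per member, each on actions perturbed by Gaussian noise."""
    rng = random.Random(seed)
    action_dim = len(actions[0])
    ensemble = []
    for _ in range(num_members):
        # Each member sees its own noisy copy of the demonstrated actions.
        noisy = [[a + rng.gauss(0.0, noise_std) for a in act] for act in actions]
        # One (w, b) pair per action dimension.
        member = [fit_linear(states, [row[d] for row in noisy])
                  for d in range(action_dim)]
        ensemble.append(member)
    return ensemble


def predict_mean_cov(ensemble, state):
    """Mean and sample covariance of the members' predicted actions at one state."""
    preds = [[w * state + b for (w, b) in member] for member in ensemble]
    k, d = len(preds), len(preds[0])
    mean = [sum(p[i] for p in preds) / k for i in range(d)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in preds) / (k - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


# Hypothetical toy demonstrations: action = (s, -s) at each state s.
demo_states = [0.0, 0.5, 1.0, 1.5]
demo_actions = [[s, -s] for s in demo_states]
policies = train_ensemble(demo_states, demo_actions)
action_mean, action_cov = predict_mean_cov(policies, 1.0)
```

In this toy setting the only source of disagreement between members is the injected noise, so setting `noise_std=0.0` drives the covariance to zero. That is the situation the reviewer's concern refers to.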
I like the emphasis on providing theoretical foundations to justify the model's algorithmic choices. Showing sample bounds and the effects on policy cumulative reward as a function of the number of actions, states, and timesteps provides better justification for the problem. It is more insightful than just proposing an algorithm. I like the authors' solution because of its ease of implementation and potential applications in continuous control.
Technically, the paper has not violated the page limit, but it ends abruptly on page 9 without a clear conclusion. This ending indicates that more time is needed to condense the current draft to fit the page limit, without the additional page in the rebuttal. Figures 2 - 5 are squished together as a byproduct of this. It isn't easy to see the content of Figure 2, and the legend in Figure 3 dominates one of the presented plots. Discussion across many sections can likely be compressed to address
The paper has a theoretical contribution on how the proposed posterior behavior cloning could be provably better than the standard behavior cloning, in terms of the action coverage. Based on the findings, the paper proposes a simple instantiation of the posterior behavior cloning for continuous control settings.
While the paper explicitly states that “there do not exist any approaches which aim to pretrain policies with a BC-like objective on demonstration data, with the aim of obtaining an initialization that is an effective starting point of finetuning”, this is exactly what the typical meta-learning for supervised learning does, and behavior cloning is just an example of supervised learning (the outer loop of gradient-based meta-learning corresponds to the “posterior behavior cloning”). Even if we na
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Multimodal Machine Learning Applications
