Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning
Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

TL;DR
This paper introduces R&B-EnCoRe, a self-supervised method that improves embodied reasoning in vision-language-action models by distilling relevant strategies from internet-scale knowledge without external rewards or human annotations.
Contribution
The paper presents a self-supervised framework that bootstraps embodied reasoning from internet-scale knowledge, removing the reliance on manual templates and external supervision.
Findings
Increased manipulation success rates by 28%.
Improved navigation scores by 101%.
Reduced collision rates by 21%.
Abstract
Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards,…
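The abstract describes treating reasoning as a latent variable within importance-weighted variational inference. As a minimal sketch of that general idea, and not the authors' implementation, the snippet below scores several sampled reasoning traces by how well each one predicts the observed action, then self-normalizes the scores into weights. The function names, the three-candidate setup, and all numeric values are hypothetical; the log-probabilities are assumed to come from some model that is not shown here.

```python
import math
from typing import List, Sequence


def log_importance_weights(
    log_p_action: Sequence[float],   # log p(action | obs, reasoning_k), assumed given
    log_prior: Sequence[float],      # log p(reasoning_k | obs), assumed given
    log_proposal: Sequence[float],   # log q(reasoning_k | obs, action), assumed given
) -> List[float]:
    """Unnormalized log weights: log p(a, z_k | o) - log q(z_k | o, a)."""
    return [a + p - q for a, p, q in zip(log_p_action, log_prior, log_proposal)]


def self_normalized_weights(log_w: Sequence[float]) -> List[float]:
    """Softmax over log weights, computed stably by subtracting the max."""
    m = max(log_w)
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def importance_weighted_bound(log_w: Sequence[float]) -> float:
    """K-sample importance-weighted lower bound: log( (1/K) * sum_k w_k )."""
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w)) - math.log(len(log_w))


# Hypothetical example: three candidate reasoning traces for one observation.
# Prior and proposal are equal here, so weights depend only on action likelihood.
log_p_action = [math.log(0.1), math.log(0.6), math.log(0.3)]
log_prior = [0.0, 0.0, 0.0]
log_proposal = [0.0, 0.0, 0.0]

log_w = log_importance_weights(log_p_action, log_prior, log_proposal)
weights = self_normalized_weights(log_w)
bound = importance_weighted_bound(log_w)
best = max(range(len(weights)), key=lambda k: weights[k])
print(weights, bound, best)
```

In this toy setting, the trace that makes the action most likely receives the largest weight, which is the sense in which action-predictive reasoning could be favored when building a refined training set. How the paper actually samples, weights, and distills traces is not specified in the truncated abstract above.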
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Action Observation and Synchronization
