Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

Milan Ganai; Katie Luo; Jonas Frey; Clark Barrett; Marco Pavone

arXiv:2602.08167 · cs.RO · May 19, 2026

TL;DR

This paper introduces R&B-EnCoRe, a self-supervised method that improves embodied reasoning in vision-language-action models by distilling relevant strategies from internet-scale knowledge without external rewards or human annotations.

Contribution

The paper presents a self-supervised framework that bootstraps embodied reasoning from internet-scale knowledge, removing the reliance on manually designed reasoning templates and on external rewards or human annotations.

Findings

01  Achieved a 28% increase in manipulation success rates.

02  Improved navigation scores by 101%.

03  Reduced collision rates by 21%.

Abstract

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards,…
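As a rough illustration of what "treating reasoning as a latent variable within importance-weighted variational inference" can look like, the sketch below scores a handful of candidate reasoning traces by how well each one explains the observed action. It shows the generic self-normalized importance-weighting step only, not the paper's actual algorithm. The function names, the three-candidate toy input, and every log-probability value are assumptions made up for illustration; in practice these quantities would come from a VLA model.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def importance_weights(log_action_lik, log_prior, log_proposal):
    """Self-normalized importance weights over K candidate reasoning traces.

    For each candidate z_k (hypothetical inputs, all in log space):
      log_action_lik[k] = log p(a | o, z_k)  action likelihood given the trace
      log_prior[k]      = log p(z_k | o)     prior over reasoning traces
      log_proposal[k]   = log q(z_k | o)     distribution the trace was sampled from
    Returns the normalized weights and the importance-weighted lower bound
    on log p(a | o), i.e. log((1/K) * sum_k w_k).
    """
    log_w = [a + p - q for a, p, q in zip(log_action_lik, log_prior, log_proposal)]
    log_norm = log_sum_exp(log_w)
    weights = [math.exp(lw - log_norm) for lw in log_w]
    bound = log_norm - math.log(len(log_w))
    return weights, bound


# Toy example with made-up numbers: three candidate traces sampled from the
# prior itself (so prior and proposal cancel), differing only in how well
# they predict the demonstrated action.
log_action_lik = [-0.5, -2.0, -4.0]
log_prior = [-1.0, -1.0, -1.0]
log_proposal = [-1.0, -1.0, -1.0]

weights, bound = importance_weights(log_action_lik, log_prior, log_proposal)

# Index of the trace that best explains the action; a refinement step could
# keep or resample traces in proportion to these weights.
best = max(range(len(weights)), key=weights.__getitem__)
```

Under these assumptions, traces that make the demonstrated action more likely receive larger weights, so no external reward or human label is needed to rank them. How the paper samples candidates and builds its refined training dataset is not specified in the truncated abstract above.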

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Action Observation and Synchronization