FALCON: Actively Decoupled Visuomotor Policies for Loco-Manipulation with Foundation-Model-Based Coordination
Chengyang He, Ge Sun, Yue Bai, Junkai Lu, Jiadong Zhao, and Guillaume Sartoretti

TL;DR
FALCON introduces a modular visuomotor framework guided by foundation models, decoupling locomotion and manipulation for improved coordination, robustness, and generalization in loco-manipulation tasks.
Contribution
The paper presents a novel decoupled visuomotor policy framework in which a foundation model coordinates the two policies, with phase inference and a contrastive loss used to structure the shared latent space.
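To make the idea of a contrastive loss on a latent space concrete, here is a minimal sketch of a generic InfoNCE-style objective. It is not taken from the paper: the function names (`cosine`, `info_nce`), the use of cosine similarity, and the `temperature` value are all assumptions made for illustration, and the paper's actual loss may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not from the paper
) -> float:
    """Generic InfoNCE loss (hypothetical stand-in for the paper's loss).

    The loss is small when the anchor is closer to the positive than to
    every negative, and large otherwise.
    """
    pos_logit = cosine(anchor, positive) / temperature
    neg_logits = [cosine(anchor, n) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - pos_logit
```

In a phase-structured latent space, one plausible choice (again an assumption) would be to treat embeddings from the same task phase as positives and embeddings from other phases as negatives.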
Findings
Outperforms centralized and decentralized baselines.
Shows improved robustness to out-of-distribution scenarios.
Enables precise end-effector placement and navigation.
Abstract
We present FoundAtion-model-guided decoupled LoCO-maNipulation visuomotor policies (FALCON), a framework for loco-manipulation that combines modular diffusion policies with a vision-language foundation model as the coordinator. Our approach explicitly decouples locomotion and manipulation into two specialized visuomotor policies, allowing each subsystem to rely on its own observations. This mitigates the performance degradation that arises when a single policy is forced to fuse heterogeneous, potentially mismatched observations from locomotion and manipulation. Our key innovation lies in restoring coordination between these two independent policies through a vision-language foundation model, which encodes global observations and language instructions into a shared latent embedding conditioning both diffusion policies. On top of this backbone, we introduce a phase-progress head that uses…
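The data flow described in the abstract (one coordinator producing a shared latent, two policies each reading their own observations plus that latent) can be sketched as follows. This is a toy illustration only: `coordinator_embed`, `ToyPolicy`, and `step` are hypothetical names, the coordinator is a trivial hash-like function rather than a vision-language model, and the policies are deterministic placeholders rather than diffusion policies.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = List[float]


def coordinator_embed(
    global_obs: Sequence[float], instruction: str, dim: int = 4
) -> Vector:
    """Hypothetical stand-in for the vision-language coordinator.

    Maps global observations and a language instruction to one shared
    latent vector. A real system would use a foundation model here.
    """
    latent = [0.0] * dim
    for i, ch in enumerate(instruction):
        latent[i % dim] += ord(ch) / 1000.0
    obs_mean = sum(global_obs) / len(global_obs) if global_obs else 0.0
    return [v + obs_mean for v in latent]


@dataclass
class ToyPolicy:
    """Placeholder for a conditional diffusion policy (assumption)."""

    name: str
    action_dim: int

    def act(self, local_obs: Sequence[float], shared_latent: Vector) -> Vector:
        # Each policy sees only its own observations plus the shared latent.
        cond = sum(local_obs) + sum(shared_latent)
        return [cond * (k + 1) for k in range(self.action_dim)]


def step(
    global_obs: Sequence[float],
    instruction: str,
    loco_obs: Sequence[float],
    manip_obs: Sequence[float],
) -> Tuple[Vector, Vector, Vector]:
    """One control step: embed once, then condition both policies."""
    z = coordinator_embed(global_obs, instruction)
    locomotion = ToyPolicy("locomotion", action_dim=2)
    manipulation = ToyPolicy("manipulation", action_dim=3)
    return z, locomotion.act(loco_obs, z), manipulation.act(manip_obs, z)
```

The point of the sketch is the decoupling: changing the manipulation observations leaves the locomotion action untouched, while changing the instruction shifts the shared latent and therefore both actions.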
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
