Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding
Joshua Jones, Oier Mees, Carmelo Sferrazza, Kyle Stachowicz, Pieter Abbeel, Sergey Levine

TL;DR
FuSe is a novel method that fine-tunes generalist robot policies across multiple sensory modalities using language grounding, enabling complex multimodal reasoning and manipulation in real-world scenarios and increasing success rates by over 20% in real-world tests.
Contribution
Introduces FuSe, a new approach combining contrastive and language grounding losses to adapt generalist robot policies to heterogeneous sensors without large datasets.
Findings
Enables zero-shot multimodal reasoning in robot manipulation
Increases success rates by over 20% in real-world tests
Works with diverse generalist policies including diffusion-based and VLA models
Abstract
Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In…
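The abstract describes combining a multimodal contrastive loss with a sensory-grounded language generation loss. The following is a minimal NumPy sketch of how two such terms could be computed and summed. It is not the authors' implementation: the symmetric InfoNCE form, the token-level cross-entropy, the temperature, the weights `beta` and `lam`, the `policy_loss` term, and all function names are assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(obs_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE loss (assumed form) between paired embeddings.

    obs_emb, text_emb: arrays of shape (batch, dim); row i of each is a pair.
    """
    obs = obs_emb / np.linalg.norm(obs_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = obs @ txt.T / temperature           # (batch, batch) similarities
    idx = np.arange(logits.shape[0])
    obs_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_obs = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (obs_to_text + text_to_obs)


def generation_loss(token_logits, target_ids):
    """Mean token-level cross-entropy (assumed form) for language generation.

    token_logits: (batch, seq, vocab); target_ids: integer array (batch, seq).
    """
    logp = log_softmax(token_logits, axis=-1)
    picked = np.take_along_axis(logp, target_ids[..., None], axis=-1)
    return -picked.mean()


def combined_loss(policy_loss, obs_emb, text_emb, token_logits, target_ids,
                  beta=1.0, lam=1.0):
    """Hypothetical weighted sum; beta and lam are illustrative weights."""
    return (policy_loss
            + beta * contrastive_loss(obs_emb, text_emb)
            + lam * generation_loss(token_logits, target_ids))
```

In this sketch, `obs_emb` would stand for a fused embedding of the visual, tactile, and audio observations and `text_emb` for an embedding of the matching language instruction. How those embeddings are produced is outside what the abstract specifies.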
Taxonomy
Topics: Natural Language Processing Techniques
