The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio
Renhao Wang, Haoran Geng, Tingle Li, Feishi Wang, Gopala Anumanchipalli, Trevor Darrell, Boyi Li, Pieter Abbeel, Jitendra Malik, Alexei A. Efros

TL;DR
This paper introduces MultiGen, a framework that combines generative models with physics simulators to enable multimodal (audio-visual) simulation for robots, facilitating zero-shot transfer of policies to real-world tasks like pouring.
Contribution
The paper presents MultiGen, which integrates large-scale generative models into traditional physics simulators to synthesize realistic audio conditioned on simulation video, advancing multimodal sim-to-real transfer for robotics.
Findings
Effective zero-shot transfer in robot pouring tasks
Realistic audiovisual simulation without real robot data
Bridging the multimodal sim-to-real gap using generative models
Abstract
Robots must integrate multiple sensory modalities to act effectively in the real world. Yet, learning such multimodal policies at scale remains challenging. Simulation offers a viable solution, but while vision has benefited from high-fidelity simulators, other modalities (e.g. sound) can be notoriously difficult to simulate. As a result, sim-to-real transfer has succeeded primarily in vision-based tasks, with multimodal transfer still largely unrealized. In this work, we tackle these challenges by introducing MultiGen, a framework that integrates large-scale generative models into traditional physics simulators, enabling multisensory simulation. We showcase our framework on the dynamic task of robot pouring, which inherently relies on multimodal feedback. By synthesizing realistic audio conditioned on simulation video, our method enables training on rich audiovisual trajectories --…
Taxonomy
Topics: Speech and dialogue systems · Human Motion and Animation · Music Technology and Sound Studies
