Learning Robot Manipulation from Audio World Models
Fan Zhang, Michael Gienger

TL;DR
This paper introduces a generative model that predicts future audio observations to improve robot manipulation in tasks that require multimodal reasoning, especially where audio cues are crucial for understanding physical interactions.
Contribution
The paper presents a novel generative latent flow matching model for anticipating future audio, enhancing robot decision-making in multimodal manipulation tasks.
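As a rough illustration of what "flow matching" in a latent space can mean, the sketch below uses the common straight-line formulation: a velocity model is trained to match the constant velocity of a linear path from a noise sample to a target latent, and samples are generated by integrating that velocity field. This is a generic sketch, not the authors' architecture or training setup, which this summary does not describe. The names `velocity_fn`, `context`, and the latent shapes are hypothetical, and an oracle stands in for a learned network.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to target x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return x1 - x0

def flow_matching_loss(velocity_fn, x0, x1, t, context):
    """Mean squared error between the predicted and the target velocity."""
    xt = interpolate(x0, x1, t)
    v_pred = velocity_fn(xt, t, context)
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, context, steps=10):
    """Integrate dx/dt = v(x, t, context) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t, context)
    return x

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))       # hypothetical future-audio latents (assumed shape)
x0 = rng.normal(size=(4, 8))       # noise samples
context = rng.normal(size=(4, 8))  # hypothetical encoding of past audio
t = rng.uniform(size=(4, 1))       # one random time per example

def oracle(x, t, context):
    """Stand-in for a learned network: returns the exact target velocity."""
    return x1 - x0

loss = flow_matching_loss(oracle, x0, x1, t, context)
sample = euler_sample(oracle, x0, context)
```

With the oracle velocity the loss is zero and Euler integration carries `x0` onto `x1`. In practice `velocity_fn` would be a trained network conditioned on past observations, and how its predictions feed a robot policy is outside the scope of this sketch.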
Findings
Outperforms methods without future lookahead on two manipulation tasks that require perceiving in-the-wild audio or music signals
Accurate future audio prediction improves manipulation success
Highlights importance of rhythmic pattern prediction in audio-based learning
Abstract
World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning; for example, filling a bottle with water will lead to visual information alone being ambiguous or incomplete, thereby requiring reasoning over the temporal evolution of audio, accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model to anticipate future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. We demonstrate the superior capabilities of our system through two manipulation tasks that require perceiving in-the-wild audio or music signals, compared to methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multi-modal input, but…
Taxonomy
Topics: Music Technology and Sound Studies · Multimodal Machine Learning Applications · Music and Audio Processing
