Learning Robot Manipulation from Audio World Models

Fan Zhang; Michael Gienger

arXiv:2512.08405 · cs.RO · December 10, 2025

Open Access

TL;DR

This paper introduces a generative model that predicts future audio observations to improve robot manipulation tasks involving multimodal reasoning, especially in scenarios where audio cues are crucial for understanding physical interactions.

Contribution

The paper presents a novel generative latent flow matching model for anticipating future audio, enhancing robot decision-making in multimodal manipulation tasks.
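As a rough illustration of what flow matching in a latent space involves, here is a minimal NumPy sketch. It assumes a straight-line probability path between noise and target latents (a common choice, not confirmed by this summary) and uses a hypothetical `velocity_fn(x, t, context)` in place of a learned network. The paper's actual audio encoder, network, and conditioning are not described here, so every name below is illustrative only.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to target x1 at time t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return x1 - x0

def flow_matching_loss(velocity_fn, x0, x1, t, context):
    """Mean squared error between the predicted and the target velocity."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t, context)
    return float(np.mean((pred - target_velocity(x0, x1)) ** 2))

def sample(velocity_fn, x0, context, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, context) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt, context)
    return x

# Toy data: random vectors stand in for future audio latents (assumption).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # noise samples
x1 = rng.standard_normal((4, 8))   # hypothetical future audio latents
t = rng.uniform(size=(4, 1))       # one time value per sample

# Hypothetical "oracle" field that already knows the right velocity;
# a trained network conditioned on past observations would replace it.
oracle = lambda x, t, context: x1 - x0

loss = flow_matching_loss(oracle, x0, x1, t, context=None)
predicted = sample(oracle, x0, context=None, steps=10)
```

In a real system the oracle would be replaced by a network trained to minimize this loss, with past audio (and possibly other modalities) passed as `context`, and the sampled latents would then be fed to the robot policy.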

Findings

01. Outperforms baseline methods in tasks requiring audio perception

02. Accurate future audio prediction improves manipulation success

03. Highlights importance of rhythmic pattern prediction in audio-based learning

Abstract

World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning; for example, filling a bottle with water will lead to visual information alone being ambiguous or incomplete, thereby requiring reasoning over the temporal evolution of audio, accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model to anticipate future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. We demonstrate the superior capabilities of our system through two manipulation tasks that require perceiving in-the-wild audio or music signals, compared to methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multi-modal input, but…

Taxonomy

Topics: Music Technology and Sound Studies · Multimodal Machine Learning Applications · Music and Audio Processing