Sounding that Object: Interactive Object-Aware Image to Audio Generation

Tingle Li; Baihe Huang; Xiaobin Zhuang; Dongya Jia; Jiawei Chen; Yuping Wang; Zhuo Chen; Gopala Anumanchipalli; Yuxuan Wang

arXiv:2506.04214·cs.CV·June 5, 2025

Open Access

TL;DR

This paper introduces an interactive object-aware image-to-audio generation model that enables users to generate sounds for specific objects in images, leveraging object-centric learning and attention mechanisms for improved alignment.

Contribution

The paper presents a novel model integrating object-centric learning with diffusion models, allowing interactive, object-level sound generation grounded in image segmentation.

Findings

01 Outperforms baselines in object-sound alignment

02 Employs an attention mechanism that approximates segmentation masks

03 Enables user interaction for targeted sound generation

Abstract

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment…
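The abstract's central idea, that learned attention over image regions can be swapped for a segmentation mask at test time, can be illustrated with a toy sketch. This is not the paper's implementation: the patch features, the query vector, and the function names `attention_weights`, `mask_weights`, and `attend` are all hypothetical, and the real model operates inside a latent diffusion network rather than on hand-written lists. The sketch only shows that when softmax attention is sharply peaked on an object's patches, the resulting conditioning vector is close to the one obtained by averaging over a hard mask for that object.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, patch_keys):
    """Hypothetical training-time weights: softmax of scaled dot products."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in patch_keys]
    return softmax(scores)

def mask_weights(mask):
    """Hypothetical test-time weights: uniform over user-selected patches."""
    selected = sum(mask)
    if selected == 0:
        raise ValueError("mask selects no patches")
    return [m / selected for m in mask]

def attend(weights, patch_features):
    """Weighted average of patch feature vectors (the conditioning vector)."""
    dim = len(patch_features[0])
    return [sum(w * f[d] for w, f in zip(weights, patch_features)) for d in range(dim)]

# Toy image: four patches, the first two belong to object A, the last two to object B.
patch_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Assumed audio-side query that aligns with object A's features.
query = [4.0, 0.0]
attn = attention_weights(query, patch_features)
attn_context = attend(attn, patch_features)

# Assumed user-selected segmentation mask covering object A.
mask = [1, 1, 0, 0]
mask_context = attend(mask_weights(mask), patch_features)

# Gap between the two conditioning vectors; small when attention is peaked.
gap = max(abs(a - b) for a, b in zip(attn_context, mask_context))
print(attn_context, mask_context, gap)
```

In this toy setting the mask-based vector is exactly the mean of object A's patch features, while the attention-based vector leaks a small amount of weight onto object B. Scaling the query up sharpens the softmax and shrinks the gap, which is the intuition behind treating a segmentation mask as a stand-in for attention.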

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need · Diffusion