Sounding that Object: Interactive Object-Aware Image to Audio Generation
Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang

TL;DR
This paper introduces an interactive object-aware image-to-audio generation model that enables users to generate sounds for specific objects in images, leveraging object-centric learning and attention mechanisms for improved alignment.
Contribution
The paper presents a novel model integrating object-centric learning with diffusion models, allowing interactive, object-level sound generation grounded in image segmentation.
Findings
Outperforms baselines in object-sound alignment
Employs an attention mechanism that approximates test-time segmentation masks
Enables user interaction for targeted sound generation
Abstract
Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment…
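The idea that learned attention over image regions can stand in for a user-supplied segmentation mask at test time can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `attend_patches`, the tensor shapes, the scaled dot-product form, and the choice of uniform weights over the masked patches are all hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_patches(audio_queries, patch_keys, patch_values, mask=None):
    """Pool image-patch features for each audio query (hypothetical helper).

    audio_queries: (T, d) queries derived from audio latents
    patch_keys:    (P, d) keys derived from image patches
    patch_values:  (P, d_v) values derived from image patches
    mask:          optional (P,) boolean array, True for patches inside
                   the user-selected object
    Returns (context, weights) with shapes (T, d_v) and (T, P).
    """
    num_queries, dim = audio_queries.shape
    if mask is None:
        # Training-style path: soft attention over all patches.
        scores = audio_queries @ patch_keys.T / np.sqrt(dim)
        weights = softmax(scores, axis=-1)
    else:
        # Test-time path (assumption): swap the learned weights for a
        # normalized mask, so only the selected object's patches contribute.
        m = np.asarray(mask, dtype=float)
        if m.sum() == 0:
            raise ValueError("mask must select at least one patch")
        weights = np.tile(m / m.sum(), (num_queries, 1))
    return weights @ patch_values, weights

# Example: 4 audio queries, 6 image patches, object covers patches 1 and 2.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 5))
object_mask = np.array([False, True, True, False, False, False])

soft_context, soft_weights = attend_patches(q, k, v)
masked_context, masked_weights = attend_patches(q, k, v, mask=object_mask)
```

In both paths each row of the weight matrix sums to one, so the conditioning signal passed to the diffusion model keeps the same form whether it comes from learned attention or from a mask. That shared form is what makes the substitution plausible in this sketch.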
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need · Diffusion
