MAGIC: Map-Guided Few-Shot Audio-Visual Acoustics Modeling
Diwei Huang, Kunyang Lin, Peihao Chen, Qing Du, Mingkui Tan

TL;DR
This paper introduces a map-guided framework for few-shot audio-visual acoustics modeling, leveraging semantic feature maps and transformer-based encoding to accurately synthesize room impulse responses with limited data.
Contribution
It proposes a novel map-guided approach that constructs semantic feature maps and employs diffusion and transformer models for improved acoustic scene understanding.
Findings
Synthesizes room impulse responses (RIRs) effectively from limited observations
Outperforms baseline methods on Matterport3D and Replica datasets
Demonstrates the importance of semantic maps in acoustic modeling
Abstract
Few-shot audio-visual acoustics modeling seeks to synthesize the room impulse response in arbitrary locations with few-shot observations. To sufficiently exploit the provided few-shot data for accurate acoustic modeling, we present a *map-guided* framework by constructing acoustic-related visual semantic feature maps of the scenes. Visual features preserve semantic details related to sound and maps provide explicit structural regularities of sound propagation, which are valuable for modeling environment acoustics. We thus extract pixel-wise semantic features derived from observations and project them into a top-down map, namely the **observation semantic map**. This map contains the relative positional information among points and the semantic feature information associated with each point. Yet, limited information extracted by few-shot observations on the map is not sufficient for…
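The projection of pixel-wise features into a top-down map, as described in the abstract, can be illustrated with a rough sketch. This is not the authors' implementation: it assumes a pinhole camera with known depth per pixel, a camera placed at the bottom-centre of the map and facing forward, and simple mean pooling per grid cell. The function name `project_to_topdown` and the parameters `fx`, `cx`, `cell_size`, and `map_size` are hypothetical and chosen only for illustration.

```python
import numpy as np

def project_to_topdown(features, depth, fx, cx, cell_size=0.1, map_size=64):
    """Average pixel-wise features into a top-down grid (hypothetical helper).

    features: (H, W, C) per-pixel feature array
    depth:    (H, W) depth in metres along the camera's forward axis
    fx, cx:   pinhole focal length and principal point in pixels (assumed known)
    """
    H, W, C = features.shape
    u = np.tile(np.arange(W), (H, 1))      # pixel column index for every pixel
    x = (u - cx) * depth / fx              # lateral offset in metres
    z = depth                              # forward distance in metres

    # Convert metric coordinates to grid indices; camera sits at the
    # bottom-centre column of the map (an assumption of this sketch).
    col = np.floor(x / cell_size).astype(int) + map_size // 2
    row = np.floor(z / cell_size).astype(int)
    valid = (depth > 0) & (col >= 0) & (col < map_size) \
            & (row >= 0) & (row < map_size)

    # Accumulate features and hit counts per cell, then take the mean.
    fmap = np.zeros((map_size, map_size, C))
    count = np.zeros((map_size, map_size))
    np.add.at(fmap, (row[valid], col[valid]), features[valid])
    np.add.at(count, (row[valid], col[valid]), 1)
    occupied = count > 0
    fmap[occupied] /= count[occupied][:, None]
    return fmap, occupied
```

Each occupied cell then holds both a position (its grid index) and a pooled feature vector, which matches the two kinds of information the abstract attributes to the observation semantic map. Cells with no projected pixels stay empty, which is the sparsity that few-shot observations leave on the map.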
Taxonomy
Topics: Music and Audio Processing
