SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
Khanh Binh Nguyen, Chae Jung Park

TL;DR
SOUPLE introduces learnable prompt contexts to improve audio-visual localization and segmentation by better capturing semantic correspondence between modalities, outperforming fixed prompt methods on multiple datasets.
Contribution
The paper proposes a prompt learning approach that replaces fixed prompts with learnable context tokens; these tokens incorporate visual features to improve audio-visual localization and segmentation.
Findings
Improved localization and segmentation performance on VGGSound, SoundNet, and AVSBench datasets.
Learnable prompts effectively bridge semantic gaps between audio and visual inputs.
Outperforms fixed prompt methods in multimodal localization tasks.
Abstract
Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
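The idea of replacing the fixed prompt "a photo of a [V_A]" with learnable, visually conditioned context tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how visual features enter the context tokens, so the additive shift through a single linear projection is an assumption, and the names build_prompt, context_tokens, proj, visual_feature, and audio_token are hypothetical. In training, context_tokens and proj would be the learnable parameters; here they are just randomly initialized lists.

```python
import random


def build_prompt(context_tokens, visual_feature, audio_token, proj):
    """Return prompt tokens: visually conditioned context, then the audio token."""
    # Assumption: project the visual feature into token space with one linear map.
    shift = [sum(w * x for w, x in zip(row, visual_feature)) for row in proj]
    # Assumption: add the same visual shift to every learnable context token.
    conditioned = [[c + s for c, s in zip(tok, shift)] for tok in context_tokens]
    # Place the audio-embedded token [V_A] where the class word would normally go.
    return conditioned + [list(audio_token)]


rng = random.Random(0)
dim, vis_dim, n_ctx = 4, 3, 2  # toy sizes, chosen only for illustration

# Stand-ins for learnable parameters (would be optimized during training).
context_tokens = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(n_ctx)]
proj = [[rng.gauss(0, 0.02) for _ in range(vis_dim)] for _ in range(dim)]

# Stand-ins for encoder outputs.
visual_feature = [rng.random() for _ in range(vis_dim)]
audio_token = [rng.random() for _ in range(dim)]

prompt = build_prompt(context_tokens, visual_feature, audio_token, proj)
print(len(prompt), "tokens of dimension", len(prompt[0]))
```

In this sketch the prompt has n_ctx context tokens followed by one audio token. Unlike a fixed prompt, the context changes with the image, which is one plausible way the context could carry visual cues to a downstream mask decoder.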
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
