SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts

Khanh Binh Nguyen; Chae Jung Park

arXiv:2603.22732·cs.CV·March 25, 2026

Open Access

TL;DR

SOUPLE introduces learnable prompt contexts to improve audio-visual localization and segmentation by better capturing semantic correspondence between modalities, outperforming fixed prompt methods on multiple datasets.

Contribution

The paper proposes Sound-aware Prompt Learning (SOUPLE), a prompt learning approach that replaces fixed prompts with learnable context tokens. These tokens incorporate visual features, which improves audio-visual localization and segmentation.

Findings

01. Improved localization and segmentation performance on VGGSound, SoundNet, and AVSBench datasets.

02. Learnable prompts effectively bridge semantic gaps between audio and visual inputs.

03. Outperforms fixed prompt methods in multimodal localization tasks.

Abstract

Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
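
The idea of replacing the fixed prompt "a photo of a [V_A]" with learnable, visually conditioned context tokens can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how visual features enter the context tokens, so the additive shift below is an assumption, and `ConditionalPrompt`, `EMBED_DIM`, and `NUM_CONTEXT` are hypothetical names and values. Plain Python lists stand in for tensors, random vectors stand in for trained parameters, and the text encoder and mask decoder are omitted.

```python
import random

EMBED_DIM = 8      # hypothetical embedding width
NUM_CONTEXT = 4    # hypothetical number of learnable context tokens


def make_vector(rng, dim, scale=0.02):
    """Return a small random vector standing in for a trainable parameter."""
    return [rng.gauss(0.0, scale) for _ in range(dim)]


def linear(weights, bias, x):
    """Matrix-vector product plus bias; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class ConditionalPrompt:
    """Learnable context tokens shifted by a projection of a visual feature.

    Assumption: the conditioning is a single additive shift shared by all
    context tokens. The paper may condition the tokens differently.
    """

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Context tokens that take the place of the words "a photo of a".
        self.context = [make_vector(rng, EMBED_DIM) for _ in range(NUM_CONTEXT)]
        # Projection that maps a visual feature to a context shift.
        self.proj_w = [make_vector(rng, EMBED_DIM) for _ in range(EMBED_DIM)]
        self.proj_b = [0.0] * EMBED_DIM

    def build(self, visual_feature, audio_token):
        """Return the prompt sequence: conditioned context, then [V_A]."""
        shift = linear(self.proj_w, self.proj_b, visual_feature)
        conditioned = [[c + s for c, s in zip(token, shift)]
                       for token in self.context]
        # The audio-embedded token [V_A] fills the class-word slot.
        return conditioned + [list(audio_token)]


prompt = ConditionalPrompt(seed=0)
visual = [0.1] * EMBED_DIM   # placeholder image feature
audio = [0.5] * EMBED_DIM    # placeholder audio-embedded token [V_A]
tokens = prompt.build(visual, audio)
print(len(tokens), len(tokens[0]))
```

In this sketch the context depends on the image, so two different images paired with the same audio token yield different prompts, whereas a fixed prompt would be identical for both.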

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications