AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation
Ziyang Luo, Nian Liu, Fahad Shahbaz Khan, Junwei Han

TL;DR
AURORA is a novel framework that enhances reference audio-visual segmentation by integrating structured reasoning and reinforcement learning, leading to improved accuracy and genuine understanding of multimodal cues.
Contribution
It introduces a Chain-of-Thought prompting mechanism, a segmentation feature distillation loss, and a two-stage training strategy with self-correction and reinforcement learning, advancing reasoning and segmentation performance.
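As a rough illustration of what a segmentation feature distillation loss could look like, the sketch below penalizes the distance between the features of the model being trained and those of a frozen segmentation-only reference model. This summary does not give the paper's actual formulation, so everything here is an assumption: the mean-squared-error form, the function names `feature_distillation_loss` and `total_loss`, and the `distill_weight` parameter are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def feature_distillation_loss(student: Sequence[float],
                              teacher: Sequence[float]) -> float:
    """Mean squared error between student and frozen teacher features.

    Assumption: MSE is one common choice for feature distillation;
    the paper may use a different distance.
    """
    if len(student) != len(teacher):
        raise ValueError("feature vectors must have the same length")
    if not student:
        raise ValueError("feature vectors must be non-empty")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def total_loss(seg_loss: float,
               reasoning_loss: float,
               distill_loss: float,
               distill_weight: float = 1.0) -> float:
    """Hypothetical combined objective: task losses plus weighted distillation."""
    return seg_loss + reasoning_loss + distill_weight * distill_loss


# Toy usage with hand-written feature vectors.
student_features = [0.0, 0.0]
teacher_features = [1.0, 3.0]
d = feature_distillation_loss(student_features, teacher_features)
print(d, total_loss(1.0, 2.0, d, distill_weight=0.5))
```

The intuition, under these assumptions, is that the distillation term keeps the segmentation features close to those of a model trained purely for segmentation, so that adding reasoning training does not erode pixel-level precision.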
Findings
Achieves state-of-the-art results on Ref-AVS benchmarks.
Effectively generalizes to unreferenced segmentation tasks.
Enhances reasoning capabilities without compromising segmentation accuracy.
Abstract
Reference Audio-Visual Segmentation (Ref-AVS) tasks challenge models to precisely locate sounding objects by integrating visual, auditory, and textual cues. Existing methods often lack genuine semantic understanding, tending to memorize fixed reasoning patterns. Furthermore, jointly training for reasoning and segmentation can compromise pixel-level precision. To address these issues, we introduce AURORA, a novel framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. We employ a structured Chain-of-Thought (CoT) prompting mechanism to guide the model through a step-by-step reasoning process and introduce a novel segmentation feature distillation loss to effectively integrate these reasoning abilities without sacrificing segmentation performance. To further cultivate the model's genuine reasoning capabilities, we devise a further…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Multimodal Machine Learning Applications
