AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation

Ziyang Luo; Nian Liu; Fahad Shahbaz Khan; Junwei Han

arXiv:2508.02149 · cs.CV · December 11, 2025

TL;DR

AURORA is a novel framework that enhances reference audio-visual segmentation by integrating structured reasoning and reinforcement learning, leading to improved accuracy and genuine understanding of multimodal cues.

Contribution

It introduces a Chain-of-Thought prompting mechanism, a segmentation feature distillation loss, and a two-stage training strategy with self-correction and reinforcement learning, advancing reasoning and segmentation performance.
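
The segmentation feature distillation loss named above can be illustrated with a minimal sketch. This summary does not give the loss's actual form, so everything here is an assumption: the sketch uses plain mean squared error between features from a frozen segmentation-oriented "teacher" and the reasoning-trained "student", and the names `feature_distillation_loss`, `combined_loss`, and `distill_weight` are hypothetical, not taken from the paper.

```python
from typing import Sequence


def feature_distillation_loss(
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between student and teacher feature vectors.

    Assumption: each input is a list of feature vectors (for example one
    per spatial location). MSE is a common stand-in; the distance used in
    the paper is not stated in this summary.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher need the same number of vectors")
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature vectors must have matching dimensions")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("feature maps must not be empty")
    return total / count


def combined_loss(
    task_loss: float,
    distill_loss: float,
    distill_weight: float = 1.0,  # hypothetical weighting hyperparameter
) -> float:
    """Add the weighted distillation term to the main training loss."""
    return task_loss + distill_weight * distill_loss
```

In this sketch the distillation term is zero when the student's features match the teacher's exactly, so it only penalizes drift away from the segmentation-oriented representation while the task loss drives the reasoning objective.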

Findings

01 Achieves state-of-the-art results on Ref-AVS benchmarks.

02 Effectively generalizes to unreferenced segmentation tasks.

03 Enhances reasoning capabilities without compromising segmentation accuracy.

Abstract

Reference Audio-Visual Segmentation (Ref-AVS) tasks challenge models to precisely locate sounding objects by integrating visual, auditory, and textual cues. Existing methods often lack genuine semantic understanding, tending to memorize fixed reasoning patterns. Furthermore, jointly training for reasoning and segmentation can compromise pixel-level precision. To address these issues, we introduce AURORA, a novel framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. We employ a structured Chain-of-Thought (CoT) prompting mechanism to guide the model through a step-by-step reasoning process and introduce a novel segmentation feature distillation loss to effectively integrate these reasoning abilities without sacrificing segmentation performance. To further cultivate the model's genuine reasoning capabilities, we devise a further…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Multimodal Machine Learning Applications