RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

Muyi Sun; Yixuan Wang; Hong Wang; Chen Su; Man Zhang; Xingqun Qi; Qi Li; Zhenan Sun

arXiv:2603.09809 · cs.CV · March 11, 2026

TL;DR

This paper introduces a new fine-grained audio-visual learning task called Region-Aware Sound Source Understanding, supported by new datasets and a novel transformer-based model that achieves state-of-the-art results.

Contribution

The paper proposes a novel fine-grained AV learning task, creates two annotated datasets, and develops a transformer-based model with modules for improved sound source understanding.

Findings

01 Achieves state-of-the-art performance on sound source understanding benchmarks.

02 Demonstrates the effectiveness of the proposed datasets and model modules.

03 Validates the feasibility of fine-grained, region-aware AV learning.

Abstract

Audio-Visual Learning (AVL) is a fundamental task in multi-modality learning and embodied intelligence, playing a vital role in scene understanding and interaction. However, previous research has mostly explored downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). To provide more specific scene perception details, we define a new fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we construct two corresponding datasets, i.e., fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies