Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Dongnuan Cai; Henghui Du; Chang Zhou; Xi Chen; Dan Guo; Hongyuan Zhang; Xuelong Li; Di Hu

arXiv:2603.04128·cs.CV·March 5, 2026

TL;DR

Crab$^{+}$ is a scalable, unified audio-visual scene understanding model that effectively manages task heterogeneity and negative transfer through explicit cooperation, outperforming specialized models across diverse benchmarks.

Contribution

The paper introduces Crab$^{+}$, a novel AV-LLM with explicit data and model cooperation mechanisms, including a large instruction-tuning dataset and dynamic routing for inter-task relationship modeling.

Findings

1. Achieves positive transfer in 88% of tasks, reversing negative transfer trends.
2. Outperforms specialized models on multiple AV scene understanding benchmarks.
3. Demonstrates broader task coverage and robustness across AV-LLM paradigms.

Abstract

Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning endows pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which leads to negative interference under joint training. To tackle this, we present Crab$^{+}$, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with…
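
The negative-transfer figure in the abstract compares each task's score under joint training with its single-task baseline. A minimal sketch of that bookkeeping follows. The function name `transfer_rates`, the task names, and all scores are hypothetical and not taken from the paper, and the sketch assumes higher scores are better on every task.

```python
from typing import Dict


def transfer_rates(single_task: Dict[str, float],
                   multi_task: Dict[str, float]) -> Dict[str, float]:
    """Fraction of shared tasks that improve, degrade, or tie under joint training.

    Assumption: a higher score is better for every task.
    """
    tasks = sorted(single_task.keys() & multi_task.keys())
    if not tasks:
        raise ValueError("no shared tasks to compare")
    n = len(tasks)
    # Positive transfer: joint training beats the single-task baseline.
    positive = sum(multi_task[t] > single_task[t] for t in tasks)
    # Negative transfer: joint training falls below the single-task baseline.
    negative = sum(multi_task[t] < single_task[t] for t in tasks)
    return {
        "positive": positive / n,
        "negative": negative / n,
        "unchanged": (n - positive - negative) / n,
    }


# Hypothetical scores, for illustration only (not results from the paper).
single = {"task_a": 70.0, "task_b": 55.0, "task_c": 80.0, "task_d": 62.0}
joint = {"task_a": 72.5, "task_b": 53.0, "task_c": 80.0, "task_d": 65.0}

rates = transfer_rates(single, joint)
print(rates)
```

With these made-up numbers, two of four tasks improve, one degrades, and one is unchanged. The paper's own percentages would come from its benchmark results, which are not reproduced here.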

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Music and Audio Processing