Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Dongnuan Cai, Henghui Du, Chang Zhou, Xi Chen, Dan Guo, Hongyuan Zhang, Xuelong Li, Di Hu

TL;DR
Crab$^{+}$ is a scalable, unified audio-visual scene understanding model that effectively manages task heterogeneity and negative transfer through explicit cooperation, outperforming specialized models across diverse benchmarks.
Contribution
The paper introduces Crab$^{+}$, a novel AV-LLM with explicit data and model cooperation mechanisms, including a large instruction-tuning dataset and dynamic routing for inter-task relationship modeling.
Findings
Achieves positive transfer in 88% of tasks, reversing the negative transfer seen under conventional multi-task training.
Outperforms specialized models on multiple AV scene understanding benchmarks.
Demonstrates broader task coverage and robustness across AV-LLM paradigms.
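
The positive-transfer figure above comes from comparing each task's score under joint training with its single-task baseline. The following is a minimal sketch of that bookkeeping, assuming every metric is higher-is-better. The function name `transfer_summary`, the task names, and all scores are hypothetical placeholders, not values or code from the paper.

```python
def transfer_summary(single_task, multi_task):
    """Compare per-task scores from joint training against single-task baselines.

    Both arguments map task name -> score, where higher is assumed to be better.
    Only tasks present in both mappings are compared.
    """
    shared = sorted(set(single_task) & set(multi_task))
    if not shared:
        raise ValueError("no tasks in common")

    # A task shows positive transfer if joint training beats its baseline,
    # and negative transfer if joint training falls below it.
    improved = [t for t in shared if multi_task[t] > single_task[t]]
    degraded = [t for t in shared if multi_task[t] < single_task[t]]

    n = len(shared)
    return {
        "positive_rate": len(improved) / n,
        "negative_rate": len(degraded) / n,
        "improved": improved,
        "degraded": degraded,
    }


# Hypothetical scores for illustration only (not results from the paper).
single = {"task_a": 70.0, "task_b": 55.0, "task_c": 80.0, "task_d": 62.0}
joint = {"task_a": 72.5, "task_b": 53.0, "task_c": 81.0, "task_d": 64.0}

summary = transfer_summary(single, joint)
print(summary["positive_rate"], summary["degraded"])
```

With these placeholder numbers, three of the four tasks improve under joint training and one degrades. Metrics where lower is better would need their comparison flipped before being counted.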
Abstract
Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning equips pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which leads to negative interference under joint training. To tackle this, we present Crab$^{+}$, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Music and Audio Processing
