R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios
Lu Zhu, Tiantian Geng, Yangye Chen, Teng Wang, Ping Lu, Feng Zheng

TL;DR
This paper introduces R-AVST, a dataset with fine-grained spatio-temporal annotations for audio-visual reasoning over complex videos, and proposes AVST-Zero, a model trained with reinforcement learning to improve reasoning in real-world scenarios.
Contribution
The paper presents the first dataset for real-world audio-visual spatio-temporal reasoning and a reinforcement learning model that directly optimizes reasoning behavior without intermediate supervision.
Findings
R-AVST contains over 5K videos with 27K objects across 100 event types.
AVST-Zero achieves competitive performance on reasoning tasks.
Extensive experiments validate the dataset's effectiveness and the model's reasoning capabilities.
Abstract
Multimodal large language models (MLLMs) have advanced rapidly in recent years, especially in video understanding tasks. However, current research focuses on simple video scenarios, failing to reflect the complex and diverse nature of real-world audio-visual events in videos. To bridge this gap, we first introduce R-AVST, a dataset for audio-visual reasoning featuring fine-grained spatio-temporal annotations. To construct it, we design a pipeline consisting of LLM-based key object extraction, automatic spatial annotation, and manual quality inspection, resulting in over 5K untrimmed videos with 27K objects across 100 types of audio-visual events. Building on this dataset, we define three core tasks for spatio-temporal reasoning in audio-visual scenes and generate more than 8K high-quality, evenly distributed question-answer pairs to effectively benchmark model…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Speech and Audio Processing
