video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun Ma, Chao Zhang

TL;DR
This paper introduces video-SALMONN-o1, a reasoning-enhanced audio-visual large language model for general video understanding. It comes with a new reasoning dataset, a step-level preference optimization method (pDPO), and a new benchmark (RivaBench), and reports 3-8% accuracy improvements over the baseline.
Contribution
The paper presents the first open-source reasoning-enhanced audio-visual LLM for general video understanding, with a new reasoning dataset, process direct preference optimization, and a reasoning-intensive benchmark.
Findings
Achieves 3-8% accuracy improvements over baseline.
pDPO improves performance by 6-8%.
Enables zero-shot synthetic video detection.
Abstract
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding…
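The abstract names pDPO but does not state its objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single pair of candidate reasoning steps, one preferred and one dispreferred. The function name, its arguments, and the value of `beta` are assumptions made for this example, and the paper's contrastive step selection is not reproduced here.

```python
import math


def dpo_step_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair of steps.

    All inputs are log-probabilities of a reasoning step under the policy
    (logp_*) and under a frozen reference model (ref_logp_*). This is a
    hypothetical helper, not the paper's pDPO implementation.
    """
    # Implicit reward margin: how much more the policy favours the chosen
    # step over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both steps the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the preferred step.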
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Natural Language Processing Techniques
