video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

Guangzhi Sun; Yudong Yang; Jimin Zhuang; Changli Tang; Yixuan Li; Wei Li; Zejun MA; Chao Zhang

arXiv:2502.11775 · cs.CV · February 18, 2025 · 2 cites

TL;DR

This paper introduces video-SALMONN-o1, a reasoning-enhanced audio-visual large language model for general video understanding. It contributes a new reasoning dataset, a step-level preference optimization method (pDPO), and a new benchmark (RivaBench), and reports accuracy gains of 3-8% over the baseline.

Contribution

The paper presents the first open-source reasoning-enhanced audio-visual LLM for general video understanding, with a new reasoning dataset, process direct preference optimization, and a reasoning-intensive benchmark.

Findings

01 Achieves 3-8% accuracy improvements over baseline.

02 pDPO improves performance by 6-8%.

03 Enables zero-shot synthetic video detection.

Abstract

While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding…
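
The abstract names pDPO but does not state its objective. As a rough orientation only, the sketch below shows the standard DPO loss applied to one pair of alternative reasoning steps that share the same prefix. It is not the paper's pDPO: the function name `dpo_pair_loss`, its parameters, and the default `beta=0.1` are hypothetical, and the contrastive step selection mentioned in the abstract is not modelled here.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Assumption: each argument is the summed log-probability of a single
    candidate reasoning step given the same video, question, and step prefix,
    under either the trained policy or a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # step over the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed as softplus(-margin) for stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities: the policy favours the chosen step more
# than the reference does, so the loss falls below log(2).
loss = dpo_pair_loss(-1.0, -3.0, -2.0, -2.0)
print(loss < math.log(2))
```

When the policy and the reference agree on both steps the margin is zero and the loss equals log 2; the loss shrinks as the policy shifts probability toward the preferred step.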

Code & Models

Repositories

BriansIDP/video-SALMONN-o1 (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Natural Language Processing Techniques