video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models

Changli Tang; Yixuan Li; Yudong Yang; Jimin Zhuang; Guangzhi Sun; Wei Li; Zejun Ma; Chao Zhang

arXiv:2506.15220·cs.CV·September 29, 2025

TL;DR

video-SALMONN 2 introduces a novel multi-round preference optimization method that enhances caption quality and improves video understanding, achieving state-of-the-art results in video description and question answering tasks.

Contribution

The paper proposes multi-round direct preference optimization (MrDPO) with a caption-quality objective, enabling continual improvement and superior performance of large language models in video understanding.

Findings

01

Achieves SOTA results on multiple benchmarks with 3B, 7B, and 72B models.

02

Produces more detailed and accurate captions than proprietary systems.

03

Transfers captioning improvements to complex video question answering tasks.

Abstract

We present video-SALMONN 2, a family of audio-visual large language models that set new state-of-the-art (SOTA) results in video description and question answering (QA). Our core contribution is multi-round direct preference optimisation (MrDPO), paired with a caption-quality objective that jointly rewards completeness and factual accuracy. Unlike standard DPO with a fixed reference policy, MrDPO periodically refreshes the reference by bootstrapping from a newly re-initialised lightweight adapter trained on the latest preferences, avoiding reference staleness and enabling continual improvement. This strategy produces captions that are consistently more detailed and accurate than those from proprietary systems such as GPT-4o and Gemini-1.5 Pro. We further distil these gains by using our model to generate a high-quality video-caption corpus for supervised fine-tuning of new models,…
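The contrast the abstract draws between a fixed reference policy and a periodically refreshed one can be illustrated with a toy sketch. The standard DPO loss below is the well-known formulation; everything else is an assumption for illustration only: the scalar log-probabilities, the `run_rounds` helper, the step sizes, and the choice to model the refresh as a plain copy of the current policy. The paper's adapter merging and re-initialisation, and its caption-quality objective, are not reproduced here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, given sequence log-probs."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def run_rounds(num_rounds: int = 3, steps_per_round: int = 50,
               lr: float = 1.0, beta: float = 0.1):
    """Toy multi-round loop (hypothetical): the 'policy' is just two scalars.

    At the start of every round the reference is reset to the current policy,
    which stands in for the reference refresh described in the abstract.
    """
    policy = {"chosen": -10.0, "rejected": -10.0}
    history = []
    for _ in range(num_rounds):
        reference = dict(policy)  # refresh: reference := current policy
        for _ in range(steps_per_round):
            margin = ((policy["chosen"] - reference["chosen"])
                      - (policy["rejected"] - reference["rejected"]))
            # Magnitude of d(loss)/d(margin) is beta * (1 - sigmoid(beta * margin)).
            grad_scale = beta * (1.0 - math.exp(log_sigmoid(beta * margin)))
            policy["chosen"] += lr * grad_scale    # raise chosen log-prob
            policy["rejected"] -= lr * grad_scale  # lower rejected log-prob
        history.append(dpo_loss(policy["chosen"], policy["rejected"],
                                reference["chosen"], reference["rejected"], beta))
    return policy, history


if __name__ == "__main__":
    final_policy, losses = run_rounds()
    print(final_policy, losses)
```

In this toy, each round starts with the policy equal to its reference, so the loss starts again at log 2 and the gradient is at its largest. With a fixed reference, the margin would keep growing and the gradient would shrink as the sigmoid saturates.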

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

- The paper is well-written and easy to follow.
- The proposed method achieves the best performance compared to other methods across diverse metrics and datasets. In particular, it shows better performance than GPT-4o.

Weaknesses

- MrDPO seems to be very similar to the concept of GRPO. Just as GRPO updates and adopts the old policy model, MrDPO uses the updated model as a reference model. Could you compare MrDPO with GRPO?
- I'm concerned about the complexity and cost of the MrDPO pipeline. The proposed MrDPO requires evaluating the quality of every video caption using GPT-3.5. Compared with verifiable reward functions such as BLEU or ROUGE, caption evaluation with GPT is time-consuming and costly.
- It wou

Reviewer 02·Rating 4·Confidence 5

Strengths

1. video-SALMONN 2 demonstrates strong video captioning capability, surpassing many well-known models such as Gemini 1.5, VideoLLaMA3, and Qwen2.5-VL.
2. video-SALMONN 2 provides a self-constructed caption benchmark, which helps advance the video captioning capabilities of other models.
3. video-SALMONN 2 exhibits powerful audio-visual question-answering performance, achieving strong results on VideoMME.

Weaknesses

1. The paper’s first main contribution, MrDPO (a DPO variant that updates the reference model during training to improve stability), is not novel; a similar approach was already proposed in TR-DPO [1]. Moreover, the paper’s primary focus is on enhancing video captioning performance, yet the design of MrDPO is unrelated to video captioning and appears to be a generic DPO method. The motivation behind MrDPO and the mechanism by which it improves video captioning capability remain unclear.
2. The se

Reviewer 03·Rating 4·Confidence 3

Strengths

* **Principled, iterative preference learning for captioning.** MrDPO mitigates reference staleness by **merging the previous LoRA into the backbone and re-initializing a fresh LoRA proxy each round**, stabilized by a guided DPO loss with a small SFT term; ablations show cumulative reductions in caption error.
* **Task-aligned reward with atomic events.** The **missing + hallucination** metric operationalizes caption *completeness and factuality*, enabling LLM-judged preference pairs for RL.
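As an editorial illustration of the missing + hallucination idea mentioned in this review, the sketch below scores a caption against a list of reference atomic events and picks the lower-error caption as the preferred one. The names `caption_error` and `make_preference_pair`, the exact-string matching of events, and the choice of denominators are all assumptions made for the example; per the reviews, the paper relies on text LLMs for event extraction and preference decisions, and its exact definitions may differ.

```python
def caption_error(reference_events, caption_events):
    """Score a caption by atomic events (hypothetical definition).

    missing:       share of reference events absent from the caption
    hallucination: share of caption events absent from the reference
    """
    ref = set(reference_events)
    cap = set(caption_events)
    missing = len(ref - cap) / len(ref) if ref else 0.0
    hallucination = len(cap - ref) / len(cap) if cap else 0.0
    return {
        "missing": missing,
        "hallucination": hallucination,
        "total": missing + hallucination,
    }


def make_preference_pair(reference_events, candidate_a, candidate_b):
    """Return (chosen, rejected) by lower total error, or None on a tie."""
    err_a = caption_error(reference_events, candidate_a)["total"]
    err_b = caption_error(reference_events, candidate_b)["total"]
    if err_a == err_b:
        return None
    return (candidate_a, candidate_b) if err_a < err_b else (candidate_b, candidate_a)


if __name__ == "__main__":
    reference = ["dog barks", "man opens door", "music starts", "car passes"]
    caption_1 = ["dog barks", "man opens door", "music starts"]
    caption_2 = ["dog barks", "cat jumps"]
    print(caption_error(reference, caption_1))
    print(make_preference_pair(reference, caption_1, caption_2))
```

Exact string matching is far stricter than an LLM judge, which can treat paraphrased events as equivalent; it is used here only to keep the example self-contained.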

Weaknesses

* **Evaluator dependency and potential bias.** The atomic-event pipeline relies on **text LLMs** (e.g., GPT-3.5/4o) for both event extraction and preference decisions; while some human checks exist, the paper still **inherits evaluator bias/noise** and only partially audits it.
* **Caption→QA transfer hinges on data regeneration, not RL directly.** Authors note MrDPO mainly boosts captioning; general QA gains arrive **after** re-annotating and SFT, raising questions about how much MrDPO helps

Code & Models

Repositories

bytedance/video-salmonn-2
pytorch·Official

Models

Datasets

tsinghua-ee/video-SALMONN_2_testset
dataset· 111 dl

Taxonomy

Methods: Direct Preference Optimization