video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models
Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang

TL;DR
video-SALMONN 2 introduces a novel multi-round preference optimization method that enhances caption quality and improves video understanding, achieving state-of-the-art results in video description and question answering tasks.
Contribution
The paper proposes multi-round direct preference optimization (MrDPO) with a caption-quality objective, enabling continual improvement and superior performance of large language models in video understanding.
Findings
Achieves SOTA results on multiple benchmarks with 3B, 7B, and 72B models.
Produces more detailed and accurate captions than proprietary systems.
Transfers captioning improvements to complex video question answering tasks.
Abstract
We present video-SALMONN 2, a family of audio-visual large language models that set new state-of-the-art (SOTA) results in video description and question answering (QA). Our core contribution is multi-round direct preference optimisation (MrDPO), paired with a caption-quality objective that jointly rewards completeness and factual accuracy. Unlike standard DPO with a fixed reference policy, MrDPO periodically refreshes the reference by bootstrapping from a newly re-initialised lightweight adapter trained on the latest preferences, avoiding reference staleness and enabling continual improvement. This strategy produces captions that are consistently more detailed and accurate than those from proprietary systems such as GPT-4o and Gemini-1.5 Pro. We further distil these gains by using our model to generate a high-quality video-caption corpus for supervised fine-tuning of new models,…
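The reference-refresh idea in the abstract can be sketched in a few lines. This is a toy illustration, not the authors' implementation. Weights are flat Python lists, the adapter is an additive delta, and the names `guided_dpo_loss`, `merge_and_reinit`, `sft_target_logp`, `beta`, and `sft_weight` are hypothetical. The merge-then-re-initialise step follows the description in the peer reviews on this page. The caption that supplies the SFT term and all numeric values are assumptions.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), equal to softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return neg_log_sigmoid(margin)


def guided_dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                    sft_target_logp, beta=0.1, sft_weight=0.1):
    """DPO loss plus a small SFT (negative log-likelihood) term.

    ASSUMPTION: which caption supplies sft_target_logp, and the values of
    beta and sft_weight, are illustrative and not taken from the paper.
    """
    pref = dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
    return pref + sft_weight * (-sft_target_logp)


def merge_and_reinit(backbone, adapter):
    """Fold the trained adapter into the backbone and start a fresh one.

    Toy stand-in: the merged weights act as the next round's reference,
    and the new adapter starts at zero so policy == reference initially.
    """
    merged = [b + a for b, a in zip(backbone, adapter)]
    fresh_adapter = [0.0] * len(adapter)
    return merged, fresh_adapter


# One round boundary: the reference is refreshed from the latest policy.
reference, adapter = merge_and_reinit([1.0, 2.0], [0.5, -0.5])
print(reference, adapter)
```

Because the fresh adapter starts at zero, the policy and the refreshed reference coincide at the start of each round, so the DPO margin is zero and training resumes from the latest model rather than from a stale reference.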
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is well-written and easy to follow.
- The proposed method achieves the best performance compared to other methods in diverse metrics and datasets. In particular, it shows better performance than GPT-4o.
- MrDPO seems very similar to GRPO. Just as GRPO updates and adopts the old policy model, MrDPO uses the updated model as a reference model. Could you compare MrDPO with GRPO?
- I’m concerned about the complexity and cost of the MrDPO pipeline. The proposed MrDPO requires evaluating the quality of every video caption using GPT-3.5. Compared with verifiable reward functions such as BLEU or ROUGE, caption evaluation with GPT is time-consuming and cost-intensive.
- It wou
1. video-SALMONN 2 demonstrates strong video captioning capability, surpassing many well-known models such as Gemini 1.5, VideoLLaMA3, and Qwen2.5-VL.
2. video-SALMONN 2 provides a self-constructed caption benchmark, which helps advance the video captioning capabilities of other models.
3. video-SALMONN 2 exhibits powerful audio-visual question-answering performance, achieving strong results on VideoMME.
1. The paper’s first main contribution MrDPO (a DPO variant that updates the reference model during training to improve stability) is not novel; a similar approach was already proposed in TR-DPO [1]. Moreover, the paper’s primary focus is on enhancing video captioning performance, yet the design of MrDPO is unrelated to video captioning and appears to be a generic DPO method. The motivation behind MrDPO and the mechanism by which it improves video captioning capability remain unclear.
2. The se
* **Principled, iterative preference learning for captioning.** MrDPO mitigates reference staleness by **merging the previous LoRA into the backbone and re-initializing a fresh LoRA proxy each round**, stabilized by a guided DPO loss with a small SFT term; ablations show cumulative reductions in caption error.
* **Task-aligned reward with atomic events.** The **missing + hallucination** metric operationalizes caption *completeness and factuality*, enabling LLM-judged preference pairs for RL.
* **Evaluator dependency and potential bias.** The atomic-event pipeline relies on **text LLMs** (e.g., GPT-3.5/4o) for both event extraction and preference decisions; while some human checks exist, the paper still **inherits evaluator bias/noise** and only partially audits it.
* **Caption→QA transfer hinges on data regeneration, not RL directly.** Authors note MrDPO mainly boosts captioning; general QA gains arrive **after** re-annotating and SFT, raising questions about how much MrDPO helps
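The "missing + hallucination" metric mentioned in the reviews can be illustrated with a small sketch. This is a simplified stand-in, not the paper's pipeline: the reviews say atomic events are extracted and judged by text LLMs, whereas exact string matching is used here so the example runs on its own. The function names `caption_error` and `pick_preference_pair` and the choice of denominators are assumptions.

```python
def caption_error(reference_events, caption_events):
    """Return (missing_rate, hallucination_rate, total_error).

    ASSUMPTION: missing rate is taken over reference events and
    hallucination rate over caption events; events match by exact string.
    """
    ref, cap = set(reference_events), set(caption_events)
    missing = len(ref - cap) / len(ref) if ref else 0.0
    hallucinated = len(cap - ref) / len(cap) if cap else 0.0
    return missing, hallucinated, missing + hallucinated


def pick_preference_pair(reference_events, candidates):
    """Pick (chosen, rejected) caption ids by lowest and highest total error.

    `candidates` maps a caption id to its list of atomic events.
    """
    ranked = sorted(
        candidates,
        key=lambda name: caption_error(reference_events, candidates[name])[2],
    )
    return ranked[0], ranked[-1]


reference = ["a dog barks", "a door opens", "a man speaks", "music plays"]
candidates = {
    "caption_1": ["a dog barks", "a door opens", "a man speaks"],
    "caption_2": ["a cat meows"],
}
print(caption_error(reference, candidates["caption_1"]))
print(pick_preference_pair(reference, candidates))
```

Summing the two rates penalises both omitted events and unsupported ones, which matches the completeness-and-factuality framing in the abstract. A real pipeline would replace the set difference with an LLM-based or human judgement of event coverage.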
Taxonomy
Methods: Direct Preference Optimization
