MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
Yueqian Wang, Songxiang Liu, Disong Wang, Nuo Xu, Guanglu Wan, Huishuai Zhang, Dongyan Zhao

TL;DR
This paper introduces MMDuet2, a proactive video multimodal large language model that autonomously decides when to respond during video playback using multi-turn reinforcement learning, improving response timing and quality.
Contribution
MMDuet2 is the first to apply multi-turn RL to proactive interaction in Video MLLMs, eliminating the need for manual threshold tuning and precise response-time annotations.
Findings
Outperforms existing proactive Video MLLMs in response timing and quality
Achieves state-of-the-art on ProactiveVideoQA benchmark
Trained on 52k videos with dialogue data
Abstract
Recent advances in video multimodal large language models (Video MLLMs) have significantly enhanced video understanding and multi-modal interaction capabilities. While most existing systems operate in a turn-based manner where the model can only reply after user turns, proactively deciding when to reply during video playback presents a promising yet challenging direction for real-time applications. In this work, we propose a novel text-to-text approach to proactive interaction, where the model autonomously determines whether to respond or remain silent at each turn based on dialogue history and visual context up to the current frame of a streaming video. To overcome difficulties in previous methods such as manually tuning response decision thresholds and annotating precise reply times, we introduce a multi-turn RL based training method that encourages timely and accurate responses without…
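The per-turn decision described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `run_proactive_session` and `toy_policy` are hypothetical names, frames are stood in for by short strings, and the choice to leave silent turns out of the dialogue history is an assumption. The `NO REPLY` sentinel follows the wording quoted in the peer reviews.

```python
from typing import Callable, List, Tuple

# Sentinel text for "stay silent"; the reviews mention a "NO REPLY" output.
NO_REPLY = "NO REPLY"

History = List[Tuple[str, str]]  # (role, text) pairs


def run_proactive_session(
    frames: List[str],
    policy: Callable[[History, List[str]], str],
    user_query: str,
) -> List[Tuple[int, str]]:
    """Step through the stream; at each turn the policy replies or stays silent."""
    history: History = [("user", user_query)]
    replies: List[Tuple[int, str]] = []
    for t in range(len(frames)):
        seen = frames[: t + 1]  # visual context up to the current frame only
        output = policy(history, seen)
        if output.strip() != NO_REPLY:
            # Assumption: only actual replies are added to the dialogue history.
            history.append(("assistant", output))
            replies.append((t, output))
    return replies


def toy_policy(history: History, seen_frames: List[str]) -> str:
    """Hypothetical stand-in for the model: reply once, when a goal first appears."""
    already_replied = any(role == "assistant" for role, _ in history)
    if "goal" in seen_frames[-1] and not already_replied:
        return "A goal was just scored."
    return NO_REPLY


frames = ["kickoff", "pass", "goal", "replay goal"]
replies = run_proactive_session(frames, toy_policy, "Tell me when someone scores.")
print(replies)
```

In this sketch no confidence threshold is involved: whether to speak is decided entirely by the text the policy emits, which is the property the abstract contrasts with threshold-tuned methods.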
Peer Reviews
Decision: ICLR 2026 Poster
1) This paper is among the first to investigate RL training (esp. GRPO) for proactive answering in online streaming video settings, which is of substantial novelty. Also, the paper investigates the reward, the key component of GRPO, and models it specifically for online video settings (PAUC). 2) The authors design a training dataset especially for the online video streaming setting and a corresponding chat template. Looking forward to the dataset open sourcing. 3) The MMDuet2 trained by SFT a
There are no major technical concerns about this paper, but I want to address some minor points as follows: 1) As the rewards are the key component in applying GRPO to online video settings, the reward ablations should be addressed more thoroughly. Did the authors try rewards other than PAUC? Please compare several reward formulations and discuss why PAUC is preferred. 2) The authors are encouraged to report the actual inference speed and latency of MMDuet2 to show whether it runs in real time in practical scenarios. Also, the organiz
1. The paper addresses proactive interaction, which is an important and challenging problem for making Video MLLMs more natural and practical in real-time applications. 2. The use of RL to overcome the difficulty of precise reply time annotation is a promising avenue, and the reward mechanism design theoretically considers timeliness, accuracy, and redundancy. 3. The creation of a large-scale new dataset (52k videos with two dialogue types) provides a valuable resource for research in this f
1. The central contribution of this paper lies in RL, which is explicitly designed to improve proactive interaction timing. However, the paper reports that during training on complex ego-centric video tasks, the model exhibited reward hacking behavior, generating large amounts of repetitive content. Although this issue is solved by early stopping, such manual intervention may indicate instability in the reward design and optimization process. 2. The occurrence of reward hacking indicates that the learn
1. This paper innovatively introduces multi-round reinforcement learning, using a reward mechanism to teach the model to find the optimal response time, cleverly circumventing the challenge of precise time labeling. 2. The authors constructed a large-scale dataset containing 52k videos, providing a solid data foundation for training more robust active models. 3. Experimental results show that the MMDuet2 model trained with SFT+RL outperforms previous state-of-the-art models and our own model tra
1. Using the "NO REPLY" text token is a concise and universal approach, but it also means that if the model chooses not to respond, a complete generation process (generating both tokens) is still required, which limits its inference efficiency. 2. The study on the reward component is insufficient, and related ablation experiments are lacking. The total reward is a weighted sum of four components, but the paper only mentions "a certain hyperparameter search" providing a set of weights without con
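Two ideas raised in the reviews, a total reward formed as a weighted sum of components and GRPO-style training, can be sketched in a few lines of Python. This is a hedged illustration only: the component names and weights are placeholders (the reviews name timeliness, accuracy, and redundancy; the fourth component is not identified in the text above, so it appears here as `other`), and the paper's PAUC reward is not reproduced. The group-relative normalization follows the standard GRPO formulation; the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List


def total_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of reward components (names and weights are placeholders)."""
    if components.keys() != weights.keys():
        raise ValueError("components and weights must share the same keys")
    return sum(weights[name] * components[name] for name in components)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: normalize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder weights; the values actually used in the paper are not given here.
weights = {"timeliness": 0.25, "accuracy": 0.25, "redundancy": 0.25, "other": 0.25}

# Two hypothetical rollouts for the same video and query.
rollout_a = {"timeliness": 1.0, "accuracy": 0.5, "redundancy": 0.0, "other": 0.5}
rollout_b = {"timeliness": 0.0, "accuracy": 0.5, "redundancy": 0.0, "other": 0.5}

rewards = [total_reward(rollout_a, weights), total_reward(rollout_b, weights)]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group, only the differences between rollouts matter; this is one reason the relative weighting of components, which the reviewers ask to see ablated, can shape what the policy learns.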
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
