VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Yueqian Wang; Xiaojun Meng; Yuxuan Wang; Jianxin Liang; Jiansheng Wei; Huishuai Zhang; Dongyan Zhao

arXiv:2411.17991·cs.CV·November 25, 2025

TL;DR

This paper introduces a novel video-text duet interaction format for VideoLLMs, enabling real-time, time-sensitive video comprehension and responses during continuous video playback, which improves performance and responsiveness.

Contribution

It proposes a new interaction format and a dedicated training dataset, MMDuetIT, along with a benchmark MAGQA, to enhance real-time video understanding in VideoLLMs.

Findings

01

Achieved 76% CIDEr on YouCook2 dense captioning

02

Reached 90% mAP on QVHighlights highlight detection

03

Improved temporal grounding with 25% R@0.5 on Charades-STA

Abstract

Recent research on video large language models (VideoLLMs) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing work, users typically interact with a VideoLLM by providing the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension, where videos do not end and responses are required in real time, and it also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video…
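
As a rough illustration of the duet format described in the abstract, the Python sketch below interleaves video frames, user messages, and model replies in playback order. It is not taken from the MMDuet code base: `Event`, `run_duet`, and `toy_respond` are hypothetical names, and the reply policy is a toy stand-in for whatever decision rule the actual model uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    time: float      # seconds from the start of playback
    role: str        # "frame", "user", or "assistant"
    text: str = ""   # empty for frames


def run_duet(
    frame_times: List[float],
    user_messages: List[Event],
    respond: Callable[[float, List[Event]], Optional[str]],
) -> List[Event]:
    """Interleave frames, user messages, and model replies in playback order."""
    pending = sorted(user_messages, key=lambda e: e.time)
    timeline: List[Event] = []
    for t in frame_times:
        # User messages that arrived before this frame are inserted first.
        while pending and pending[0].time <= t:
            timeline.append(pending.pop(0))
        timeline.append(Event(t, "frame"))
        # After each frame the model may speak or stay silent (None).
        reply = respond(t, timeline)
        if reply is not None:
            timeline.append(Event(t, "assistant", reply))
    return timeline


def toy_respond(t: float, history: List[Event]) -> Optional[str]:
    # Hypothetical policy: reply only at 2 s and 4 s, and only after a question.
    asked = any(e.role == "user" for e in history)
    return f"event seen at {t:.0f}s" if asked and t in (2.0, 4.0) else None


timeline = run_duet(
    frame_times=[0.0, 1.0, 2.0, 3.0, 4.0],
    user_messages=[Event(0.5, "user", "What is happening?")],
    respond=toy_respond,
)
roles = [e.role for e in timeline]
```

In this toy run the user's question lands between the first two frames and the model replies twice while playback continues, in contrast to the whole-video-then-answer format the abstract criticizes.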

Code & Models

Repositories

yellow-binary-tree/mmduet
PyTorch · Official

Models

wangyueqian/MMDuet
model · 6 dl · ♡ 4

Datasets

wangyueqian/MMDuetIT
dataset · 30 dl

Videos

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Taxonomy

Topics: Educational Tools and Methods

Methods: Focus