ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
Yeonkyung Lee, Dayun Ju, Youngmin Kim, Seil Kang, Seong Jae Hwang

TL;DR
ViKey combines lightweight visual prompting with a keyword-frame mapping module to significantly improve temporal reasoning in VideoLLMs while reducing computational cost.
Contribution
The paper proposes ViKey, a training-free framework that enhances temporal understanding in VideoLLMs using visual prompts and a lightweight keyword-frame mapping module.
Findings
Improves temporal reasoning in VideoLLMs with sparse frames.
Maintains dense-frame performance with only 20% of frames.
Enhances model perception of temporal continuity and references.
Abstract
Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports…
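The abstract's idea of annotating each frame with explicit ordinal information can be illustrated with a minimal sketch. This is not the authors' implementation: the frame representation (a grayscale image as a list of rows of integers), the 3x5 bitmap font, the stamp position, and the names `stamp_ordinal` and `annotate_frames` are all assumptions made for illustration, and the keyword-frame mapping module is not covered.

```python
# Hypothetical 3x5 bitmap font for digits; each string is one row, '1' = lit pixel.
DIGITS = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    "4": ["101", "101", "111", "001", "001"],
    "5": ["111", "100", "111", "001", "111"],
    "6": ["111", "100", "111", "101", "111"],
    "7": ["111", "001", "001", "001", "001"],
    "8": ["111", "101", "111", "101", "111"],
    "9": ["111", "101", "111", "001", "111"],
}


def stamp_ordinal(frame, ordinal, top=1, left=1, value=255):
    """Return a copy of `frame` with `ordinal` drawn at (top, left).

    `frame` is a grayscale image given as a list of rows of ints.
    Pixels that would fall outside the frame are skipped.
    """
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    for k, ch in enumerate(str(ordinal)):
        x0 = left + 4 * k  # 3-pixel glyph plus a 1-pixel gap
        for dy, bits in enumerate(DIGITS[ch]):
            for dx, bit in enumerate(bits):
                y, x = top + dy, x0 + dx
                if bit == "1" and 0 <= y < len(out) and 0 <= x < len(out[y]):
                    out[y][x] = value
    return out


def annotate_frames(frames):
    """Stamp each frame with its 1-based position in the given sequence."""
    return [stamp_ordinal(f, i) for i, f in enumerate(frames, start=1)]


# Three blank 8x12 frames standing in for sparsely sampled video frames.
frames = [[[0] * 12 for _ in range(8)] for _ in range(3)]
annotated = annotate_frames(frames)
```

In this sketch the stamped number is the position within the sampled sequence; whether the original method uses that or some other ordinal cue is not stated in the summary above.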
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
