VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

Fufangchen Zhao; Liao Zhang; Daiqi Shi; Yuanjun Gao; Chen Ye; Yang Cai; Jian Gao; Danfeng Yan

arXiv:2511.18823 · cs.CV · November 25, 2025

TL;DR

VideoPerceiver is a new multimodal video language model that improves fine-grained temporal perception, especially for brief actions and rare events, through a two-stage training process involving contrastive learning and reinforcement learning.

Contribution

It introduces a novel training framework with key-information-missing video construction and a relative reward mechanism to enhance temporal sensitivity in video understanding.

Findings

01 Outperforms state-of-the-art models on fine-grained action benchmarks.

02 Effectively captures transient events in long videos.

03 Maintains strong performance on standard video-language tasks.

Abstract

We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs' limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from…
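The "key-information-missing" construction described in the abstract can be illustrated with a minimal sketch. It assumes the key-frame indices have already been identified from the caption keywords (the abstract does not say how), and it interprets "replacing them with adjacent frames" as copying the nearest non-key frame. The function name `replace_key_frames`, the tie-breaking rule, and the use of string placeholders for frames are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Set


def replace_key_frames(frames: Sequence, key_indices: Set[int]) -> List:
    """Return a copy of `frames` with every key frame overwritten by the
    nearest non-key frame. Ties go to the earlier frame (an assumption)."""
    n = len(frames)
    for i in key_indices:
        if not 0 <= i < n:
            raise IndexError(f"key index {i} is outside 0..{n - 1}")

    # Frames that are allowed to serve as replacements.
    keep = [i for i in range(n) if i not in key_indices]
    if not keep:
        raise ValueError("every frame is marked as key; nothing to copy from")

    out = list(frames)
    for i in key_indices:
        # Closest non-key index by temporal distance, earlier frame on ties.
        nearest = min(keep, key=lambda j: (abs(j - i), j))
        out[i] = frames[nearest]
    return out


# Toy clip: strings stand in for decoded frames.
clip = ["f0", "f1", "f2", "f3", "f4"]
modified = replace_key_frames(clip, {2, 3})
print(modified)
```

The modified clip keeps the original length and frame rate, so the original and modified clips differ only in the frames that carried the key action. That difference is what a contrast between the two variants would depend on.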

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition