Geometry-Guided Camera Motion Understanding in VideoLLMs

Haoan Feng; Sri Harsha Musunuri; Guan-Ming Su

arXiv:2603.13119·cs.CV·March 26, 2026

TL;DR

This paper introduces a framework for understanding and improving camera motion recognition in VideoLLMs by creating a new dataset, diagnosing current limitations, and proposing a geometric cue injection method to enhance model awareness.

Contribution

It presents a large-scale synthetic dataset, a benchmark for camera motion understanding, and a lightweight method to inject geometric cues into VideoLLMs without extensive retraining.

Findings

01. VideoLLMs show substantial errors in recognizing camera motion primitives.

02. Camera motion cues are weakly represented in deeper ViT blocks.

03. The proposed cue injection improves motion recognition and camera awareness.

Abstract

Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of benchmarking, diagnosis, and injection. We curate CameraMotionDataset, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, CameraMotionVQA. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap…
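To make the phrase "constraint-aware multi-label recognition" concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a hypothetical label set (`pan_left`, `pan_right`, `tilt_up`, `tilt_down`, `zoom_in`, `zoom_out`, `static`) and hypothetical constraints (opposite directions are mutually exclusive, and `static` cannot co-occur with any motion). The abstract does not list the actual primitives or constraints, so every name and rule here is an assumption.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical label names and constraints, chosen for illustration only.
# The paper's actual primitive set and constraint rules may differ.
EXCLUSIVE_PAIRS: List[Tuple[str, str]] = [
    ("pan_left", "pan_right"),
    ("tilt_up", "tilt_down"),
    ("zoom_in", "zoom_out"),
]
STATIC = "static"


def decode(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Turn per-label scores into a label set that satisfies the constraints."""
    # Step 1: ordinary multi-label thresholding.
    active = {label for label, s in scores.items() if s >= threshold}

    # Step 2: within each mutually exclusive pair, keep the higher-scoring label
    # (on a tie, the first label of the pair is kept).
    for a, b in EXCLUSIVE_PAIRS:
        if a in active and b in active:
            active.discard(a if scores[a] < scores[b] else b)

    # Step 3: "static" cannot co-occur with any motion label.
    motion = active - {STATIC}
    if STATIC in active and motion:
        if scores[STATIC] >= max(scores[m] for m in motion):
            active = {STATIC}
        else:
            active.discard(STATIC)

    # Step 4: if no label clears the threshold, fall back to "static".
    if not active:
        active = {STATIC}
    return active


example = {"pan_left": 0.9, "pan_right": 0.6, "zoom_in": 0.7, "static": 0.55}
print(sorted(decode(example)))
```

In this sketch the constraints are applied as a post-processing step on independent per-label scores; a trained model could instead enforce them in the loss or the output layer, which is a design choice the abstract does not specify.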

Code & Models

Datasets

fengyee/camera-motion-dataset-and-benchmark
dataset· 77 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging