Geometry-Guided Camera Motion Understanding in VideoLLMs
Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su

TL;DR
This paper introduces a framework for understanding and improving camera motion recognition in VideoLLMs: it builds a new dataset, diagnoses current limitations, and proposes a geometric cue injection method that improves the models' camera awareness.
Contribution
It presents a large-scale synthetic dataset, a benchmark for camera motion understanding, and a lightweight method to inject geometric cues into VideoLLMs without extensive retraining.
Findings
VideoLLMs show substantial errors in recognizing camera motion primitives.
Camera motion cues are weakly represented in deeper ViT blocks.
The proposed cue injection improves motion recognition and camera awareness.
Abstract
Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a three-part framework. We curate a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap…
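To make "constraint-aware multi-label recognition" concrete, here is a minimal sketch of one way such a decoding step could look. The label names, the mutual-exclusion constraints, and the threshold are all assumptions made for illustration; the abstract does not list the paper's actual motion taxonomy or constraint set, and this is not the authors' method.

```python
# Hypothetical label set and constraints, chosen only for illustration.
EXCLUSIVE_PAIRS = [
    ("pan_left", "pan_right"),
    ("tilt_up", "tilt_down"),
    ("zoom_in", "zoom_out"),
]
STATIC = "static"


def decode(scores, threshold=0.5):
    """Turn per-label scores into a label set that satisfies the constraints."""
    # Plain multi-label step: keep every label at or above the threshold.
    picked = {label for label, s in scores.items() if s >= threshold}

    # Opposite directions cannot co-occur: keep the higher-scoring one.
    for a, b in EXCLUSIVE_PAIRS:
        if a in picked and b in picked:
            picked.discard(a if scores[a] < scores[b] else b)

    # "static" cannot co-occur with any motion label.
    motions = picked - {STATIC}
    if STATIC in picked and motions:
        if scores[STATIC] > max(scores[m] for m in motions):
            picked = {STATIC}
        else:
            picked.discard(STATIC)
    return picked


example = {
    "pan_left": 0.9, "pan_right": 0.6,
    "tilt_up": 0.2, "tilt_down": 0.1,
    "zoom_in": 0.7, "zoom_out": 0.3,
    "static": 0.55,
}
result = decode(example)
```

With these made-up scores, a plain threshold would return both pan directions plus "static"; the constraint step resolves those conflicts so the output is a physically consistent set of primitives.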
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging
