CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

Hang Wu; Yujun Cai; Zehao Li; Haonan Ge; Bowen Sun; Junsong Yuan; Yiwei Wang

arXiv:2602.00181 · cs.CV · April 15, 2026

TL;DR

CamReasoner introduces a structured inference framework for understanding camera movements in videos, leveraging explicit reasoning and reinforcement learning to improve accuracy over existing models.

Contribution

CamReasoner reformulates camera movement understanding as a structured inference task, trained on a dataset of 18k SFT reasoning chains and 38k RL feedback samples. The authors describe it as the first use of RL for logical alignment in this task.

Findings

01. Improves binary classification accuracy from 73.8% to 78.4%.

02. Enhances VQA accuracy from 60.9% to 74.5%.

03. First to employ RL for logical alignment in camera movement understanding.
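To put the two accuracy figures above on a common footing, the short sketch below computes the absolute gain in percentage points and the relative gain over the baseline. The function name `accuracy_gain` is illustrative only and does not come from the paper; the inputs are the four percentages quoted in the findings.

```python
def accuracy_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent).

    Both inputs are accuracies expressed as percentages, e.g. 73.8.
    Results are rounded to one decimal place to match the reported precision.
    """
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


# Figures quoted in the findings list.
binary_gain = accuracy_gain(73.8, 78.4)  # binary classification
vqa_gain = accuracy_gain(60.9, 74.5)     # VQA

print(f"Binary classification: +{binary_gain[0]} points ({binary_gain[1]}% relative)")
print(f"VQA: +{vqa_gain[0]} points ({vqa_gain[1]}% relative)")
```

On these numbers the VQA improvement (13.6 points) is roughly three times the binary classification improvement (4.6 points).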

Abstract

Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to articulate spatio-temporal observations and reason about motion patterns within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. To the best of our knowledge, we are the first to…
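The O-T-A paradigm described in the abstract implies that a model response has three ordered parts: an observation, a reasoning block, and a final answer. The sketch below shows one way such a response could be parsed and checked for well-formedness. It is a minimal illustration under stated assumptions: the tag names `<observation>`, `<think>`, and `<answer>`, the `OTAResponse` type, and the `format_score` helper are all hypothetical, since this excerpt does not specify the actual markup or any reward design used by CamReasoner.

```python
import re
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: these tag names are hypothetical placeholders; the excerpt
# does not say how the three O-T-A segments are actually delimited.
OTA_PATTERN = re.compile(
    r"\s*<observation>(?P<observation>.*?)</observation>\s*"
    r"<think>(?P<thinking>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


@dataclass
class OTAResponse:
    observation: str
    thinking: str
    answer: str


def parse_ota(text: str) -> Optional[OTAResponse]:
    """Parse a response into its three segments, or return None if the
    segments are missing, out of order, or empty."""
    match = OTA_PATTERN.fullmatch(text)
    if match is None:
        return None
    parts = {name: value.strip() for name, value in match.groupdict().items()}
    if not all(parts.values()):
        return None
    return OTAResponse(**parts)


def format_score(text: str) -> float:
    """Hypothetical well-formedness check: 1.0 if the response follows the
    assumed O-T-A layout, 0.0 otherwise."""
    return 1.0 if parse_ota(text) is not None else 0.0


sample = (
    "<observation>The whole scene shifts left across the frames.</observation>\n"
    "<think>A uniform leftward shift is consistent with the camera "
    "rotating to the right.</think>\n"
    "<answer>pan right</answer>"
)
parsed = parse_ota(sample)
print(parsed.answer if parsed else "malformed response")
```

A check of this kind only verifies structure; it says nothing about whether the observation or the answer is correct, which would need a separate comparison against ground-truth labels.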
