Seeing without Pixels: Perception from Camera Trajectories
Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han

TL;DR
This paper shows that camera trajectories alone encode enough about video content to support a range of perception tasks without any pixel data, using a contrastive learning approach that aligns trajectories with natural language.
Contribution
It introduces CamFormer, a new encoder that aligns camera trajectories with language, revealing their rich informational content for video understanding.
Findings
Camera trajectories are highly informative for video content recognition.
CamFormer embeddings perform well across diverse downstream tasks.
Representations are robust across different camera pose estimation methods.
Abstract
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed provide valuable cues about "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our…
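The abstract says only that CamFormer is trained with a contrastive framework that aligns trajectory embeddings with language in a joint space; it does not give the loss. As a rough illustration of what such an alignment objective can look like, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss. The choice of this particular loss, the temperature of 0.07, and every function name below are assumptions made for illustration, not details taken from the paper, and the embeddings are toy vectors standing in for encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(traj_embs, text_embs, temperature=0.07):
    """Hypothetical alignment loss: trajectory i should match caption i.

    traj_embs and text_embs are equal-length lists of vectors; the
    temperature value is an assumed default, not one from the paper.
    """
    traj = [l2_normalize(v) for v in traj_embs]
    text = [l2_normalize(v) for v in text_embs]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(t, c)) / temperature for c in text]
        for t in traj
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the trajectory-to-text and text-to-trajectory directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy check: correctly paired embeddings should score a much lower loss
# than the same embeddings with the captions rotated by one position.
paired = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = paired[1:] + paired[:1]
aligned_loss = symmetric_contrastive_loss(paired, paired)
mismatched_loss = symmetric_contrastive_loss(paired, shuffled)
print(aligned_loss < mismatched_loss)
```

In a real training setup the two lists would come from a trajectory encoder and a text encoder applied to a batch of paired clips and captions, and the loss would be minimized with respect to the encoder parameters; the pure-Python version here only makes the objective itself concrete.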
