It's a Matter of Time: Three Lessons on Long-Term Motion for Perception
Willem Davison, Xinyue Hao, Laura Sevilla-Lara

TL;DR
This paper studies what long-term motion information contributes to perception. Across a variety of perceptual tasks, it reports that motion representations often match or beat image representations, generalize better in low-data and zero-shot settings, and offer a better accuracy-compute trade-off than standard video models.
Contribution
It provides three key lessons on the importance, generalization, and efficiency of long-term motion representations for perception tasks, highlighting their potential for future model design.
Findings
Long-term motion representations carry enough information to recognize actions, objects, materials, and spatial information, often better than images do.
They outperform image representations in low-data and zero-shot scenarios.
Motion representations offer a better accuracy-GFLOPs trade-off than standard video models.
Abstract
Temporal information has long been considered to be essential for perception. While there is extensive research on the role of image information for perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations and experiment on a variety of perceptual tasks. We draw 3 clear lessons: 1) Long-term motion representations contain information to understand actions, but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low…
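To make the idea of point-track input concrete, here is a minimal sketch of how a set of tracks could be reduced to simple motion features. It is an illustration under stated assumptions, not the paper's representation: the track format (a list of (x, y) positions per tracked point) and the helper names `track_displacements` and `track_summary` are hypothetical, and the example coordinates are made up.

```python
import math

# Assumption: each track is a list of (x, y) positions of one tracked point,
# one entry per frame. This format is illustrative, not taken from the paper.

def track_displacements(track):
    """Return the per-step (dx, dy) displacement between consecutive frames."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(track, track[1:])]

def track_summary(track):
    """Summarize one track by its total path length and net displacement."""
    steps = track_displacements(track)
    path_length = sum(math.hypot(dx, dy) for dx, dy in steps)
    net = math.hypot(track[-1][0] - track[0][0], track[-1][1] - track[0][1])
    return {"path_length": path_length, "net_displacement": net}

# Two toy tracks over three frames (made-up coordinates).
tracks = [
    [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],  # moves steadily in one direction
    [(1.0, 1.0), (1.0, 2.0), (1.0, 1.0)],  # moves out and returns
]
summaries = [track_summary(t) for t in tracks]
```

Note that no pixel values appear anywhere in this input: the features depend only on where points move over time, which is the kind of image-free signal the abstract refers to as long-term motion information.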
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception
