Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception
Marcel Simon, Tae-Ho Kim, Seul-Ki Yeom

TL;DR
This paper proposes a video self-distillation method that trains single-image encoders on temporal cues from video. It improves geometry-aware perception without optical flow or tracking, a step toward physically plausible AI models.
Contribution
Introduces a video-distilled training approach for single-image encoders that injects 3D spatial and temporal priors without optical flow or tracking.
Findings
Raises mIoU on ADE20K from 35.0 (DoRA) to 36.4
Pre-training on a single 2-hour video enhances geometry-aware perception
The method remains a drop-in replacement for image-only pipelines
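For readers unfamiliar with the metric in the first finding, here is a minimal sketch of how mean Intersection-over-Union can be computed from flat lists of predicted and ground-truth class labels. The function name `mean_iou` and the choice to skip classes absent from both prediction and target are assumptions made for illustration; the paper's evaluation code may differ, for example in how it handles ignored pixels.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target.

    pred, target: equal-length sequences of integer class labels (one per pixel).
    Illustrative helper, not the paper's evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both (assumption)
            ious.append(intersection / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and four pixels.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Reported scores such as 35.0 and 36.4 are this quantity expressed as a percentage.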
Abstract
Self-supervised image encoders such as DINO have recently gained significant interest for learning robust visual features without labels. However, most SSL methods train on static images and miss the temporal cues inherent in videos. We introduce a video-distilled single-image encoder trained to predict the next-frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without optical flow or tracking. When pre-trained on a single 2-hour video, our approach raises the mean Intersection-over-Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop-in replacement for image-only pipelines. Our results highlight video self-distillation as a lightweight route to geometry-aware perception, an essential ingredient for physically plausible world models and Physical AI.
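The next-frame objective described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a DINO-style setup in which a student predicts, from frame t, the output a teacher produces for frame t+1, with a cross-entropy loss over temperature-scaled softmax outputs and a teacher updated as an exponential moving average of the student. The function names, temperatures, and momentum value are all hypothetical choices.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def next_frame_distillation_loss(student_pred_from_t, teacher_out_t_plus_1,
                                 student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's distribution for frame t+1
    (the target) and the student's prediction made from frame t.
    Temperatures are assumed values, not taken from the paper."""
    target = softmax(teacher_out_t_plus_1, teacher_temp)
    log_pred = log_softmax(student_pred_from_t, student_temp)
    return -sum(t * lp for t, lp in zip(target, log_pred))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average
    (assumed here, as is common in self-distillation)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Toy check: a student prediction that agrees with the teacher's next-frame
# output should incur a lower loss than one that disagrees.
teacher_next = [2.0, 0.0, -1.0]
loss_agree = next_frame_distillation_loss([2.0, 0.0, -1.0], teacher_next)
loss_disagree = next_frame_distillation_loss([-1.0, 0.0, 2.0], teacher_next)
```

Because the temporal signal enters only through the training target, the encoder still takes a single image as input at inference time, which is consistent with the drop-in claim in the abstract.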
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · 3D Shape Modeling and Analysis
