Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception

Marcel Simon; Tae-Ho Kim; Seul-Ki Yeom

arXiv:2507.19272 · cs.CV · July 28, 2025

TL;DR

This paper proposes a video self-distillation method in which a single-image encoder is trained to predict the next-frame representation from the current frame. The objective brings temporal cues from video into the encoder and improves geometry-aware perception without tracking or optical flow, a step toward physically plausible world models.

Contribution

A video-distilled training approach for single-image encoders that injects 3D spatial and temporal priors without optical flow or tracking.

Findings

01

Raises mIoU on ADE20K from 35.0 (DoRA) to 36.4, a gain of 1.4 points

02

Pre-training on a single 2-hour video is enough to improve geometry-aware perception

03

The resulting encoder remains a drop-in replacement for image-only pipelines
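
For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of how mean Intersection-over-Union is typically computed for semantic segmentation. It is not the paper's evaluation code. The function name `mean_iou`, the `ignore_index=255` convention, and the choice to average only over classes that appear in the prediction or the ground truth are assumptions made for illustration.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes; hypothetical helper, not the paper's code.

    pred and target hold integer class labels of the same shape.
    Pixels whose target equals ignore_index are skipped (assumed convention).
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # Drop pixels that carry no ground-truth label.
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection

    # Average only over classes that occur in pred or target.
    present = union > 0
    if not present.any():
        return float("nan")
    return float((intersection[present] / union[present]).mean())

# Toy example with two classes and four pixels.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their mean. Benchmarks usually report the value multiplied by 100, which is the scale of the 35.0 and 36.4 figures above.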

Abstract

Self-supervised image encoders such as DINO have recently gained significant interest for learning robust visual features without labels. However, most SSL methods train on static images and miss the temporal cues inherent in videos. We introduce a video-distilled single-image encoder trained to predict the next-frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without optical flow or tracking. When pre-training on a single 2-hour video, our approach raises the mean Intersection-over-Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop-in replacement for image-only pipelines. Our results highlight video self-distillation as a lightweight route to geometry-aware perception, an essential ingredient for physically plausible world models and Physical AI.
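
The abstract does not specify the loss or architecture, so the following is only a rough sketch of what a next-frame self-distillation step could look like, assuming a DINO-style setup: a student encodes frame t, a teacher that is an exponential moving average (EMA) of the student encodes frame t+1, and the student is trained with a cross-entropy between temperature-scaled softmax outputs. The function names, the temperatures, the momentum value, and the toy linear "encoders" are all assumptions for illustration, and DINO's output centering is omitted for brevity.

```python
import numpy as np

def softmax(x, temp):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(x, dtype=float) / temp
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x, temp):
    """Temperature-scaled log-softmax over the last axis."""
    z = np.asarray(x, dtype=float) / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def next_frame_loss(student_out_t, teacher_out_t1,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher targets for frame t+1 and the
    student's prediction made from frame t (assumed loss form)."""
    targets = softmax(teacher_out_t1, teacher_temp)  # treated as constants
    log_probs = log_softmax(student_out_t, student_temp)
    return float(-(targets * log_probs).sum(axis=-1).mean())

def ema_update(teacher_w, student_w, momentum=0.996):
    """Move teacher weights slowly toward the student weights."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

# Toy setup: 5 consecutive "frames" as random 16-d vectors, and a linear
# map standing in for the image encoder plus projection head.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 16))
student_w = rng.normal(size=(16, 8))
teacher_w = student_w.copy()

# Student sees frames 0..3, teacher sees the following frames 1..4.
loss = next_frame_loss(frames[:-1] @ student_w, frames[1:] @ teacher_w)

# After a (not shown) gradient step on the student, refresh the teacher.
teacher_w = ema_update(teacher_w, student_w)
```

Because only single frames pass through the student, the trained encoder would take one image at inference time, which is consistent with the drop-in claim in the abstract. The gradient step itself is left out here since it depends on the training framework.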

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · 3D Shape Modeling and Analysis