VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization
Seul-Ki Yeom, Marcel Simon, Eunbin Lee, and Tae-Ho Kim

TL;DR
VINO introduces a self-supervised learning framework that leverages dense video data and structural priors to learn object-centric image representations, effectively disentangling foreground objects from background context.
Contribution
VINO is a teacher-student SSL method that combines a structural information bottleneck with cross-time distillation to improve non-contextual object representation learning from videos.
Findings
Achieves 34.8% CorLoc on PASCAL VOC, outperforming prior methods.
Effectively disentangles foreground objects from background cues.
Produces shape-biased, focused object representations.
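For readers unfamiliar with the CorLoc figure quoted above, the following is a minimal sketch of how the metric is commonly computed in the object-discovery literature: an image counts as correctly localized when its single predicted box overlaps at least one ground-truth box with IoU at or above a threshold, usually 0.5. The function names, the `(x1, y1, x2, y2)` box format, and the use of `>=` rather than `>` are assumptions made for illustration. The summary does not state the paper's exact evaluation protocol.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose predicted box hits any ground-truth box.

    predictions:   one predicted box per image
    ground_truths: one list of ground-truth boxes per image
    threshold:     assumed IoU cutoff (0.5 is the usual convention)
    """
    hits = sum(
        any(iou(pred, gt) >= threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return hits / len(predictions)


# Toy data: the first image is localized, the second is missed.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
score = corloc(preds, gts)
```

On this toy input one of the two images is localized, so `score` is 0.5.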
Abstract
Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts such as background textures and co-occurrence statistics. While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views, not as semantic pseudo-labels, VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned…
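As a rough illustration of the teacher's input described in the abstract, the sketch below builds a foreground-union view by merging class-agnostic masks and suppressing everything outside their union. The function name, the boolean `(N, H, W)` mask layout, and the constant `fill_value` are assumptions made for this example. The summary does not say how the paper represents the structural prior or how it suppresses the background, and the student's view is not sketched because its description is truncated.

```python
import numpy as np


def foreground_union_view(image, masks, fill_value=0.0):
    """Keep only pixels covered by at least one mask.

    image:      float array of shape (H, W, C)
    masks:      bool array of shape (N, H, W), one class-agnostic mask each
    fill_value: assumed constant written into suppressed background pixels
    """
    union = np.any(masks, axis=0)  # (H, W): True where any mask is set
    # Broadcast the 2-D union over the channel axis and fill the rest.
    return np.where(union[..., None], image, fill_value)


# Toy 2x2 RGB image with two single-pixel masks.
image = np.ones((2, 2, 3))
masks = np.zeros((2, 2, 2), dtype=bool)
masks[0, 0, 0] = True  # first mask covers the top-left pixel
masks[1, 1, 1] = True  # second mask covers the bottom-right pixel
view = foreground_union_view(image, masks)
```

In `view`, the two masked pixels keep their original values and the other two are set to `fill_value`.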
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
