VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

Seul-Ki Yeom, Marcel Simon, Eunbin Lee, and Tae-Ho Kim

arXiv:2603.07222 · cs.CV · March 10, 2026

Open Access

TL;DR

VINO introduces a self-supervised learning framework that leverages dense video data and structural priors to learn object-centric image representations, effectively disentangling foreground objects from background context.

Contribution

VINO proposes a novel teacher-student SSL method with a structural information bottleneck and cross-time distillation to improve non-contextual object representation learning from videos.

Findings

1. Achieves 34.8% CorLoc on PASCAL VOC, outperforming prior methods.

2. Effectively disentangles foreground objects from background cues.

3. Produces shape-biased, focused object representations.
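The CorLoc figure in the first finding is a localization metric. As it is commonly defined in the object-discovery literature, it is the fraction of images in which the predicted box overlaps at least one ground-truth box with an intersection-over-union of 0.5 or more. The sketch below illustrates that common definition only; the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the one-box-per-image assumption are illustrative choices, and the paper's exact evaluation protocol is not stated in this listing.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose predicted box matches any ground-truth box.

    predictions:   one box per image (assumed single-box protocol)
    ground_truths: one list of boxes per image
    """
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) >= threshold for gt in gts)
    )
    return hits / len(predictions)


# Toy data: the first prediction matches exactly, the second misses entirely.
toy_predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
toy_ground_truths = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
print(corloc(toy_predictions, toy_ground_truths))  # 0.5
```

Under this definition, a reported value of 34.8% would mean that roughly one image in three has a correctly localized object.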

Abstract

Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts: background textures and co-occurrence statistics. While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views (not as semantic pseudo-labels), VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned…
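To make the teacher side of this setup concrete, the sketch below builds a foreground-union view by keeping only pixels covered by at least one class-agnostic mask, and pairs it with a generic teacher-to-student cross-entropy. Everything here is an assumption for illustration: the function names, the zero fill for suppressed background, the temperatures, and the DINO-style loss are not taken from the paper. The student's view is not reproduced because its description is cut off in this listing.

```python
import numpy as np


def foreground_union_view(image, masks, fill_value=0.0):
    """Suppress the background of an image using class-agnostic masks.

    image: float array of shape (H, W, C)
    masks: bool array of shape (K, H, W), one mask per proposed object
    Pixels outside the union of all masks are replaced by fill_value
    (a hypothetical choice; the paper's suppression scheme may differ).
    """
    union = np.any(masks, axis=0)  # (H, W), True where any mask fires
    return np.where(union[..., None], image, fill_value)


def softmax(logits, temperature):
    """Temperature-scaled softmax over the last axis, numerically stable."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy from teacher to student distributions.

    A generic self-distillation objective used here as a stand-in;
    it is not claimed to be the loss used by VINO.
    """
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())


# Toy example: two single-pixel masks on a 4x4 RGB image of ones.
toy_image = np.ones((4, 4, 3))
toy_masks = np.zeros((2, 4, 4), dtype=bool)
toy_masks[0, 0, 0] = True
toy_masks[1, 3, 3] = True
teacher_view = foreground_union_view(toy_image, toy_masks)
```

In a full pipeline the teacher encoder would consume `teacher_view`, and the loss would compare its output with the student's output on the student's own view, with gradients flowing only through the student.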

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications