Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

Yuting Tan; Xilong Cheng; Yunxiao Qin; Zhengnan Li; Jingjing Zhang

arXiv:2603.13912·cs.CV·March 17, 2026

TL;DR

This paper introduces EgoViT, a vision Transformer framework that learns stable, persistent object representations from unlabeled egocentric videos by combining intra-frame learning, geometric grounding, and temporal consistency.

Contribution

EgoViT is the first unified Transformer-based approach that jointly discovers and stabilizes object representations in uncurated first-person videos without manual annotations.

Findings

01. Achieves +8.0% CorLoc in unsupervised object discovery

02. Improves mIoU by +4.8% in semantic segmentation

03. Demonstrates robustness to varied geometric priors
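
For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how CorLoc and mIoU are commonly computed. It is not taken from the paper: the function names, the single-predicted-box-per-image convention, and the 0.5 IoU threshold are assumptions based on the usual definitions, and the authors' exact evaluation protocol may differ.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(pred_boxes, gt_boxes_per_image, thresh=0.5):
    """Fraction of images whose predicted box matches any ground-truth box.

    Assumption: one predicted box per image, counted as correct when its
    IoU with at least one ground-truth box is >= thresh.
    """
    hits = [
        any(box_iou(p, g) >= thresh for g in gts)
        for p, gts in zip(pred_boxes, gt_boxes_per_image)
    ]
    return sum(hits) / len(hits)


def mean_iou(pred, gt, num_classes):
    """Mean per-class IoU between two integer label maps of equal shape.

    Classes absent from both maps are skipped rather than counted as 0.
    """
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```

Under these definitions, a reported gain such as "+8.0% CorLoc" would mean eight more images per hundred are localized correctly, though the baseline being compared against is not stated on this page.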

Abstract

Humans develop visual intelligence by perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses the challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these…

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications