Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang

TL;DR
This paper introduces EgoViT, a vision Transformer framework that learns stable, persistent object representations from unlabeled egocentric videos by combining intra-frame learning, geometric grounding, and temporal consistency.
Contribution
EgoViT is the first unified Transformer-based approach that jointly discovers and stabilizes object representations in uncurated first-person videos without manual annotations.
Findings
Achieves +8.0% CorLoc in unsupervised object discovery
Improves mIoU by +4.8% in semantic segmentation
Demonstrates robustness to varied geometric priors
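The two metrics named in the findings can be sketched as follows. This is a minimal illustration of the commonly used definitions, not the paper's evaluation code: the function names, the (x1, y1, x2, y2) box format, and the 0.5 IoU threshold for CorLoc are assumptions, and the exact protocol behind the reported numbers is not stated here.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose single predicted box overlaps any
    ground-truth box with IoU >= threshold (0.5 is an assumed default)."""
    if not predictions:
        return 0.0
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(box_iou(pred, gt) >= threshold for gt in gts)
    )
    return hits / len(predictions)


def mean_iou(pred_labels, true_labels, num_classes):
    """Mean per-class IoU over flat label sequences; classes absent from
    both prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred_labels, true_labels) if p == c and t == c)
        union = sum(1 for p, t in zip(pred_labels, true_labels) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, `corloc` takes one predicted box per image and a list of ground-truth boxes per image, while `mean_iou` takes flattened per-pixel label sequences for prediction and ground truth.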
Abstract
Humans develop visual intelligence by perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses the challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these…
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
