Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo

TL;DR
CroBo is a self-supervised learning framework that encodes scene semantics and spatial layout into a compact visual state, with the aim of improving robot policy learning and scene understanding from streaming video.
Contribution
The paper introduces CroBo, a global-to-local reconstruction method that explicitly encodes what-is-where in visual states for robotic perception.
Findings
Achieves state-of-the-art results on robot policy benchmarks.
Preserves pixel-level scene composition in learned representations.
Encodes scene dynamics and element interactions effectively.
Abstract
For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck…
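The global-to-local objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the encoder and decoder are toy stand-ins (a per-channel mean as the "bottleneck token", and a constant-color fill for masked patches), the target crop is assumed to come from the same frame as the reference, and the names `global_to_local_loss`, `toy_encode`, `toy_decode`, as well as the patch size and 0.9 mask ratio, are hypothetical choices made for illustration only. The sketch shows just the data flow: compress the reference to one vector, hide most patches of a local crop, and score the reconstruction on the hidden patches alone.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (N, p*p*C) flattened non-overlapping patches."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, p, p, C)
    return x.reshape(-1, p * p * c)

def random_mask(n, mask_ratio, rng):
    """Boolean mask of length n; True marks a patch hidden from the decoder."""
    n_masked = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True
    return mask

def toy_encode(reference):
    """Stand-in encoder (assumption): one vector for the whole reference frame."""
    return reference.mean(axis=(0, 1))        # shape (C,)

def toy_decode(token, visible, mask):
    """Stand-in decoder (assumption): fill each masked patch with the token's color."""
    pixels_per_patch = visible.shape[1] // token.shape[0]
    pred = visible.copy()
    pred[mask] = np.tile(token, pixels_per_patch)  # channel is the fastest axis
    return pred

def global_to_local_loss(reference, crop_box, encode, decode,
                         patch=4, mask_ratio=0.9, seed=0):
    """MSE over masked patches of a local crop, given a global token."""
    rng = np.random.default_rng(seed)
    top, left, size = crop_box
    target = reference[top:top + size, left:left + size]   # local target crop
    target_patches = patchify(target, patch)
    mask = random_mask(len(target_patches), mask_ratio, rng)
    token = encode(reference)                               # global bottleneck
    visible = np.where(mask[:, None], 0.0, target_patches)  # zero out hidden patches
    pred = decode(token, visible, mask)
    err = (pred - target_patches) ** 2
    return err[mask].mean(), mask                           # loss on masked patches only

if __name__ == "__main__":
    frame = np.random.default_rng(1).random((32, 32, 3))
    loss, mask = global_to_local_loss(frame, (8, 8, 16), toy_encode, toy_decode)
    print(f"masked {mask.sum()} of {mask.size} patches, loss = {loss:.4f}")
```

With a learned encoder and decoder in place of the stand-ins, the loss could only be driven down if the single token carried enough information about both content and location to fill in the hidden regions of an arbitrary crop, which is the intuition behind the what-is-where argument.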
Taxonomy
Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
