Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data
Jialiang Li, Yi Qiao, Yunhan Guo, Changwen Chen, Wenzhao Lian

TL;DR
This paper introduces CoMe-VLA, a novel framework that leverages large-scale egocentric human data to enable robots to perform versatile active perception and manipulation in complex, unconstrained environments.
Contribution
It formalizes non-Markovian active perception and presents a new cognitive, memory-aware vision-language-action framework trained on extensive human data.
Findings
Demonstrates robustness across diverse long-horizon tasks
Shows adaptability in multiple active perception scenarios
Achieves effective autonomous sub-task transitions
Abstract
Achieving generalizable manipulation in unconstrained environments requires the robot to proactively resolve information uncertainty, that is, to perform active perception. However, existing methods are often confined to a limited range of sensing behaviors, which restricts their applicability in complex environments. In this work, we formalize active perception as a non-Markovian process driven by information gain and decision branching, providing a structured categorization of visual active perception paradigms. Building on this perspective, we introduce CoMe-VLA, a cognitive and memory-aware vision-language-action (VLA) framework that leverages large-scale human egocentric data to learn versatile exploration and manipulation priors. Our framework integrates a cognitive auxiliary head for autonomous sub-task transitions and a dual-track memory system to maintain consistent self and…
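The abstract names information gain as a driver of active perception without defining it here. As a rough illustration only, the sketch below computes the textbook expected information gain (prior entropy minus expected posterior entropy) of a sensing action over a discrete belief. It is not the paper's formulation: the function names, the two-state belief, and the "look"/"stay" actions are all hypothetical, and the belief vector merely stands in for whatever summary of the observation history a non-Markovian policy would carry.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_information_gain(prior, likelihood):
    """Expected entropy reduction from one sensing action.

    prior[s]          -- belief over hidden states (hypothetical history summary)
    likelihood[s][o]  -- P(observation o | state s) under the chosen action
    """
    gain = entropy(prior)
    for o in range(len(likelihood[0])):
        # Marginal probability of seeing observation o.
        p_o = sum(prior[s] * likelihood[s][o] for s in range(len(prior)))
        if p_o == 0:
            continue
        # Bayes update of the belief given observation o.
        post = [prior[s] * likelihood[s][o] / p_o for s in range(len(prior))]
        gain -= p_o * entropy(post)
    return gain

# Hypothetical example: is the target object in the left or right drawer?
prior = [0.5, 0.5]
actions = {
    "look": [[0.9, 0.1], [0.1, 0.9]],  # informative viewpoint change
    "stay": [[0.5, 0.5], [0.5, 0.5]],  # observation independent of the state
}
gains = {name: expected_information_gain(prior, lik) for name, lik in actions.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy setting the uninformative action yields zero gain, so a gain-driven agent would choose to look before acting.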
Taxonomy
Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Multimodal Machine Learning Applications
