I-Perceive: A Foundation Model for Active Perception with Language Instructions
Yongxi Huang, Zhuohang Wang, Wenjing Tang, Cewu Lu, Panpan Cai

TL;DR
I-Perceive is a foundation model enabling robots to actively adjust their viewpoints based on natural language instructions, combining semantic and geometric understanding to operate effectively in large-scale indoor environments.
Contribution
The paper introduces I-Perceive, a novel foundation model that integrates vision-language and geometric reasoning for active perception conditioned on open-ended language instructions.
Findings
Outperforms state-of-the-art VLMs in prediction accuracy
Demonstrates strong zero-shot generalization to new scenes and tasks
Effectively follows open-ended language instructions for active perception
Abstract
Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. While critical for downstream tasks such as manipulation, existing approaches have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. I-Perceive predicts camera views that follow open-ended language instructions, based on image-based scene contexts. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
