I-Perceive: A Foundation Model for Active Perception with Language Instructions

Yongxi Huang; Zhuohang Wang; Wenjing Tang; Cewu Lu; Panpan Cai

arXiv:2603.00600 · cs.RO · March 3, 2026

TL;DR

I-Perceive is a foundation model enabling robots to actively adjust their viewpoints based on natural language instructions, combining semantic and geometric understanding to operate effectively in large-scale indoor environments.

Contribution

The paper introduces I-Perceive, a novel foundation model that integrates vision-language and geometric reasoning for active perception conditioned on open-ended language instructions.

Findings

01 Outperforms state-of-the-art VLMs in prediction accuracy

02 Demonstrates strong zero-shot generalization to new scenes and tasks

03 Effectively follows open-ended language instructions for active perception

Abstract

Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. While critical for downstream tasks such as manipulation, existing approaches have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. I-Perceive predicts camera views that follow open-ended language instructions, based on image-based scene contexts. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI