See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent
Tianci Tang, Tielong Cai, Hongwei Wang, Gaoang Wang

TL;DR
Sea$^2$ introduces an active perception framework that adapts the deployment of frozen perception models via an intelligent agent, improving performance in novel indoor scenes without retraining or scene-specific annotations.
Contribution
It proposes a novel paradigm that uses a pose-control agent to adapt perception model deployment, avoiding retraining and scene-specific annotations in cross-domain visual tasks.
Findings
Achieved 13.54% improvement in visual grounding
Achieved 15.92% improvement in segmentation
Achieved 27.68% improvement in 3D box estimation
Abstract
Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data, which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea$^2$ (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea$^2$ keeps all perception modules frozen, requires no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specifically, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes,…
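The core idea of keeping the perception model frozen and moving the camera instead can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 2D pose format, the `ACTIONS` set, the `frozen_score` stand-in, and the greedy search itself, which replaces the paper's trained VLM pose controller with a simple hill-climb. It also assumes the scalar feedback can be queried at candidate poses, as in a simulator.

```python
from typing import Callable, Dict, List, Tuple

Pose = Tuple[float, float]  # hypothetical (x, y) camera position

# Hypothetical discrete pose actions: name -> (dx, dy)
ACTIONS: Dict[str, Pose] = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "forward": (0.0, 1.0),
    "back": (0.0, -1.0),
}


def frozen_score(pose: Pose) -> float:
    """Stand-in for a frozen perception model's scalar feedback.

    Assumption: feedback peaks at one informative viewpoint, here (3, 2).
    The model itself is never updated; only the pose changes.
    """
    x, y = pose
    return 1.0 / (1.0 + (x - 3.0) ** 2 + (y - 2.0) ** 2)


def greedy_viewpoint_search(
    start: Pose,
    score: Callable[[Pose], float],
    max_steps: int = 20,
) -> Tuple[Pose, float, List[Pose]]:
    """Move to whichever neighbouring pose scores highest; stop at a local optimum."""
    pose = start
    best = score(pose)
    trajectory = [pose]
    for _ in range(max_steps):
        candidates = [(pose[0] + dx, pose[1] + dy) for dx, dy in ACTIONS.values()]
        nxt = max(candidates, key=score)
        nxt_score = score(nxt)
        if nxt_score <= best:
            break  # no action improves the scalar feedback
        pose, best = nxt, nxt_score
        trajectory.append(pose)
    return pose, best, trajectory


final_pose, final_score, path = greedy_viewpoint_search((0.0, 0.0), frozen_score)
print(final_pose, final_score, len(path))
```

A learned controller would choose actions from observations rather than by probing every neighbouring pose, but the feedback signal plays the same role: it rewards viewpoints where the unchanged perception model performs well.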
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
