Selective Visual Representations Improve Convergence and Generalization for Embodied AI
Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ali Farhadi, Ani Kembhavi, Ranjay Krishna

TL;DR
This paper introduces a task-conditioned filtering method that uses a small learnable codebook to keep only the task-relevant parts of a visual representation, leading to better convergence, generalization, and performance in embodied AI tasks across multiple benchmarks.
Contribution
A novel, parameter-efficient approach employing a learnable codebook to filter visual inputs based on task relevance, enhancing embodied AI performance and generalization.
Findings
Achieves state-of-the-art results in object goal navigation and object displacement.
Filtered representations generalize better across different simulation environments.
Agents explore environments more effectively and focus on task-relevant visual cues.
Abstract
Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise within the learning process and distracts the agent's focus from task-relevant visual cues. Inspired by selective attention in humans (the process through which people filter their perception based on their experiences, knowledge, and the task at hand), we introduce a parameter-efficient approach to filter visual stimuli for embodied AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments…
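The bottleneck described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one plausible reading of a "codebook filter": the visual and task embeddings are concatenated, scored against a small set of codes, and the softmax-weighted mixture of those codes is projected back up for the policy. The class name CodebookFilter, the helper functions, and every dimension are hypothetical. The weights are random here, whereas the abstract states that the codebook is trained jointly to optimize task reward.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class CodebookFilter:
    """Hypothetical task-conditioned bottleneck (illustrative assumption only)."""

    def __init__(self, in_dim, num_codes, code_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.code_dim = code_dim
        # Maps the joint (visual + task) embedding to one score per code.
        self.score_proj = random_matrix(num_codes, in_dim, rng)
        # The codebook itself: num_codes low-dimensional entries.
        self.codes = random_matrix(num_codes, code_dim, rng)
        # Projects the low-dimensional mixture back up for the policy.
        self.up_proj = random_matrix(out_dim, code_dim, rng)

    def __call__(self, visual, task):
        joint = list(visual) + list(task)
        # Attention-like weights over the codes, conditioned on the task.
        weights = softmax(matvec(self.score_proj, joint))
        # Convex combination of codes: this is the narrow bottleneck.
        mixture = [
            sum(w * code[j] for w, code in zip(weights, self.codes))
            for j in range(self.code_dim)
        ]
        return matvec(self.up_proj, mixture), weights


# Same visual input, two different task embeddings.
filt = CodebookFilter(in_dim=6, num_codes=4, code_dim=2, out_dim=3)
visual = [0.1, 0.2, 0.3, 0.4]
out_a, weights_a = filt(visual, [1.0, 0.0])
out_b, weights_b = filt(visual, [0.0, 1.0])
```

Because the code weights depend on the task embedding, the same observation yields a different filtered representation for each task, and whatever cannot be expressed through the few codes is discarded before it reaches the policy.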
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
Methods: Contrastive Language-Image Pre-training · Focus
