Selective Visual Representations Improve Convergence and Generalization for Embodied AI

Ainaz Eftekhar; Kuo-Hao Zeng; Jiafei Duan; Ali Farhadi; Ani Kembhavi; Ranjay Krishna

arXiv:2311.04193·cs.CV·March 12, 2024·2 cites

TL;DR

This paper introduces a task-conditioned filtering method using a learnable codebook to improve visual representation relevance, leading to better convergence, generalization, and performance in embodied AI tasks across multiple benchmarks.

Contribution

A novel, parameter-efficient approach employing a learnable codebook to filter visual inputs based on task relevance, enhancing embodied AI performance and generalization.

Findings

01. Achieves state-of-the-art results in object goal navigation and object displacement.

02. Filtered representations generalize better across different simulation environments.

03. Agents explore environments more effectively and focus on task-relevant visual cues.

Abstract

Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise into the learning process and distracts the agent from task-relevant visual cues. Inspired by selective attention in humans, the process through which people filter their perception based on their experiences, knowledge, and the task at hand, we introduce a parameter-efficient approach to filter visual stimuli for embodied AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments…
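To make the codebook idea concrete, here is a minimal sketch of one way such a bottleneck could be wired up. It is not the authors' implementation. The class name `CodebookBottleneck`, the sizes, the softmax weighting over codes, and the linear projection back to the input width are all assumptions chosen for illustration. The weights are randomly initialized; the joint training against task reward that the abstract describes is omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class CodebookBottleneck:
    """Hypothetical task-conditioned bottleneck over a small set of codes."""

    def __init__(self, input_dim, num_codes, code_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.code_dim = code_dim
        # Maps the task-conditioned embedding to one logit per code (assumed).
        self.score_weights = init(num_codes, input_dim)
        # The codebook itself: num_codes latent codes of width code_dim.
        self.codes = init(num_codes, code_dim)
        # Projects the mixed code back to the original width (assumed).
        self.out_weights = init(input_dim, code_dim)

    def forward(self, embedding):
        # Probability of selecting each code, given the visual + goal embedding.
        probs = softmax(matvec(self.score_weights, embedding))
        # Convex combination of codes: the low-dimensional bottleneck.
        mixed = [
            sum(p * code[j] for p, code in zip(probs, self.codes))
            for j in range(self.code_dim)
        ]
        # Filtered representation handed to the downstream policy.
        return matvec(self.out_weights, mixed), probs


# Toy inputs: a 4-d "visual" feature concatenated with a 2-d "goal" feature.
visual = [0.5, -1.2, 0.3, 0.8]
goal = [1.0, 0.0]
bottleneck = CodebookBottleneck(input_dim=6, num_codes=8, code_dim=3)
filtered, probs = bottleneck.forward(visual + goal)
```

Because the output is built only from a weighted mix of a few short codes, whatever reaches the policy has to pass through that narrow space. In this sketch, that is what lets the module act as a filter once the weights are trained against the task objective.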

Videos

Selective Visual Representations Improve Convergence and Generalization for Embodied AI · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)

Methods: Contrastive Language-Image Pre-training · Focus