Simple but Effective: CLIP Embeddings for Embodied AI
Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, Aniruddha Kembhavi

TL;DR
This paper demonstrates that CLIP embeddings, used in simple baseline models without task-specific modifications, significantly improve performance on Embodied AI tasks like ObjectNav and rearrangement, surpassing more complex prior methods.
Contribution
The authors introduce EmbCLIP, a straightforward approach leveraging CLIP embeddings for Embodied AI, achieving state-of-the-art results without task-specific architectures or auxiliary training.
Findings
EmbCLIP outperforms existing methods on RoboTHOR ObjectNav by 20 points in Success Rate.
It tops the iTHOR 1-Phase Rearrangement leaderboard, more than doubling the % Fixed Strict metric (0.08 to 0.17).
CLIP representations encode semantic primitives more effectively than ImageNet-pretrained backbones.
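To make the two headline metrics above concrete, here is a minimal Python sketch of how Success Rate and % Fixed Strict could be computed from per-episode outcomes. It is an illustration, not the leaderboards' evaluation code. The names `ObjectOutcome`, `success_rate`, and `fixed_strict` are hypothetical. The % Fixed Strict definition used here is an assumption: the fraction of initially misplaced objects that end in the correct state, set to zero if any object that started in the correct state ends out of place.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ObjectOutcome:
    """Hypothetical per-object record for one rearrangement episode."""
    misplaced_at_start: bool  # object needed fixing when the episode began
    correct_at_end: bool      # object is in its goal state when the episode ends


def success_rate(episode_successes: List[bool]) -> float:
    """Fraction of episodes that ended in success (0.0 for an empty list)."""
    if not episode_successes:
        return 0.0
    return sum(episode_successes) / len(episode_successes)


def fixed_strict(objects: List[ObjectOutcome]) -> float:
    """Assumed definition of % Fixed Strict for a single episode.

    Returns the fraction of initially misplaced objects that end up correct,
    or 0.0 if the agent disturbed any object that started out correct.
    """
    # Strict penalty (assumption): disturbing a correct object zeroes the score.
    if any(not o.misplaced_at_start and not o.correct_at_end for o in objects):
        return 0.0
    misplaced = [o for o in objects if o.misplaced_at_start]
    if not misplaced:
        return 0.0
    return sum(o.correct_at_end for o in misplaced) / len(misplaced)


if __name__ == "__main__":
    episode = [
        ObjectOutcome(misplaced_at_start=True, correct_at_end=True),
        ObjectOutcome(misplaced_at_start=True, correct_at_end=False),
        ObjectOutcome(misplaced_at_start=False, correct_at_end=True),
    ]
    print(fixed_strict(episode))
    print(success_rate([True, False, True, True]))
```

Under this assumed definition, a score of 0.17 would mean that, averaged over episodes, about one in six misplaced objects is restored without disturbing anything else.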
Abstract
Contrastive Language-Image Pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks, from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task-specific architectures, inductive biases (such as the use of semantic maps), auxiliary tasks during training, or depth maps, yet we find that our improved baselines perform very well across a range of tasks and simulators. EmbCLIP tops the RoboTHOR ObjectNav leaderboard by a huge margin of 20 pts (Success Rate). It tops the iTHOR 1-Phase Rearrangement leaderboard, beating the next best submission, which employs Active Neural Mapping, and more than doubling the % Fixed Strict metric (0.08 to 0.17). It also beats the winners of the 2021 Habitat ObjectNav…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
Methods: Contrastive Language-Image Pre-training
