ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello

TL;DR
This paper proposes a method that uses pre-trained CLIP encoders as an auxiliary module to improve generalization in the ALFRED task, especially for unseen environments and objects.
Contribution
It integrates CLIP as an additional module, trained through an auxiliary object detection objective, rather than using CLIP to replace the visual encoder. Applied to the Episodic Transformer architecture, this improves performance on the unseen validation set.
Findings
Improved task performance on the unseen validation set.
CLIP helps with object descriptions and small object detection.
Enhanced interpretation of rare words.
Abstract
We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
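The abstract describes CLIP as an extra module attached through an auxiliary object detection objective, not as a replacement for the visual encoder. The sketch below illustrates, under stated assumptions, how such an auxiliary term might be added to a main action-prediction loss. It is not the authors' implementation. The embeddings are plain lists standing in for CLIP image and text encoder outputs, and the function names, the `temperature` value, and the `aux_weight` value are all hypothetical.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single prediction (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def auxiliary_object_loss(frame_embedding, object_text_embeddings,
                          target_object, temperature=0.1):
    """CLIP-style scoring of one frame against candidate object names.

    Assumption: frame_embedding stands in for a CLIP image embedding and
    object_text_embeddings for CLIP text embeddings of object names.
    """
    logits = [cosine_similarity(frame_embedding, text) / temperature
              for text in object_text_embeddings]
    return cross_entropy(logits, target_object)


def total_loss(action_logits, target_action, frame_embedding,
               object_text_embeddings, target_object, aux_weight=0.5):
    """Main action loss plus a weighted auxiliary object detection term.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    main = cross_entropy(action_logits, target_action)
    aux = auxiliary_object_loss(frame_embedding, object_text_embeddings,
                                target_object)
    return main + aux_weight * aux


# Toy usage: two candidate actions, two candidate objects.
frame = [1.0, 0.0]
object_texts = [[1.0, 0.0], [0.0, 1.0]]
loss = total_loss([2.0, 0.0], 0, frame, object_texts, 0)
```

Setting `aux_weight` to zero recovers the main loss alone, which reflects the idea that the auxiliary objective supplements the base model without replacing any of its components.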
Taxonomy
Topics: Security and Verification in Computing · Software Testing and Debugging Techniques · Real-Time Systems Scheduling
Methods: Softmax · Layer Normalization · Contrastive Language-Image Pre-training · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Attention Is All You Need · Linear Layer
