ET tu, CLIP? Addressing Common Object Errors for Unseen Environments

Ye Won Byun; Cathy Jiao; Shahriar Noroozizadeh; Jimin Sun; Rosa Vitiello

arXiv:2406.17876·cs.CV·June 27, 2024·1 cites

Open Access

TL;DR

This paper proposes a method that uses pre-trained CLIP encoders as an auxiliary module to improve generalization in the ALFRED task, especially for unseen environments and objects.

Contribution

Rather than replacing the visual encoder with CLIP, the paper adds CLIP as an extra module trained through an auxiliary object detection objective, which improves model performance on unseen data.

Findings

01. Improved task performance on the unseen validation set.
02. CLIP helps with object descriptions and small object detection.
03. Enhanced interpretation of rare words.

Abstract

We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
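
To make the idea of an auxiliary objective concrete, here is a minimal sketch of how a main action loss and an object detection loss could be combined. It is not the authors' implementation: the abstract does not give the loss formulation, so the weighted sum, the cosine-similarity scoring, the `aux_weight` value, the temperature of 0.07, and every function name below are assumptions made for illustration. Plain Python lists stand in for the model state and for CLIP text embeddings of object names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(softmax(logits)[target_index])


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def object_logits(state_vec, object_text_embs, temperature=0.07):
    """Score each candidate object by temperature-scaled cosine similarity.

    state_vec is a hypothetical projection of the agent's state;
    object_text_embs are stand-ins for CLIP text embeddings (assumption).
    """
    return [cosine(state_vec, emb) / temperature for emb in object_text_embs]


def total_loss(action_logits, action_target, state_vec,
               object_text_embs, object_target, aux_weight=0.5):
    """Main action loss plus a weighted auxiliary object loss (assumed form)."""
    main = cross_entropy(action_logits, action_target)
    aux = cross_entropy(object_logits(state_vec, object_text_embs), object_target)
    return main + aux_weight * aux, main, aux


# Toy example: two candidate objects, state aligned with the first one.
embs = [[1.0, 0.0], [0.0, 1.0]]
loss, main, aux = total_loss([2.0, 0.5, 0.1], 0, [1.0, 0.0], embs, 0)
print(f"total={loss:.4f} main={main:.4f} aux={aux:.6f}")
```

In this sketch the auxiliary term only adds a training signal; the action prediction path is unchanged, which mirrors the abstract's framing of CLIP as an additional module rather than a replacement encoder.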

Taxonomy

Topics: Security and Verification in Computing · Software Testing and Debugging Techniques · Real-Time Systems Scheduling

Methods: Softmax · Layer Normalization · Contrastive Language-Image Pre-training · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Attention Is All You Need · Linear Layer