WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

Jiawei Ma; Yulei Niu; Shiyuan Huang; Guangxing Han; Shih-Fu Chang

arXiv:2405.18405·cs.CV·May 29, 2024

Open Access

TL;DR

WIDIn is a self-supervised framework that learns domain-invariant visual representations by aligning each image embedding with a fine-grained language embedding. It improves generalization to unseen domains while training on a single source domain and using no test-domain data.

Contribution

The paper introduces WIDIn, a novel self-supervision method that leverages language embeddings to disentangle visual features for better domain generalization in single-source settings.

Findings

01. Effective on three domain generalization datasets

02. Works with pretrained vision-language and uni-modal models

03. Improves domain-invariant representation quality

Abstract

Language has been useful in extending the vision encoder to data from diverse distributions without empirical discovery in training domains. However, as the image description is mostly at a coarse-grained level and ignores visual details, the resulting embeddings are still ineffective in overcoming the complexity of domains at inference time. We present a self-supervision framework, WIDIn (Wording Images for Domain-Invariant representation), to disentangle discriminative visual representation by leveraging only data in a single domain and without any test prior. Specifically, for each image, we first estimate the language embedding with fine-grained alignment, which can then be used to adaptively identify and remove the domain-specific counterpart from the raw visual embedding. WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal…
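The abstract does not spell out how the domain-specific part is identified and removed, so the following is only a rough illustration of the general idea, not the paper's procedure. It swaps in a plain orthogonal projection: the visual embedding is split into a component along a given language embedding and a residual, and the residual is treated as the "domain-specific" part. The function name `split_embedding`, the variable names, and the toy vectors are all hypothetical; in WIDIn the language embedding is estimated per image and the removal is adaptive, which this sketch does not attempt to reproduce.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_embedding(visual, language):
    """Split `visual` into a part along `language` and a residual.

    Assumption for illustration only: the residual stands in for the
    domain-specific component mentioned in the abstract.
    """
    scale = dot(visual, language) / dot(language, language)
    shared = [scale * t for t in language]               # component aligned with the language embedding
    residual = [v - s for v, s in zip(visual, shared)]   # what the language embedding does not explain
    return shared, residual


# Toy 3-dimensional embeddings (made-up numbers).
visual = [0.6, 0.8, 0.2]
language = [1.0, 1.0, 0.0]

shared, residual = split_embedding(visual, language)
```

By construction the residual is orthogonal to the language embedding, and adding the two parts back together recovers the original visual embedding, so nothing is lost by the split itself; only the choice of which part to discard is a modeling decision.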

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Batch Normalization · Attention Dropout · Linear Layer · InfoNCE · Adam