LARE: Latent Augmentation using Regional Embedding with Vision-Language Model

Kosuke Sakurai; Tatsuya Ishii; Ryotaro Shimizu; Linxin Song; Masayuki Goto

arXiv:2409.12597·cs.CV·September 20, 2024

TL;DR

LARE introduces a regional embedding approach for vision-language models, enabling effective data augmentation across unseen domains and improving image classification accuracy in diverse conditions.

Contribution

The paper proposes LARE, a novel regional embedding method that enhances vision-language models' domain adaptation and classification performance through latent region sampling.

Findings

01. LARE outperforms previous models on three benchmark datasets.

02. LARE demonstrates robustness with limited and imbalanced data.

03. LARE effectively generalizes to unseen domains.

Abstract

In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual questions." Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. By sampling the augmented…
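The abstract's core idea, treating an image as a region rather than a single point and sampling augmented embeddings from that region, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the region is parameterized, so the axis-aligned box, the `half_width` parameter, the function names, and the 4-dimensional toy vector standing in for a CLIP image embedding are all assumptions made for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length, as is common for CLIP-style embeddings."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def sample_from_region(center, half_width, n_samples, seed=0):
    """Draw points uniformly from a box around `center` (hypothetical region shape).

    The box spans center[i] +/- half_width[i] in each dimension. Each sampled
    point is re-normalized so it lies on the unit sphere like the original.
    """
    if len(center) != len(half_width):
        raise ValueError("center and half_width must have the same length")
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        point = [c + rng.uniform(-w, w) for c, w in zip(center, half_width)]
        samples.append(l2_normalize(point))
    return samples


# Toy stand-in for an image embedding produced by a vision-language model.
image_embedding = l2_normalize([0.2, -0.5, 0.7, 0.1])
half_width = [0.05] * len(image_embedding)

# Augmented embeddings that could be used as extra training points for a classifier.
augmented = sample_from_region(image_embedding, half_width, n_samples=8)
```

In a setup like this, the sampled vectors would be added to the training set of a downstream classifier alongside the original embedding. How LARE actually learns and shapes its regions is described in the full paper, not in this sketch.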

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling