LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task

Ali Asgarov; Samir Rustamov

arXiv:2408.13909 · cs.CV · August 27, 2024



TL;DR

This paper introduces LowCLIP, a modified CLIP architecture optimized for low-resource languages like Azerbaijani, using data augmentation and domain-specific training to improve image retrieval performance.

Contribution

It adapts CLIP for low-resource languages by integrating synthetic data, augmentation, and specialized training, achieving state-of-the-art results in multimodal image retrieval.

Findings

01. EfficientNet0 and Tiny Swin Transformer perform best on benchmark datasets.

02. Augmentation techniques improve retrieval MAP scores significantly.

03. The approach achieves new state-of-the-art results in vision-language retrieval for low-resource languages.
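
The findings above are reported as MAP (mean average precision). This page does not state how the paper computes it (for example, whether a rank cutoff is applied), so the following is only a minimal sketch of the standard definition. The function names and the toy query data are hypothetical and chosen for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query.

    Averages precision@k over every rank k that holds a relevant item,
    divided by the total number of relevant items for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of per-query average precision over all queries."""
    scores = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(scores) / len(scores)


# Hypothetical retrieval output: each text query maps to a ranked list of image IDs.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
# Hypothetical ground truth: the images that actually match each query.
ground_truth = {"q1": ["a"], "q2": ["y"]}

map_score = mean_average_precision(rankings, ground_truth)
```

In this toy case the first query finds its image at rank 1 and the second at rank 2, so the per-query scores are 1.0 and 0.5.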

Abstract

This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny…
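
The abstract describes pairing a text encoder with an image encoder in a CLIP-style architecture. As a rough illustration of the kind of objective such models are commonly trained with, here is a minimal sketch of a symmetric contrastive loss over a batch of matched image and caption embeddings. It is not the authors' implementation: the function names, the temperature of 0.07, and the toy embeddings are all assumptions, and the real encoders (Multilingual BERT, ResNet50, and so on) are replaced by hand-written vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; temperature is an assumed value."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: cosine similarity of image i and caption j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Transpose to score the caption-to-image direction as well
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings standing in for encoder outputs (hypothetical values).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # caption i points toward image i
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # caption i points toward the wrong image

loss_aligned = clip_style_loss(image_embs, aligned_text)
loss_swapped = clip_style_loss(image_embs, swapped_text)
```

The loss is small when each caption embedding lies closest to its own image and large when the pairs are mismatched, which is the property that makes the shared embedding space usable for text-to-image retrieval.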


Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Adam · Layer Normalization · Weight Decay · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · WordPiece · Attention Dropout · Linear Warmup With Linear Decay