LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
Ali Asgarov, Samir Rustamov

TL;DR
This paper introduces LowCLIP, a modified CLIP architecture optimized for low-resource languages such as Azerbaijani, using data augmentation and domain-specific training to improve image retrieval performance.
Contribution
It adapts CLIP for low-resource languages by integrating synthetic data, augmentation, and specialized training, achieving state-of-the-art results in multimodal image retrieval.
Findings
EfficientNet0 and Tiny Swin Transformer perform best on benchmark datasets.
Augmentation techniques improve retrieval MAP scores significantly.
Achieved new state-of-the-art results in vision-language retrieval for low-resource languages.
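The findings above are reported as retrieval MAP (mean average precision) scores. As a minimal sketch of how such a score is commonly computed, assuming the standard definition in which each query's average precision is normalized by its number of relevant items, the following may help. The function names and the toy item identifiers are hypothetical, and the paper's exact evaluation protocol (for example, any rank cut-off) is not stated in this summary.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query.

    ranked_ids: retrieved item ids, best match first.
    relevant_ids: ids of the items that are truly relevant to the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    # Assumption: normalize by the total number of relevant items.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of the per-query average precision over all queries."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevance)]
    return sum(scores) / len(scores)
```

For a single text query whose ranked images are `["a", "b", "c"]` with `"a"` and `"c"` relevant, the average precision is (1/1 + 2/3) / 2.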
Abstract
This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny…
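The abstract describes pairing a text encoder with an image encoder in a CLIP-style architecture. As a rough illustration of the training objective used in the original CLIP (a symmetric cross-entropy over temperature-scaled cosine similarities), here is a dependency-free sketch. It assumes the encoders have already produced one embedding per image and per caption. The function names are hypothetical, the temperature of 0.07 is CLIP's initial value rather than a setting taken from this paper, and the authors' actual loss and hyperparameters may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    image_embs[k] and text_embs[k] are assumed to describe the same pair.
    temperature=0.07 is an assumption borrowed from the original CLIP setup.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    img_to_txt = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    txt_to_img = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (img_to_txt + txt_to_img) / 2
```

With correctly paired embeddings the loss is close to zero, and it grows when captions are matched to the wrong images, which is the signal that pulls matching image and text embeddings together during training.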
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Adam · Layer Normalization · Weight Decay · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · WordPiece · Attention Dropout · Linear Warmup With Linear Decay
