ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport
Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham

TL;DR
ViCLIP-OT is a novel vision-language model tailored for Vietnamese image-text retrieval, combining contrastive learning with optimal transport to improve cross-modal alignment in low-resource settings.
Contribution
It introduces a framework that integrates a Similarity-Graph Regularized Optimal Transport (SIGROT) loss with CLIP-style contrastive learning for Vietnamese, targeting the modality gap and improving retrieval performance.
Findings
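Consistently outperforms CLIP and SigLIP baselines on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600), in both in-domain and zero-shot settings.
Achieves an average Recall@K of 67.34% on UIT-OpenViIC, 5.75 points higher than CLIP.
Surpasses CLIP by 11.72 percentage points in the zero-shot setting on Crossmodal-3600.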
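For readers unfamiliar with the metric, the following is a minimal sketch of how an average Recall@K could be computed from an image-text similarity matrix. It is not the paper's evaluation code. It assumes one ground-truth caption per image (stored at the same index), K values of 1, 5, and 10, and averaging over both retrieval directions; the function names `recall_at_k` and `average_recall` are hypothetical.

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth candidate is ranked in the top k.

    sim[i, j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    # Indices of the k most similar candidates for every query
    topk = np.argsort(-sim, axis=1)[:, :k]
    # A query is a hit if its own index appears among its top-k candidates
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

def average_recall(sim, ks=(1, 5, 10)):
    """Mean Recall@K in percent over both directions and all K (assumed protocol)."""
    # sim ranks candidates per row; sim.T swaps the roles of queries and candidates
    scores = [recall_at_k(s, k) for s in (sim, sim.T) for k in ks]
    return 100.0 * float(np.mean(scores))
```

Benchmarks with several captions per image need a ground-truth mapping in place of the same-index assumption.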
Abstract
Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75…
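To illustrate the general idea of pairing a contrastive objective with an optimal transport term, here is a rough sketch in NumPy. It is not the SIGROT loss, whose similarity-graph regularization is not specified in this summary. Instead it combines a standard symmetric CLIP-style loss with a plain entropic OT (Sinkhorn) cost between the image and text embeddings of a batch. The temperature, `eps`, the weight `lam`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x):
    # Unit-length rows so that dot products are cosine similarities
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss; pair i is (img[i], txt[i])."""
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float(0.5 * (i2t + t2i))

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)  # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows toward marginal a
        v = b / (K.T @ u)    # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def ot_alignment_loss(img, txt, eps=0.1):
    """Transport cost between the two embedding sets (stand-in for SIGROT).

    This term is distribution-level: it does not use the image-text pairing.
    """
    cost = 1.0 - img @ txt.T  # cosine distance, in [0, 2] for unit vectors
    plan = sinkhorn_plan(cost, eps)
    return float((plan * cost).sum())

def combined_loss(img, txt, lam=0.5):
    # lam is an assumed weighting; the paper's value is not given here
    img, txt = l2_normalize(img), l2_normalize(txt)
    return clip_loss(img, txt) + lam * ot_alignment_loss(img, txt)
```

In this sketch the contrastive term supervises individual pairs, while the transport term penalizes the overall distance between the image and text embedding distributions, which is one plausible way to address a modality gap.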
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
