ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Quoc-Khang Tran; Minh-Thien Nguyen; Nguyen-Khang Pham

arXiv:2602.22678·cs.CV·February 27, 2026

TL;DR

ViCLIP-OT is a novel vision-language model tailored for Vietnamese image-text retrieval, combining contrastive learning with optimal transport to improve cross-modal alignment in low-resource settings.

Contribution

The paper introduces a framework for Vietnamese that combines CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss, aiming to reduce the modality gap and improve retrieval performance.
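As a rough illustration of what pairing a contrastive objective with an optimal transport term can look like, the sketch below adds a plain entropic OT (Sinkhorn) cost to a symmetric CLIP-style loss. It is not the paper's SIGROT loss: this page does not describe the similarity-graph regularization, so that part is omitted. The function names, the temperature of 0.07, `epsilon`, and `ot_weight` are all hypothetical choices made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def similarity_matrix(image_emb, text_emb):
    """Cosine similarity between every image and every text in the batch."""
    images = [normalize(v) for v in image_emb]
    texts = [normalize(v) for v in text_emb]
    return [[dot(i, t) for t in texts] for i in images]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_loss(sim, temperature=0.07):
    """Symmetric contrastive loss (image-to-text plus text-to-image)."""
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def sinkhorn_plan(cost, epsilon=0.1, iterations=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_cost(sim, epsilon=0.1):
    """Transport cost <P, C> with C = 1 - cosine similarity (assumed cost)."""
    cost = [[1.0 - s for s in row] for row in sim]
    plan = sinkhorn_plan(cost, epsilon)
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))


def combined_loss(image_emb, text_emb, ot_weight=0.5):
    """Contrastive loss plus a weighted OT term (ot_weight is hypothetical)."""
    sim = similarity_matrix(image_emb, text_emb)
    return clip_loss(sim) + ot_weight * ot_cost(sim)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    texts = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
    print(combined_loss(images, texts))
```

In this simplified form the OT term depends only on how far apart the two sets of embeddings are, not on which caption is paired with which image, so the pairing signal comes entirely from the contrastive term. A training implementation would use a tensor library with automatic differentiation rather than Python lists.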

Findings

01

Outperforms CLIP and SigLIP baselines on Vietnamese benchmarks.

02

Achieves an average Recall@K of 67.34% on UIT-OpenViIC, 5.75 points higher than CLIP.

03

Surpasses CLIP by 11.72 percentage points in the zero-shot setting on Crossmodal-3600.
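For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how Recall@K and its average over several cutoffs are commonly computed for retrieval. The function names and the default cutoffs of 1, 5, and 10 are assumptions for illustration; this page does not state which K values the paper averages over, or whether the average covers both retrieval directions.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item appears in the top-k results.

    similarity[q][c] is the score between query q and candidate c;
    ground_truth[q] is the index of the correct candidate for query q.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if ground_truth[query_idx] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def average_recall(similarity, ground_truth, ks=(1, 5, 10)):
    """Mean of Recall@K over the given cutoffs (the cutoffs are an assumption)."""
    return sum(recall_at_k(similarity, ground_truth, k) for k in ks) / len(ks)


if __name__ == "__main__":
    # Toy scores: the correct candidate is ranked 1st, 2nd, and 3rd respectively.
    toy_similarity = [
        [0.9, 0.1, 0.3],
        [0.2, 0.4, 0.8],
        [0.5, 0.6, 0.1],
    ]
    toy_truth = [0, 1, 2]
    for k in (1, 2, 3):
        print(k, recall_at_k(toy_similarity, toy_truth, k))
```

On the toy data, one of three queries is answered correctly at K=1, two at K=2, and all three at K=3, which shows why Recall@K can only stay flat or increase as K grows.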

Abstract

Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate the modality gap. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques