Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual

Sukrit Sriratanawilai; Jhayahgrit Thongwat; Romrawin Chumpu; Patomporn Payoungkhamdee; Sarana Nutanong; Peerat Limkonchotiwat

arXiv:2510.26271 · cs.CL · October 31, 2025

TL;DR

This paper investigates how knowledge distillation affects the performance and multilingual robustness of smaller vision-language models, revealing that some configurations can maintain or improve cross-lingual retrieval despite model compression.

Contribution

The paper provides a controlled empirical analysis of five distillation approaches applied to multilingual VLMs (CLIP and SigLIP2), highlighting their impact on cross-lingual representation consistency and downstream task stability.

Findings

01. Some distillation configurations preserve or improve multilingual retrieval robustness.

02. Certain approaches fail to maintain cross-task stability after model compression.

03. Design-sensitive trade-offs exist between accuracy and robustness in multilingual distillation.

Abstract

Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingual settings is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that…
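
To make the two quantities named in the abstract more concrete, here is a minimal sketch of a generic feature-matching distillation loss and a simple cross-lingual consistency score. The abstract does not describe the five formulations, so this should not be read as any of the paper's methods. The function names `feature_distillation_loss` and `cross_lingual_consistency` are hypothetical, and the sketch assumes student and teacher embeddings share the same dimension (a learned projection would otherwise be needed).

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def feature_distillation_loss(student_embs, teacher_embs):
    """Mean squared L2 distance between normalized student and teacher embeddings.

    Assumption: a generic feature-matching objective, not necessarily one of
    the five formulations studied in the paper.
    """
    total = 0.0
    for s, t in zip(student_embs, teacher_embs):
        s_hat, t_hat = l2_normalize(s), l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(s_hat, t_hat))
    return total / len(student_embs)


def cross_lingual_consistency(embs_lang_a, embs_lang_b):
    """Mean cosine similarity between embeddings of parallel captions.

    Assumption: row i in both lists encodes the same caption in two languages.
    """
    total = 0.0
    for a, b in zip(embs_lang_a, embs_lang_b):
        a_hat, b_hat = l2_normalize(a), l2_normalize(b)
        total += sum(x * y for x, y in zip(a_hat, b_hat))
    return total / len(embs_lang_a)
```

Under these assumptions, a student that reproduces the teacher's embedding directions exactly gets a loss of 0, and captions whose two translations map to the same direction get a consistency score of 1.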
