Unified Vision-Language Modeling via Concept Space Alignment

Yifu Qiu; Paul-Ambroise Duquenne; Holger Schwenk

arXiv:2603.01096·cs.CV·March 3, 2026

Open Access

TL;DR

This paper presents V-SONAR, a unified vision-language embedding space extending SONAR, enabling multilingual, multi-modal understanding and surpassing state-of-the-art models in video captioning and multilingual tasks.

Contribution

The paper introduces V-SONAR, a vision-language embedding space obtained by post-hoc alignment of an existing vision encoder to SONAR, and develops V-LCM for multilingual vision-language tasks.

Findings

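01 V-SONAR embeddings achieve competitive performance on text-to-video retrieval.

02 Equipped with the OMNISONAR text decoder, V-SONAR surpasses state-of-the-art vision-language models on video captioning: DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0).

03 V-LCM outperforms existing models on multilingual vision-language tasks.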

Abstract

We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al., 2024), operating in SONAR and trained with English text only, can perform both single- and multi-visual concept…
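
The idea of post-hoc alignment followed by text-to-video retrieval can be illustrated with a minimal sketch. It is not the paper's pipeline, whose details are not given here. It assumes a frozen vision encoder and a frozen text embedding space, replaces the learned alignment with a closed-form ridge-regression linear map, and uses synthetic arrays in place of real features. All names (`fit_linear_alignment`, `retrieve`, the dimensions) are hypothetical.

```python
import numpy as np


def fit_linear_alignment(vision_feats, text_embs, ridge=1e-3):
    """Fit W minimizing ||vision_feats @ W - text_embs||^2 + ridge * ||W||^2.

    Assumption: a single linear map stands in for the alignment module.
    """
    dim = vision_feats.shape[1]
    gram = vision_feats.T @ vision_feats + ridge * np.eye(dim)
    return np.linalg.solve(gram, vision_feats.T @ text_embs)


def retrieve(query_embs, candidate_embs):
    """For each query, return the index of the most cosine-similar candidate."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)


# Synthetic stand-ins for paired (video feature, caption embedding) data.
rng = np.random.default_rng(0)
n_pairs, d_vision, d_text = 200, 32, 16
hidden_map = rng.normal(size=(d_vision, d_text))
vision_feats = rng.normal(size=(n_pairs, d_vision))
text_embs = vision_feats @ hidden_map + 0.01 * rng.normal(size=(n_pairs, d_text))

# Fit the map on 150 pairs, then project the 50 held-out video features.
W = fit_linear_alignment(vision_feats[:150], text_embs[:150])
aligned_videos = vision_feats[150:] @ W

# Text-to-video retrieval: each held-out caption should find its own video.
predicted = retrieve(text_embs[150:], aligned_videos)
accuracy = float(np.mean(predicted == np.arange(50)))
print(f"held-out top-1 retrieval accuracy: {accuracy:.2f}")
```

The text space stays fixed and only the vision side is mapped into it. This is why, under the stated assumptions, a decoder or model trained on text embeddings alone could consume the projected visual embeddings without retraining.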

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling