TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

André G. Viveiros; Patrick Fernandes; Saul Santos; Sonal Sannigrahi; Emmanouil Zaranis; Nuno M. Guerreiro; Amin Farajian; Pierre Colombo; Graham Neubig; André F. T. Martins

arXiv:2510.21849·cs.LG·November 7, 2025




TL;DR

This paper introduces TowerVision, a family of multilingual vision-language models built on Tower+, demonstrating improved cross-lingual performance and cultural understanding in multimodal tasks, supported by extensive empirical analysis and new datasets.

Contribution

The paper presents TowerVision, a novel multilingual VLM family, along with insights on multilingual training data, model initialization, and a new curated dataset, advancing multilingual multimodal understanding.

Findings

1. Multilingual training data enhances cross-lingual generalization.
2. Instruction-tuned LLMs are not always optimal for initialization.
3. Visual and cultural context improves performance on culturally grounded tasks.

Abstract

Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K…
