TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins

TL;DR
This paper introduces TowerVision, a family of multilingual vision-language models built on Tower+, demonstrating improved cross-lingual performance and cultural understanding in multimodal tasks, supported by extensive empirical analysis and new datasets.
Contribution
The paper presents TowerVision, a novel multilingual VLM family, along with insights on multilingual training data and model initialization, and a new curated dataset, advancing multilingual multimodal understanding.
Findings
Multilingual training data enhances cross-lingual generalization.
Instruction-tuned LLMs are not always optimal for initialization.
Visual and cultural context improves performance on culturally grounded tasks.
Abstract
Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K…
