Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig

TL;DR
Pangea is a fully open multilingual multimodal large language model trained on 39 languages. It outperforms existing open-source models in multilingual settings and diverse cultural contexts, and it is released openly to promote inclusivity in AI.
Contribution
Introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a 6M-sample instruction dataset spanning 39 languages, together with PangeaBench, an evaluation suite of 14 datasets covering 47 languages.
Findings
Pangea outperforms existing open-source models in multilingual tasks.
The proportion of English data and the cultural relevance of the training data both affect model performance.
Open-sourcing promotes inclusive AI development.
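The finding about English data proportion can be made concrete with a small sketch. The code below is not the authors' pipeline: the function name `mix_by_english_ratio`, the record layout, and the 0.4 ratio are all illustrative assumptions. It only shows one simple way to build a training mix in which a chosen fraction of examples comes from an English pool and the rest from a multilingual pool, which is the kind of knob such an ablation would vary.

```python
import random


def mix_by_english_ratio(english, multilingual, english_ratio, total, seed=0):
    """Sample `total` examples, drawing about `english_ratio` of them from
    the English pool and the remainder from the multilingual pool.

    Hypothetical helper for illustration only; not from the Pangea codebase.
    """
    if not 0.0 <= english_ratio <= 1.0:
        raise ValueError("english_ratio must be in [0, 1]")
    n_en = round(total * english_ratio)
    n_ml = total - n_en
    if n_en > len(english) or n_ml > len(multilingual):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = rng.sample(english, n_en) + rng.sample(multilingual, n_ml)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed


if __name__ == "__main__":
    # Toy pools; real instruction records would carry images and text.
    english = [{"lang": "en", "id": i} for i in range(100)]
    multilingual = [{"lang": "sw", "id": i} for i in range(100)]
    mix = mix_by_english_ratio(english, multilingual, english_ratio=0.4, total=50)
    n_en = sum(1 for ex in mix if ex["lang"] == "en")
    print(f"{n_en} English / {len(mix) - n_en} multilingual")
```

Sweeping `english_ratio` over several values and training one model per mix is the general shape of such an ablation; the ratios actually studied are reported in the paper itself.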
Abstract
Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the importance…
Peer Reviews
Decision·ICLR 2025 Poster
1. This paper addresses an important research problem in multilingual multimodal learning by providing a large instruction dataset and an evaluation suite that cover multiple languages, with corresponding cultural reference annotations. This data could be of interest to a broad audience in the research community. 2. The paper is well written and easy to follow. The proposed PANGEAINS instruction data and the PANGEABENCH evaluation suite are described in a good amount of detail. The authors f
1. It could be helpful to add a section discussing related multilingual multimodal datasets (e.g., M3IT) and showing how PANGEA compares to them. This could further illustrate the significance of the proposed data. 2. Although the authors highlight the introduction of PANGEA, the discussion of its modeling aspects is light. It seems to be a direct application of a LLaVA-Next-based model trained on the newly proposed data. It would help to discuss the training specifics further. 3
A. This paper includes a very significant set of contributions, which will be made public together with the code: 1. PangeaIns, with 6M instructions from 39 languages, including (1) high-quality English instructions, (2) machine-translated instructions, and (3) culturally relevant multimodal tasks. 2. PangeaBench, a multimodal evaluation suite consisting of 14 diverse datasets covering 47 languages. 3. Pangea, an MLLM covering 39 languages, which outperforms existing open-source models by 7.3 points on English
There are only two minor weaknesses, both concerning clarity and presentation: A. Section 2.1 mentions a post-processing pipeline for noisy translations, but the details are missing. This could be quite important for practitioners building similar models, so I suggest the authors include more details there. B. Section 4.1 mentions that the model is based on the LLaVA-Next architecture, but this needs to be detailed. For example, does the tra
Originality: The paper presents approaches to multilingual and multimodal learning through the development of the PANGEAINS dataset and the PANGEABENCH evaluation framework. By focusing on both linguistic and cultural diversity, it addresses gaps in existing datasets that often overlook these critical aspects. The integration of machine translation and culturally relevant guidelines for data curation showcases an innovative methodology that enhances the relevance and applicability of the dataset.
Limitations of the Dataset: The PANGEAINS dataset comprises samples from 39 languages; however, there exists a potential imbalance in both the quantity and quality of data for specific languages. This issue is particularly pronounced for low-resource languages that the authors do not explicitly address. The scarcity of data for these languages can lead to significant challenges in training models that are capable of performing effectively across diverse linguistic contexts. Insufficient represent
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Topic Modeling
