Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Xiang Yue; Yueqi Song; Akari Asai; Seungone Kim; Jean de Dieu Nyandwi; Simran Khanuja; Anjali Kantharuban; Lintang Sutawika; Sathyanarayanan Ramamoorthy; Graham Neubig

arXiv:2410.16153·cs.CL·January 28, 2025

Open Access · 2 Models · 2 Datasets · 3 Reviews

TL;DR

Pangea is a fully open, multilingual multimodal large language model trained on 39 languages, demonstrating superior performance across diverse cultural and linguistic contexts, and promoting inclusivity in AI.

Contribution

Introduces Pangea, a novel multilingual multimodal LLM trained on a diverse dataset, with a comprehensive evaluation suite, advancing inclusive AI development.

Findings

01. Pangea outperforms existing open-source models in multilingual tasks.

02. English data proportion and cultural relevance impact model performance.

03. Open-sourcing promotes inclusive AI development.

Abstract

Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and Western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the importance…

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. This paper addresses an important research problem in multilingual multimodal learning by providing a large instruction dataset and an evaluation suite that cover multiple languages with corresponding cultural reference annotations. This data could be of interest to a broad audience in the research community.
2. The paper is well written and easy to follow. The proposed PANGEAINS instruction data and the PANGEABENCH evaluation suite are described in good detail. The authors f

Weaknesses

1. It could be helpful to add a section discussing related multilingual multimodal datasets (e.g., M3IT) and showing how PANGEA compares to them. This could further illustrate the significance of the proposed data.
2. Although the authors highlight the introduction of PANGEA, the discussion of its modeling aspects is light. It seems to be a direct application of a LLaVA-Next-based model trained on the newly proposed data. It would help to discuss the training specifics further.
3

Reviewer 02 · Rating 8 · Confidence 4

Strengths

A. This paper includes a very significant set of contributions, which will be made public together with the code:
1. PangeaIns, with 6M instructions from 39 languages, comprising (1) high-quality English instructions, (2) machine-translated instructions, and (3) culturally relevant multimodal tasks
2. PangeaBench, a multimodal evaluation suite consisting of 14 diverse datasets covering 47 languages
3. Pangea, an MLLM covering 39 languages, outperforms existing open-source models by 7.3 points on English

Weaknesses

There are only two minor weaknesses, both concerning clarity and presentation:
A. Section 2.1 mentions a post-processing pipeline for noisy translations, but the details are missing. These could be quite important for practitioners building similar models, so I suggest the authors include more details there.
B. Section 4.1 mentions that the model is based on the LLaVA-Next architecture, but this needs to be detailed. For example, does the tra

Reviewer 03 · Rating 5 · Confidence 3

Strengths

Originality: The paper presents approaches to multilingual and multimodal learning through the development of the PANGEAINS dataset and the PANGEABENCH evaluation framework. By focusing on both linguistic and cultural diversity, it addresses gaps in existing datasets that often overlook these critical aspects. The integration of machine translation and culturally relevant guidelines for data curation showcases an innovative methodology that enhances the relevance and applicability of the dataset.

Weaknesses

Limitations of the Dataset: The PANGEAINS dataset comprises samples from 39 languages; however, there is a potential imbalance in both the quantity and quality of data across languages. This issue is particularly pronounced for low-resource languages, which the authors do not explicitly address. The scarcity of data for these languages can make it difficult to train models that perform effectively across diverse linguistic contexts. Insufficient represent

Taxonomy

Topics: Natural Language Processing Techniques · Translation Studies and Practices · Topic Modeling