PALO: A Polyglot Large Multimodal Model for 5B People

Muhammad Maaz; Hanoona Rasheed; Abdelrahman Shaker; Salman Khan,; Hisham Cholakal; Rao M. Anwer; Tim Baldwin; Michael Felsberg; Fahad S. Khan

arXiv:2402.14818·cs.CL·March 6, 2024·1 cites

PALO: A Polyglot Large Multimodal Model for 5B People

Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan,, Hisham Cholakal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan

PDF

Open Access 1 Repo 3 Datasets

TL;DR

PALO is a large, multilingual vision-language model designed to provide visual reasoning in ten major languages, leveraging semi-automated translation and instruction tuning to enhance performance across diverse linguistic groups.

Contribution

This work introduces PALO, the first multilingual multimodal model trained on multiple languages with a scalable translation approach and a new benchmark for multilingual vision-language reasoning.

Findings

01

Substantial performance improvements over baselines.

02

Effective scalability across model sizes (1.7B, 7B, 13B).

03

Enhanced reasoning in underrepresented languages.

Abstract

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of ~5B people (65% of the world population). Our approach involves a semi-automated translation approach to adapt the multimodal instruction dataset from English to the target languages using a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while allowing scalability due to minimal manual effort. The incorporation of diverse instruction sets helps us boost overall performance across multiple languages especially those that are underrepresented like Hindi, Arabic, Bengali, and Urdu. The resulting models are trained across three scales (1.7B, 7B and 13B…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

mbzuai-oryx/palo
pytorchOfficial

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsHuman Mobility and Location-Based Analysis