Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models

Jesse Atuhurra; Iqra Ali; Tatsuya Hiraoka; Hidetaka Kamigaito; Tomoya Iwakura; Taro Watanabe

arXiv:2406.15359 · cs.CL · June 25, 2024

TL;DR

This paper develops multilingual visual-text datasets across four languages to systematically evaluate vision-language models, introducing new tasks, rationales, and human evaluation to analyze their fine-grained visual linguistic abilities.

Contribution

It creates new multilingual datasets with rationales for comprehensive VLM evaluation, including the novel 'unrelatedness' task and analysis in Swahili and Urdu.

Findings

01. VLMs can be fine-tuned on the new datasets

02. Rationales improve understanding of VLM reasoning

03. First analysis of Swahili and Urdu VLM capabilities

Abstract

Large language models (LLMs) have increased interest in vision language models (VLMs), which process image-text pairs as input. Studies investigating the visual understanding ability of VLMs have been proposed, but they remain preliminary because existing datasets do not permit a comprehensive evaluation of the fine-grained visual linguistic abilities of VLMs across multiple languages. To further explore the strengths of VLMs such as GPT-4V (OpenAI, 2023), we developed new datasets for the systematic and qualitative analysis of VLMs. Our contribution is four-fold: 1) we introduced nine vision-and-language (VL) tasks (including object recognition, image-text matching, and more) and constructed multilingual visual-text datasets in four languages (English, Japanese, Swahili, and Urdu) by using templates containing questions and prompting GPT-4V to…

Taxonomy

Topics: Language, Metaphor, and Cognition