Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

Shintaro Ozaki; Kazuki Hayashi; Yusuke Sakai; Hidetaka Kamigaito; Katsuhiko Hayashi; Taro Watanabe

arXiv:2409.01584·cs.CL·February 17, 2025

TL;DR

This paper investigates the multilingual explanation capabilities of large-scale vision-language models, introduces a culturally nuanced dataset in multiple languages, and evaluates how English instruction-tuning impacts cross-lingual performance.

Contribution

It presents a new multilingual dataset that accounts for cultural nuances, and analyzes the effects of instruction-tuning on non-English explanation generation in LVLMs.

Findings

01. LVLMs perform worse in non-English languages than in English.

02. Models struggle to transfer English-trained knowledge to other languages.

03. Instruction-tuning in English improves non-English performance but does not fully close the gap.

Abstract

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and demand for explanations generated by LVLMs is expected to grow. However, pre-training of the Vision Encoder and the integrated training of LLMs with the Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can fully realize their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that build their datasets using machine translation carry cultural differences and biases, which remain issues for their use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset, which takes into account nuances and country-specific phrases, was then…

Code & Models

Datasets

naist-nlp/MultiExpArt
dataset· 45 dl

Videos

Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models· underline

Taxonomy

Topics: Multimodal Machine Learning Applications