Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

TL;DR
This paper investigates the multilingual explanation capabilities of large-scale vision-language models, introduces a culturally nuanced dataset in multiple languages, and evaluates how English instruction-tuning impacts cross-lingual performance.
Contribution
It presents a new multilingual dataset that accounts for cultural nuances, and analyzes the effects of instruction-tuning on non-English explanation generation in LVLMs.
Findings
LVLMs perform worse in non-English languages.
Models struggle to transfer English-trained knowledge to other languages.
Instruction-tuning in English improves but does not fully close the performance gap.
Abstract
As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and demand for explanations generated by LVLMs is expected to grow. However, pre-training of the Vision Encoder and the integrated training of LLMs with the Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can reach their full potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks whose datasets are created using machine translation carry cultural differences and biases, which remain issues for their use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset that takes into account nuances and country-specific phrases was then…
Taxonomy
Topics: Multimodal Machine Learning Applications
