Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model

Keito Sasagawa; Koki Maeda; Issa Sugiura; Shuhei Kurita; Naoaki Okazaki; Daisuke Kawahara

arXiv:2410.22736 · cs.CL · October 31, 2024

TL;DR

This paper presents a method for rapidly creating Japanese multimodal datasets from scratch, enabling the development of high-performing Japanese Visual Language Models that outperform models using translated data.

Contribution

It introduces a novel approach to quickly generate native Japanese multimodal datasets, addressing the lack of non-English resources for VLM development.

Findings

01. A VLM trained on the native Japanese datasets outperforms models trained on machine-translated data.

02. Japanese image-text pairs and interleaved data were collected from web archives.

03. Instruction data generated directly from images improves model performance.

Abstract

To develop high-performing Visual Language Models (VLMs), it is essential to prepare multimodal resources, such as image-text pairs, interleaved data, and instruction data. While multimodal resources for English are abundant, there is a significant lack of corresponding resources for non-English languages, such as Japanese. To address this problem, we take Japanese as a non-English language and propose a method for rapidly creating Japanese multimodal datasets from scratch. We collect Japanese image-text pairs and interleaved data from web archives and generate Japanese instruction data directly from images using an existing VLM. Our experimental results show that a VLM trained on these native datasets outperforms those relying on machine-translated content.
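As a rough illustration of what collecting Japanese image-text pairs from web data involves, the sketch below keeps only pairs whose caption contains a minimum share of kana characters. This is not the authors' pipeline: the function names, the kana-ratio heuristic, the `min_ratio` threshold, and the example URLs are all assumptions made for illustration.

```python
import re

# Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF).
# Kanji alone is ambiguous with Chinese, so this heuristic requires kana.
KANA = re.compile(r"[\u3040-\u30ff]")


def kana_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are kana."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if KANA.match(c)) / len(chars)


def filter_japanese_pairs(pairs, min_ratio=0.1):
    """Keep (image_url, caption) pairs whose caption looks Japanese.

    min_ratio is a hypothetical threshold, not a value from the paper.
    """
    return [(url, cap) for url, cap in pairs if kana_ratio(cap) >= min_ratio]


# Hypothetical sample data: one Japanese, one English, one Chinese caption.
sample = [
    ("https://example.com/a.jpg", "公園で遊ぶ犬の写真"),
    ("https://example.com/b.jpg", "A dog playing in the park"),
    ("https://example.com/c.jpg", "北京的公园"),
]
kept = filter_japanese_pairs(sample)
```

A character-level heuristic like this is cheap enough to run over large crawls, but a production pipeline would likely pair it with a trained language identifier and image-quality filtering.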

Taxonomy

Topics: Subtitles and Audiovisual Media · Educational Tools and Methods · Speech and Dialogue Systems