Harnessing PDF Data for Improving Japanese Large Multimodal Models

Jeonghun Baek; Akiko Aizawa; Kiyoharu Aizawa

arXiv:2502.14778·cs.CL·January 9, 2026

TL;DR

This paper introduces an automated pipeline that extracts image-text pairs from Japanese PDFs, an underutilized data source, to train large multimodal models, yielding gains of 2.1% to 13.8% on Heron-Bench.

Contribution

We develop a fully automated pipeline for extracting and utilizing Japanese PDF data to improve multimodal model training, addressing data scarcity and cultural knowledge gaps.

Findings

01 Performance gains of 2.1% to 13.8% on Heron-Bench.

02 PDF data significantly enhances Japanese LMMs.

03 Automated extraction pipeline reduces manual effort.

Abstract

Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, which restricts their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, one that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from the extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark.…
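
The pairing stage named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Region` structure, the function names, and the sample page are all hypothetical, layout analysis and OCR are assumed to have already produced the regions, and a simple spatial-proximity score stands in for the vision-language model the paper uses for pairing. The scorer is passed in as a parameter so that a model-based similarity function could replace it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    """One layout region on a PDF page (hypothetical structure)."""
    kind: str                                 # "image" or "text"
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in page coordinates
    text: str = ""                            # OCR output for text regions; empty for images


def center_distance(a: Region, b: Region) -> float:
    """Euclidean distance between the centers of two regions."""
    ax, ay = (a.bbox[0] + a.bbox[2]) / 2, (a.bbox[1] + a.bbox[3]) / 2
    bx, by = (b.bbox[0] + b.bbox[2]) / 2, (b.bbox[1] + b.bbox[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def proximity_score(image: Region, text: Region) -> float:
    """Stand-in scorer (assumption): closer regions score higher.

    This is NOT a vision-language model; it only marks where one would plug in.
    """
    return -center_distance(image, text)


def pair_images_with_text(
    regions: List[Region],
    score: Callable[[Region, Region], float] = proximity_score,
) -> List[Tuple[Region, str]]:
    """Pair each image region with the non-empty text region that scores highest."""
    images = [r for r in regions if r.kind == "image"]
    texts = [r for r in regions if r.kind == "text" and r.text.strip()]
    if not texts:
        return []
    return [(img, max(texts, key=lambda t: score(img, t)).text) for img in images]


# Hypothetical page: one figure, its caption just below, and a distant body paragraph.
page = [
    Region("image", (50, 100, 250, 300)),
    Region("text", (50, 310, 250, 340), "図1: 富士山の写真"),
    Region("text", (50, 600, 500, 700), "本文の段落"),
]
pairs = pair_images_with_text(page)
```

With this sample page, the figure is paired with the caption directly beneath it rather than with the body paragraph, because the caption's center is closer to the image's center.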

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Geographic Information Systems Studies