Harnessing PDF Data for Improving Japanese Large Multimodal Models
Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa

TL;DR
This paper introduces an automated method for extracting image-text pairs from Japanese PDFs, an underutilized data source, and shows that training on this data significantly improves large multimodal models on Japanese benchmarks.
Contribution
We develop a fully automated pipeline for extracting and utilizing Japanese PDF data to improve multimodal model training, addressing data scarcity and cultural knowledge gaps.
Findings
Performance gains of 2.1% to 13.8% on Heron-Bench.
PDF data significantly enhances Japanese LMMs.
Automated extraction pipeline reduces manual effort.
Abstract
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark.…
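The abstract names three stages (layout analysis, OCR, and vision-language pairing) without giving implementation details. As a rough illustration of the last stage only, the sketch below pairs each figure region on a page with one text region. Everything in it is an assumption made for illustration: the `Region` type, the `"figure"`/`"text"` labels, and the `pair_regions` helper are hypothetical, and the default scoring function is a simple geometric nearest-neighbor heuristic standing in for the pretrained vision-language model that the paper describes. It is not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    """One detected page region (hypothetical output of a layout model plus OCR)."""
    kind: str                                # assumed labels: "figure" or "text"
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in page coordinates
    text: str = ""                           # OCR text for text regions, empty for figures


def center_distance(a: Region, b: Region) -> float:
    """Euclidean distance between the centers of two bounding boxes."""
    ax, ay = (a.box[0] + a.box[2]) / 2, (a.box[1] + a.box[3]) / 2
    bx, by = (b.box[0] + b.box[2]) / 2, (b.box[1] + b.box[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def proximity_score(fig: Region, txt: Region) -> float:
    """Stand-in score: closer text scores higher. A learned model could replace this."""
    return -center_distance(fig, txt)


def pair_regions(
    regions: List[Region],
    score: Callable[[Region, Region], float] = proximity_score,
) -> List[Tuple[Region, str]]:
    """Pair every figure with the highest-scoring non-empty text region on the page."""
    figures = [r for r in regions if r.kind == "figure"]
    texts = [r for r in regions if r.kind == "text" and r.text.strip()]
    pairs: List[Tuple[Region, str]] = []
    if not texts:
        return pairs
    for fig in figures:
        best = max(texts, key=lambda t: score(fig, t))
        pairs.append((fig, best.text))
    return pairs


# Usage: one figure, a nearby caption, and an unrelated distant text block.
page = [
    Region("figure", (0, 0, 100, 100)),
    Region("text", (0, 110, 100, 130), "図1 富士山"),
    Region("text", (0, 500, 100, 520), "unrelated footer text"),
]
pairs = pair_regions(page)
```

Because the scoring function is passed in as a parameter, the geometric heuristic could be swapped for an image-text similarity score from a pretrained model without changing the pairing loop.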
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Geographic Information Systems Studies
