Synthetic Document Question Answering in Hungarian

Jonathan Li; Zoltan Csaki; Nidhi Hiremath; Etash Guha; Fenglu Hong; Edward Ma; Urmish Thakker

arXiv:2505.23008·cs.CV·May 30, 2025

TL;DR

This paper introduces Hungarian document VQA datasets, HuDocVQA and HuDocVQA-manual, along with HuCCPDF for OCR training, to improve multilingual document question answering in low-resource languages.

Contribution

The paper presents new synthetic and manual Hungarian document VQA datasets and a large OCR dataset, enabling better model training and evaluation in low-resource language settings.

Findings

01

Finetuning on the proposed datasets improves VQA accuracy by +7.2%.

02

Datasets are publicly released for further research.

03

Multiple rounds of quality filtering and deduplication are applied to HuDocVQA.
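
As a rough illustration of what filtering and deduplicating synthetic question-answer pairs can look like, here is a minimal Python sketch. It is not the paper's pipeline: the record schema (`question`, `answer` keys), the `filter_and_dedup` helper, and the two rules used (drop answers that do not appear in the page text, drop repeated questions) are all assumptions made for the example.

```python
import unicodedata


def normalize(text: str) -> str:
    # NFKC unifies equivalent Unicode forms; accents are kept on purpose,
    # since they carry meaning in Hungarian. Whitespace runs are collapsed.
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def filter_and_dedup(pairs, page_text):
    """Keep grounded, non-duplicate QA pairs for one page.

    pairs: list of dicts with 'question' and 'answer' keys (hypothetical schema).
    page_text: text extracted from the source page (assumed to be available).
    """
    page_norm = normalize(page_text)
    seen_questions = set()
    kept = []
    for pair in pairs:
        question = normalize(pair["question"])
        answer = normalize(pair["answer"])
        if not question or not answer:
            continue  # quality filter: drop empty fields
        if answer not in page_norm:
            continue  # quality filter: answer must appear in the page text
        if question in seen_questions:
            continue  # deduplication: same question after normalization
        seen_questions.add(question)
        kept.append(pair)
    return kept
```

A real pipeline would likely need fuzzier matching (near-duplicate detection, model-based quality scoring); the exact-match rules here only show the overall shape of the step.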

Abstract

Modern VLMs have achieved near-saturation accuracy in English document visual question-answering (VQA). However, this task remains challenging in lower-resource languages due to a dearth of suitable training and evaluation data. In this paper we present scalable methods for curating such datasets by focusing on Hungarian, approximately the 17th highest-resource language on the internet. Specifically, we present HuDocVQA and HuDocVQA-manual, document VQA datasets that modern VLMs significantly underperform on compared to English DocVQA. HuDocVQA-manual is a small manually curated dataset based on Hungarian documents from Common Crawl, while HuDocVQA is a larger synthetically generated VQA dataset from the same source. We apply multiple rounds of quality filtering and deduplication to HuDocVQA in order to match human-level quality in this dataset. We also present HuCCPDF, a dataset of…

Code & Models

Repositories

snova-jonathanl/hudocvqa
none · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques

Methods: LLaMA · Sparse Evolutionary Training