HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark
Aniket Pal, Ajoy Mondal, Minesh Mathew, C.V. Jawahar

TL;DR
This paper introduces HW-MLVQA, a comprehensive benchmark for multilingual handwritten document understanding that evaluates multimodal models across text, image, and combined modalities, addressing a critical gap in the field.
Contribution
The paper presents HW-MLVQA, a new benchmark of 1,600 handwritten pages and 2,400 question-answer pairs, together with a framework for evaluating models without relying on ground-truth transcriptions.
Findings
The benchmark includes 1,600 handwritten pages and 2,400 question-answer pairs.
It evaluates OCR models in a real-world setting where ground-truth transcriptions are unavailable.
It aims to facilitate progress in multilingual handwritten document understanding.
Abstract
The proliferation of Multilingual Visual Question Answering (MLVQA) benchmarks augments the capabilities of large language models (LLMs) and multi-modal LLMs, thereby enabling them to adeptly capture the intricate linguistic subtleties and visual complexities inherent across diverse languages. Despite their potential, current MLVQA models struggle to fully utilize their capabilities when dealing with the extensive variety of handwritten documents. This article delineates HW-MLVQA, an avant-garde VQA benchmark meticulously crafted to mitigate the dearth of authentic multilingual handwritten document comprehension. HW-MLVQA encompasses an extensive collection of 1,600 handwritten pages complemented by 2,400 question-answer pairs. Furthermore, it provides a robust benchmark evaluation framework spanning three distinct modalities: text, image, and an integrated image & text modality. To simulate…
