KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document
Nimol Thuon, Jun Du

TL;DR
This paper introduces KH-FUNSD, a hierarchically annotated dataset of Khmer business documents that supports layout analysis and information extraction for a low-resource, non-Latin script language.
Contribution
It presents the first publicly available Khmer document dataset with multi-level annotations, supporting layout analysis and information extraction in low-resource settings.
Findings
Benchmark results establish baseline performance for Khmer document analysis.
The dataset reveals unique challenges of non-Latin, low-resource scripts.
Hierarchical annotation improves layout understanding and entity recognition.
Abstract
Automated document layout analysis remains a major challenge for low-resource, non-Latin scripts. Khmer, a language spoken daily by over 17 million people in Cambodia, has received little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present KH-FUNSD, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3)…
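To make the first two annotation levels concrete, here is a minimal Python sketch of how such a hierarchy could be represented. It is an illustration under stated assumptions, not the dataset's actual schema: the class names `Region` and `Entity`, the helper `linked_pairs`, the field layout (loosely modeled on the original FUNSD JSON fields `id`, `text`, `box`, `label`, `linking`), and the placeholder values are all hypothetical. The third level is elided in the abstract above and is therefore not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Level 1 zones named in the abstract ("such as" implies the real list may be longer).
REGION_LABELS = {"header", "form field", "footer"}
# Level 2 FUNSD-style entity labels named in the abstract.
ENTITY_LABELS = {"question", "answer", "header", "other"}

Box = Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) pixel coordinates


@dataclass
class Entity:
    """Level 2: a FUNSD-style entity (hypothetical field layout)."""
    id: int
    text: str
    box: Box
    label: str
    # Pairs of (source_id, target_id), e.g. question -> answer.
    linking: List[Tuple[int, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.label not in ENTITY_LABELS:
            raise ValueError(f"unknown entity label: {self.label!r}")


@dataclass
class Region:
    """Level 1: a document zone that contains level 2 entities."""
    label: str
    box: Box
    entities: List[Entity] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.label not in REGION_LABELS:
            raise ValueError(f"unknown region label: {self.label!r}")


def linked_pairs(region: Region) -> List[Tuple[str, str]]:
    """Return (question text, answer text) pairs linked inside one region."""
    by_id = {e.id: e for e in region.entities}
    pairs = []
    for e in region.entities:
        if e.label != "question":
            continue
        for src, dst in e.linking:
            target = by_id.get(dst)
            if src == e.id and target is not None and target.label == "answer":
                pairs.append((e.text, target.text))
    return pairs


# Placeholder example (English stand-ins, not real dataset content).
example = Region(
    label="form field",
    box=(0, 100, 800, 400),
    entities=[
        Entity(0, "Invoice No", (10, 110, 150, 140), "question", [(0, 1)]),
        Entity(1, "INV-001", (160, 110, 300, 140), "answer", [(0, 1)]),
    ],
)
print(linked_pairs(example))
```

Nesting entities inside regions is one possible design choice; a flat entity list with a region ID per entity would carry the same information.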
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Multimodal Machine Learning Applications
