DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives
Mohammad Rifqi Farhansyah, Muhammad Zuhdi Fikri Johari, Afinzaki, Amiral, Ayu Purwarianti, Kumara Ari Yuana, Derry Tanti Wijaya

TL;DR
This paper introduces DriveThru, a platform that digitizes Indonesian language documents using OCR and LLMs, facilitating scalable resource creation for underrepresented languages in Indonesia.
Contribution
It presents a novel document digitization platform that leverages OCR and LLMs for Indonesian languages, addressing scalability issues in resource development.
Findings
DriveThru effectively extracts content from Indonesian documents.
LLMs improve OCR accuracy in post-processing.
The platform reduces manual effort and costs in language resource creation.
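The finding that LLM post-processing improves OCR accuracy can be made concrete by scoring text before and after correction. The sketch below uses character error rate (CER), a common OCR metric; this summary does not state which metric the paper reports, so CER is an assumption here. The function names and the sample strings are hypothetical illustrations, not data from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


# Invented example strings (not taken from the paper's datasets).
reference = "selamat pagi"      # ground-truth transcription
raw_ocr = "se1amat pag1"        # OCR output with two digit/letter confusions
post_corrected = "selamat pagi" # hypothetical output after LLM correction

cer_before = cer(raw_ocr, reference)        # 2 errors over 12 characters
cer_after = cer(post_corrected, reference)  # 0 errors
```

A lower CER after post-correction than before would indicate that the correction step helped on that sample; a real evaluation would average this over a held-out set of manually transcribed pages.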
Abstract
Indonesia is one of the most linguistically diverse countries in the world. Despite this diversity, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been made to construct NLP resources for Indonesian languages. However, most of these efforts have focused on creating resources manually and are therefore difficult to scale to more languages. Although many Indonesian languages do not have a web presence, there are local resources that document these languages well in printed form, such as books, magazines, and newspapers. Digitizing these existing resources will enable Indonesian language resource construction to scale to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing documents, which have not previously been used to build…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
