LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts

Thanh-Phong Le; Trung Le Chi Phan; Nghia Hieu Nguyen; Kiet Van Nguyen

arXiv:2502.19202 · cs.CL · March 10, 2025

TL;DR

This paper introduces ReceiptVQA, a large-scale Vietnamese receipt dataset for document VQA, and proposes LiGT, a layout-aware generative transformer model that effectively handles multimodal inputs for improved question answering.

Contribution

The paper presents the first large-scale Vietnamese receipt dataset and a novel layout-infused generative transformer architecture for multimodal document VQA.

Findings

01. LiGT achieves competitive results on ReceiptVQA.

02. Encoder-only models have limitations compared to generative architectures.

03. Multimodal integration is essential for effective Vietnamese receipt VQA.

Abstract

Document Visual Question Answering (Document VQA) challenges multimodal systems to holistically handle textual, layout, and visual modalities to provide appropriate answers. Document VQA has gained popularity in recent years due to the growing number of documents and the high demand for digitization. Nonetheless, most document VQA datasets are developed in high-resource languages such as English. In this paper, we present ReceiptVQA (Receipt Visual Question Answering), the first large-scale document VQA dataset in Vietnamese dedicated to receipts, a document type with high commercial potential. The dataset encompasses 9,000+ receipt images and 60,000+ manually annotated question-answer pairs. In addition, we introduce LiGT (Layout-infused Generative Transformer), a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling