Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

Chuwei Luo; Guozhi Tang; Qi Zheng; Cong Yao; Lianwen Jin; Chenliang Li; Yang Xue; Luo Si

arXiv:2206.13155 · cs.CV · June 19, 2025 · 6 cites

Open Access

TL;DR

Bi-VLDoc introduces a bidirectional vision-language pre-training paradigm with hybrid-attention to improve the understanding of visually-rich documents, significantly enhancing performance across multiple document understanding benchmarks.

Contribution

The paper proposes a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism to improve cross-modal representation learning for document understanding.

Findings

1. Achieved state-of-the-art results on Form Understanding, Receipt Information Extraction, and Document Classification.

2. Significantly improved performance on Document Visual QA.

3. Demonstrated the effectiveness of bidirectional supervision and hybrid-attention in multi-modal document modeling.

Abstract

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard benchmarks for VrDU, the way they model and exploit the interactions between vision and language on documents has hindered them from better generalization ability and higher accuracy. In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals. Specifically, a pre-training paradigm called Bi-VLDoc is proposed, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and utilize the interactions between these two modalities, to learn stronger cross-modal document representations with richer…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Natural Language Processing Techniques