Deep Learning based Visually Rich Document Content Understanding: A Survey

Yihao Ding; Soyeon Caren Han; Jean Lee; Eduard Hovy

arXiv:2408.01287 · cs.CL · June 23, 2025

Open Access

TL;DR

This survey reviews deep learning methods for understanding visually rich documents, emphasizing multimodal models that combine text, layout, and visual cues to improve information extraction across various domains.

Contribution

It provides a comprehensive categorization, comparison, and analysis of deep learning frameworks for VRD content understanding, highlighting strengths, limitations, and future directions.

Findings

01. Deep learning models significantly enhance VRD content understanding.

02. Multimodal fusion techniques are crucial for effective information extraction.

03. Pretraining strategies improve model performance and generalization.

Abstract

Visually Rich Documents (VRDs) play a vital role in domains such as academia, finance, healthcare, and marketing, as they convey information through a combination of text, layout, and visual elements. Traditional approaches to extracting information from VRDs rely heavily on expert knowledge and manual annotation, making them labor-intensive and inefficient. Recent advances in deep learning have transformed this landscape by enabling multimodal models that integrate vision, language, and layout features through pretraining, significantly improving information extraction performance. This survey presents a comprehensive overview of deep learning-based frameworks for VRD Content Understanding (VRD-CU). We categorize existing methods based on their modeling strategies and downstream tasks, and provide a comparative analysis of key components, including feature representation, fusion…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Digital Media Forensic Detection