Deep Learning based Visually Rich Document Content Understanding: A Survey
Yihao Ding, Soyeon Caren Han, Jean Lee, Eduard Hovy

TL;DR
This survey reviews deep learning methods for understanding visually rich documents, emphasizing multimodal models that combine text, layout, and visual cues to improve information extraction in domains such as academia, finance, healthcare, and marketing.
Contribution
It provides a comprehensive categorization, comparison, and analysis of deep learning frameworks for VRD content understanding, highlighting strengths, limitations, and future directions.
Findings
Deep learning models significantly enhance VRD content understanding.
Multimodal fusion techniques are crucial for effective information extraction.
Pretraining strategies improve model performance and generalization.
Abstract
Visually Rich Documents (VRDs) play a vital role in domains such as academia, finance, healthcare, and marketing, as they convey information through a combination of text, layout, and visual elements. Traditional approaches to extracting information from VRDs rely heavily on expert knowledge and manual annotation, making them labor-intensive and inefficient. Recent advances in deep learning have transformed this landscape by enabling multimodal models that integrate vision, language, and layout features through pretraining, significantly improving information extraction performance. This survey presents a comprehensive overview of deep learning-based frameworks for VRD Content Understanding (VRD-CU). We categorize existing methods based on their modeling strategies and downstream tasks, and provide a comparative analysis of key components, including feature representation, fusion…
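To make the idea of integrating language, layout, and vision features concrete, here is a minimal sketch of one simple fusion strategy: per-token concatenation of the three feature vectors. It is an illustration under stated assumptions, not a method taken from the survey. The names `Token`, `normalize_bbox`, `fuse_token`, and `fuse_page` are hypothetical, and the text and visual vectors stand in for the outputs of encoders that are not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """One OCR token with its three modalities (all fields are illustrative)."""
    text_vec: List[float]              # assumed output of some text encoder
    bbox: Tuple[int, int, int, int]    # (x0, y0, x1, y1) in page pixels
    visual_vec: List[float]            # assumed output of some image encoder


def normalize_bbox(bbox, page_width, page_height):
    """Scale pixel coordinates to [0, 1] so layout features are page-size independent."""
    x0, y0, x1, y1 = bbox
    return [x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height]


def fuse_token(token, page_width, page_height):
    """Concatenate text, layout, and visual features into a single vector."""
    layout_vec = normalize_bbox(token.bbox, page_width, page_height)
    return token.text_vec + layout_vec + token.visual_vec


def fuse_page(tokens, page_width, page_height):
    """Apply the fusion to every token on a page."""
    return [fuse_token(t, page_width, page_height) for t in tokens]


if __name__ == "__main__":
    tokens = [Token(text_vec=[0.1, 0.2], bbox=(0, 0, 50, 100), visual_vec=[0.5])]
    print(fuse_page(tokens, page_width=100, page_height=200))
```

Concatenation is only the simplest option. Pretrained multimodal models typically learn the interaction between modalities rather than stacking fixed vectors, so this sketch shows only the shape of the input that such a model would consume.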
Taxonomy
Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Digital Media Forensic Detection
