Multi-Modal Association based Grouping for Form Structure Extraction
Milan Aggarwal, Mausoom Sarkar, Hiresh Gupta, Balaji Krishnamurthy

TL;DR
This paper introduces a multi-modal deep learning approach that combines textual, spatial, and visual features to extract form structure, and reports improvements over existing semantic segmentation methods.
Contribution
The paper presents a new multi-modal method for form structure extraction that fuses BiLSTM and CNN features, and introduces a comprehensive annotated Forms Dataset.
Findings
Achieved over 90% recall for TextBlocks
Outperformed semantic segmentation baselines
Validated effectiveness through ablation studies
Abstract
Document structure extraction has been a widely researched area for decades. Recent work in this direction has been deep learning-based, mostly focusing on extracting structure with fully convolutional neural networks through semantic segmentation. In this work, we present a novel multi-modal approach for form structure extraction. Given simple elements such as textruns and widgets, we extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups, which are essential for information collection in forms. To achieve this, we obtain a local image patch around each low-level element (reference) by identifying candidate elements closest to it. We process the textual and spatial representations of the candidates sequentially through a BiLSTM to obtain context-aware representations and fuse them with image patch features obtained by passing the patch through a CNN. Subsequently, the…
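The candidate-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how "closest" is measured or how many candidates are kept, so the centre-to-centre Euclidean distance, the value of `k`, and the names `Element`, `nearest_candidates`, and `patch_box` are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Element:
    """A low-level form element (textrun or widget) with a bounding box."""
    kind: str   # "textrun" or "widget"
    text: str   # empty for widgets
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)


def nearest_candidates(reference, elements, k=3):
    """Return the k elements closest to the reference.

    Assumption: closeness is centre-to-centre Euclidean distance.
    """
    rx, ry = reference.center
    others = [e for e in elements if e is not reference]
    others.sort(key=lambda e: hypot(e.center[0] - rx, e.center[1] - ry))
    return others[:k]


def patch_box(reference, candidates):
    """Smallest box covering the reference and its candidates.

    Assumption: the local image patch is cropped to this box.
    """
    group = [reference, *candidates]
    return (min(e.x0 for e in group), min(e.y0 for e in group),
            max(e.x1 for e in group), max(e.y1 for e in group))


# Hypothetical page layout, invented for this example.
elements = [
    Element("textrun", "Name", 10, 10, 50, 20),
    Element("widget", "", 60, 10, 200, 20),
    Element("textrun", "Date of birth", 10, 40, 90, 50),
    Element("widget", "", 100, 40, 200, 50),
    Element("textrun", "Signature", 10, 300, 70, 310),
]
reference = elements[0]
candidates = nearest_candidates(reference, elements, k=2)
box = patch_box(reference, candidates)
```

In a pipeline like the one the abstract outlines, the ordered `candidates` would supply the textual and spatial sequence for the BiLSTM, and the region given by `box` would be cropped from the page image and passed to the CNN.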
Taxonomy
Topics: Handwritten Text Recognition Techniques · Music and Audio Processing · Image Retrieval and Classification Techniques
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Convolution · Bidirectional LSTM
