Multi-Modal Association based Grouping for Form Structure Extraction

Milan Aggarwal; Mausoom Sarkar; Hiresh Gupta; Balaji Krishnamurthy

arXiv:2107.04396·cs.CV·July 12, 2021

TL;DR

This paper introduces a multi-modal deep learning approach that combines textual, spatial, and visual features to extract form structures, and reports better results than semantic segmentation baselines.

Contribution

It presents a multi-modal method for form structure extraction that fuses BiLSTM-encoded textual and spatial features with CNN image-patch features, and introduces an annotated Forms Dataset.

Findings

01. Achieved over 90% recall for TextBlocks
02. Outperformed semantic segmentation baselines
03. Validated effectiveness through ablation studies

Abstract

Document structure extraction has been a widely researched area for decades. Recent work in this direction has been deep learning-based, mostly focusing on extracting structure using fully convolutional neural networks through semantic segmentation. In this work, we present a novel multi-modal approach for form structure extraction. Given simple elements such as textruns and widgets, we extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups, which are essential for information collection in forms. To achieve this, we obtain a local image patch around each low-level element (reference) by identifying candidate elements closest to it. We process textual and spatial representations of candidates sequentially through a BiLSTM to obtain context-aware representations and fuse them with image patch features obtained by processing the patch through a CNN. Subsequently, the…
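
The candidate-selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how closeness is measured or how the patch is cut, so the centre-to-centre Euclidean distance, the value of `k`, and the use of a bounding-box union as the patch are all assumptions, and the names `Element`, `nearest_candidates`, and `patch_box` are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Element:
    """A low-level form element (textrun or widget) with an axis-aligned box."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def nearest_candidates(reference, elements, k):
    """Return the k elements closest to the reference.

    Assumption: closeness is centre-to-centre Euclidean distance.
    """
    rx, ry = reference.center
    others = [e for e in elements if e is not reference]
    others.sort(key=lambda e: hypot(e.center[0] - rx, e.center[1] - ry))
    return others[:k]


def patch_box(reference, candidates):
    """Smallest box covering the reference and its candidates.

    Assumption: the local image patch is this bounding-box union.
    """
    group = [reference, *candidates]
    return (
        min(e.x0 for e in group),
        min(e.y0 for e in group),
        max(e.x1 for e in group),
        max(e.y1 for e in group),
    )


# Toy page: a label, the widget next to it, a hint below, and a far-away footer.
elements = [
    Element("label", 10, 10, 60, 20),
    Element("widget", 70, 10, 150, 20),
    Element("footer", 10, 300, 200, 310),
    Element("hint", 10, 25, 60, 33),
]
reference = elements[0]
candidates = nearest_candidates(reference, elements, k=2)
box = patch_box(reference, candidates)
```

In this toy layout the footer is excluded from the candidates, so the patch stays local to the label and its neighbours. In a pipeline like the one the abstract describes, the candidates would then feed the sequence model and the cropped patch would feed the CNN.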

Code & Models

Repositories

MMPAN-forms/MMPAN (Official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Music and Audio Processing · Image Retrieval and Classification Techniques

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Convolution · Bidirectional LSTM