FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

Chen-Yu Lee; Chun-Liang Li; Hao Zhang; Timothy Dozat; Vincent Perot; Guolong Su; Xiang Zhang; Kihyuk Sohn; Nikolai Glushnev; Renshen Wang; Joshua Ainslie; Shangbang Long; Siyang Qin; Yasuhisa Fujii; Nan Hua; Tomas Pfister

arXiv:2305.02549 · cs.CL · June 14, 2023 · 1 citation

TL;DR

FormNetV2 introduces a unified multimodal graph contrastive learning approach for form document understanding, achieving state-of-the-art results with a more efficient model by integrating visual and textual cues without complex multi-task tuning.

Contribution

It proposes a novel centralized contrastive learning strategy that unifies multimodal pre-training, simplifying the process and enhancing performance in form document information extraction.

Findings

01. Achieves new state-of-the-art results on multiple benchmarks.

02. Uses targeted visual cues without a separate image embedder.

03. Maintains a more compact model size.

Abstract

The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder.…
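The two mechanisms named in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: `edge_union_box` and `nt_xent` are hypothetical names, token boxes are taken to be `(x0, y0, x1, y1)` tuples, and the normalized-temperature cross-entropy (NT-Xent) loss stands in for "maximizing agreement" because it is a common contrastive objective, not because the text here specifies it. How the two graph views are produced and how image features are computed from the cropped region are left out.

```python
import numpy as np


def edge_union_box(box_a, box_b):
    """Smallest axis-aligned box covering both token boxes.

    Boxes are (x0, y0, x1, y1); this layout is an assumption of the sketch.
    """
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent loss between two views of the same nodes.

    z1[i] and z2[i] are embeddings of node i under two views (the positive
    pair); every other row in the combined batch acts as a negative.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude each row's similarity to itself

    # Row-wise log-softmax, shifted by the row maximum for stability.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Row i's positive is row i + n, and row i + n's positive is row i.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    return -log_prob[np.arange(2 * n), positives].mean()


if __name__ == "__main__":
    # Two tokens joined by a graph edge; the union box is the region from
    # which edge-level image features would be taken.
    print(edge_union_box((10, 20, 50, 30), (60, 18, 90, 32)))

    rng = np.random.default_rng(0)
    view_a = rng.normal(size=(8, 16))
    view_b = view_a + 0.05 * rng.normal(size=(8, 16))  # lightly perturbed view
    print(nt_xent(view_a, view_b))
```

In this sketch a single loss is computed over node embeddings regardless of which modalities were fused into them, which mirrors the abstract's point that one objective covers all modalities.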

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling

Methods: Contrastive Learning