HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction

Rujiao Long; Pengfei Wang; Zhibo Yang; Cong Yao

arXiv:2411.01139 · cs.CV · November 5, 2024

Open Access

TL;DR

HIP introduces a hierarchical point-based model with pre-training strategies for end-to-end visual information extraction, effectively handling hierarchical subtasks and improving interpretability and performance over previous methods.

Contribution

The paper proposes HIP, a novel hierarchical point modeling approach with pre-training strategies, enhancing end-to-end VIE by better capturing hierarchical structure and reducing OCR dependency.

Findings

01. HIP outperforms state-of-the-art methods on benchmarks.

02. Qualitative results demonstrate high interpretability.

03. Hierarchical pre-training improves cross-modality representation.

Abstract

End-to-end visual information extraction (VIE) aims at integrating the hierarchical subtasks of VIE, including text spotting, word grouping, and entity labeling, into a unified framework. Dealing with the gaps among the three subtasks plays a pivotal role in designing an effective VIE model. OCR-dependent methods heavily rely on offline OCR engines and inevitably suffer from OCR errors, while OCR-free methods, particularly those employing a black-box model, might produce outputs that lack interpretability or contain hallucinated content. Inspired by CenterNet, DeepSolo, and ESP, we propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task. Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities.…
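
To make the idea of decoding hierarchical points more concrete, here is a minimal sketch of an entity represented as a point hierarchy. It is an illustration only, not the authors' implementation: the two-level layout (entity over words), the names `WordPoint`, `EntityPoint`, and `centroid`, and the choice to compute an entity center as the mean of its word centers are all assumptions made for this example. The abstract says only that hierarchical points can be decoded into text transcripts, region centers, and entity categories.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(points: List[Point]) -> Point:
    """Return the arithmetic mean of a non-empty list of 2D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


@dataclass
class WordPoint:
    """Hypothetical lower level: one word, located by its center point."""
    text: str
    center: Point


@dataclass
class EntityPoint:
    """Hypothetical upper level: a labeled entity that groups words."""
    category: str
    words: List[WordPoint]

    @property
    def center(self) -> Point:
        # Assumption: the entity center is the mean of its word centers.
        return centroid([w.center for w in self.words])

    @property
    def transcript(self) -> str:
        # Assumption: words are stored in reading order.
        return " ".join(w.text for w in self.words)


# Made-up example values, not taken from any dataset.
entity = EntityPoint(
    category="total",
    words=[WordPoint("USD", (10.0, 20.0)), WordPoint("42.00", (30.0, 20.0))],
)
print(entity.category, entity.transcript, entity.center)
```

Because every output in this sketch (category, transcript, center) is attached to an explicit point in the image, each prediction can be traced back to a location, which is one way to read the interpretability claim above.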

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Robotics and Sensor-Based Localization

Methods: Pointwise Convolution · Batch Normalization · Hierarchical Feature Fusion · Dilated Convolution · Efficient Spatial Pyramid · Convolution · Cascade Corner Pooling · Deep Layer Aggregation · Center Pooling · CenterNet