Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding

Jaeyoo Park; Jin Young Choi; Jeonghyung Park; Bohyung Han

arXiv:2411.05254 · cs.CV · November 11, 2024

Open Access 1 Video

TL;DR

This paper introduces an OCR-free document understanding framework that uses multi-scale visual features and a hierarchical aggregation module to improve efficiency and accuracy in processing diverse document images with large language models.

Contribution

The paper proposes the Hierarchical Visual Feature Aggregation (HVFA) module and a new instruction tuning task, enhancing multi-scale visual feature integration and text reading capabilities in LLM-based document understanding.

Findings

01. HVFA reduces input tokens while maintaining information quality

02. The approach achieves superior performance on document understanding tasks

03. Effective handling of varying document image sizes

Abstract

We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multi-scale visual features to effectively handle various font sizes within document images. To address the increasing costs of considering the multi-scale visual inputs for MLLMs, we propose the Hierarchical Visual Feature Aggregation (HVFA) module, designed to reduce the number of input tokens to LLMs. Leveraging a feature pyramid with cross-attentive pooling, our approach effectively manages the trade-off between information loss and efficiency without being affected by varying document image sizes. Furthermore, we introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text, eventually minimizing the risk of truncated text caused by the limited…
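The token-reduction idea in the abstract can be illustrated with a small sketch. This is not the authors' HVFA implementation: it assumes a single pyramid level, a 2x2 pooling window, a mean-pooled window feature used as the attention query, and no learned projections. The function name `cross_attentive_pool` and all shapes are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attentive_pool(fine, window=2):
    """Pool an (H, W, D) feature grid down to (H // window, W // window, D).

    Assumption (not from the paper): each output token is an attention-weighted
    sum of the fine tokens in its window, with the window mean as the query.
    """
    H, W, D = fine.shape
    assert H % window == 0 and W % window == 0, "grid must divide evenly"
    h, w = H // window, W // window

    # Group the grid into non-overlapping windows: (h, w, window * window, D).
    blocks = (
        fine.reshape(h, window, w, window, D)
        .transpose(0, 2, 1, 3, 4)
        .reshape(h, w, window * window, D)
    )

    # One coarse query per window (hypothetical choice: the window mean).
    queries = blocks.mean(axis=2)

    # Scaled dot-product attention of each query over its own window.
    scores = np.einsum("hwd,hwkd->hwk", queries, blocks) / np.sqrt(D)
    weights = softmax(scores, axis=-1)

    # Weighted sum of the fine tokens gives the pooled token.
    return np.einsum("hwk,hwkd->hwd", weights, blocks)


rng = np.random.default_rng(0)
fine = rng.standard_normal((8, 8, 16))   # 64 visual tokens
pooled = cross_attentive_pool(fine)      # 16 visual tokens
print(fine.shape[0] * fine.shape[1], "->", pooled.shape[0] * pooled.shape[1])
```

Under these assumptions, each 2x2 pooling step cuts the number of visual tokens by a factor of four while letting the attention weights decide which fine-scale tokens contribute most to each pooled token. A trained module would normally add learned query, key, and value projections.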

Videos

Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding · SlidesLive

Taxonomy

Topics: Handwritten Text Recognition Techniques · Digital Media Forensic Detection · Digital and Cyber Forensics