Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models

Lei Wang; Jiabang He; Xing Xu; Ning Liu; Hui Liu

arXiv:2211.14777 · cs.CV · December 2, 2022

TL;DR

This paper introduces AETNet, a fine-tuning approach that adds alignment-aware image and text transformers and a contrastive objective on top of patch-level pre-trained document image models, and is reported to outperform existing models such as LayoutLMv3 on multiple downstream tasks.

Contribution

Proposes a novel alignment-enriched tuning framework with extra transformers and contrastive objectives for better downstream task adaptation of pre-trained document image models.

Findings

01

AETNet outperforms existing models like LayoutLMv3 on multiple tasks.

02

Incorporating alignment objectives improves patch-level and structural understanding.

03

Fine-tuning with alignment-aware contrastive learning enhances downstream performance.

Abstract

Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation and time. Thus, a question naturally arises: could we fine-tune the pre-trained models to adapt to downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) built upon pre-trained document image models, which adapts to downstream tasks with a joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in…
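The joint objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it lists the alignment levels AETNet uses, so the code below does not reproduce the authors' losses. It shows only a generic symmetric image-text contrastive (InfoNCE-style) loss over one embedding per document, added to a task loss. The function names, the temperature of 0.07, and the `alignment_weight` parameter are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Numerically stable cross-entropy for one row of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def contrastive_alignment_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive.

    Hypothetical stand-in for an alignment-aware objective, not AETNet's
    actual formulation.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text direction: each row should peak on its own column.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its own row.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def joint_loss(task_loss, alignment_loss, alignment_weight=1.0):
    """Task-specific loss plus a weighted alignment term (weight is assumed)."""
    return task_loss + alignment_weight * alignment_loss


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_alignment_loss(images, matched_texts))
    print(contrastive_alignment_loss(images, swapped_texts))
```

In this toy setup, matched image and text embeddings yield a much smaller alignment loss than swapped ones, which is the signal a contrastive term contributes alongside the supervised task loss during fine-tuning.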

Code & Models

Repositories

maehcm/aet
PyTorch · Official

Videos

Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques