Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models
Lei Wang, Jiabang He, Xing Xu, Ning Liu, Hui Liu

TL;DR
This paper introduces AETNet, a fine-tuning approach that adds alignment-aware contrastive objectives and extra image and text transformers to patch-level pre-trained document image models, and reports state-of-the-art results on downstream tasks.
Contribution
Proposes a novel alignment-enriched tuning framework with extra transformers and contrastive objectives for better downstream task adaptation of pre-trained document image models.
Findings
AETNet outperforms existing models such as LayoutLMv3 on multiple downstream tasks.
Incorporating alignment objectives improves patch-level and structural understanding.
Fine-tuning with alignment-aware contrastive learning enhances downstream performance.
Abstract
Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation and time. Thus, a question naturally arises: could we fine-tune pre-trained models on downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) built upon pre-trained document image models, which adapts to downstream tasks with a joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in…
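The joint objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one pooled embedding per document from each of the two extra encoders, uses a symmetric InfoNCE-style contrastive term as a stand-in for the alignment objective, and combines it with a plain cross-entropy task loss. The names `contrastive_loss`, `joint_loss`, `temperature`, and `alignment_weight` are hypothetical and chosen only for illustration.

```python
import math


def _log_softmax(row):
    # Numerically stable log-softmax over a list of floats.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def _normalize(vec):
    # L2-normalize so that dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form of the alignment term).

    image_embs[k] and text_embs[k] are treated as a matching pair;
    every other combination in the batch is treated as a negative.
    """
    img = [_normalize(v) for v in image_embs]
    txt = [_normalize(v) for v in text_embs]
    n = len(img)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(_log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(_log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def cross_entropy(logits, label):
    # Task-specific supervised loss for one example.
    return -_log_softmax(logits)[label]


def joint_loss(task_logits, task_labels, image_embs, text_embs,
               alignment_weight=1.0):
    """Task loss plus a weighted alignment loss (the weight is hypothetical)."""
    task = sum(cross_entropy(l, y)
               for l, y in zip(task_logits, task_labels)) / len(task_labels)
    align = contrastive_loss(image_embs, text_embs)
    return task + alignment_weight * align
```

In this sketch, matching image and text embeddings give a contrastive term near zero, while mismatched pairs are penalized, so minimizing `joint_loss` pushes the model toward both correct task predictions and aligned representations. Setting `alignment_weight` to zero recovers ordinary supervised fine-tuning.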
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques
