Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification
Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

TL;DR
This paper introduces a Hierarchical Multi-modal Transformer (HMT) that effectively models and classifies long documents with both text and images, outperforming existing methods by capturing complex cross-modal relationships.
Contribution
The paper presents a novel hierarchical multi-modal transformer architecture with a dynamic mask transfer module for improved long document classification involving text and images.
Findings
HMT outperforms state-of-the-art methods on multiple datasets.
The dynamic mask transfer module effectively integrates multi-scale features.
Hierarchical modeling captures complex cross-modal relationships.
Abstract
Long Document Classification (LDC) has gained significant attention recently. However, multi-modal data in long documents, such as texts and images, are not being effectively utilized. Prior studies in this area have attempted to integrate texts and images in document-related tasks, but they have focused only on short text sequences and page images. How to classify long documents with hierarchically structured texts and embedded images is a new problem that faces multi-modal representation difficulties. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. The HMT conducts multi-modal feature interaction and fusion between images and texts in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between…
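To make the idea of hierarchical text-image interaction concrete, here is a minimal sketch in plain Python. It is not the HMT implementation: the function names (`cross_attention`, `expand_mask`), the boolean mask layout, and the rule for passing a section-level mask down to sentences are all assumptions made for illustration, since the abstract does not specify how the model or its mask transfer works. The sketch only shows masked scaled dot-product attention from text units to image features, applied at a coarse scale (sections) and reusable at a finer scale (sentences).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values, mask):
    """Masked scaled dot-product attention (hypothetical helper).

    queries: text vectors (sections or sentences)
    keys/values: image vectors
    mask[i][j] is True when text unit i may attend to image j.
    A text unit with no allowed image gets a zero vector.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for query, row_mask in zip(queries, mask):
        allowed = [j for j, ok in enumerate(row_mask) if ok]
        if not allowed:
            outputs.append([0.0] * out_dim)
            continue
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed))
            for d in range(out_dim)
        ])
    return outputs


def expand_mask(section_mask, sentence_to_section):
    """Copy each section's image mask to the sentences it contains.

    Assumed rule for illustration only; the source does not describe
    how HMT passes information between scales.
    """
    return [list(section_mask[s]) for s in sentence_to_section]
```

In this sketch, a section-to-image mask is built once at the coarse level and then reused for sentences via `expand_mask`, so each sentence attends only to the images linked to its parent section. This is one simple way to connect two scales; the paper's dynamic mask transfer module may work differently.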
Taxonomy
Topics: Text and Document Classification Technologies · Web Data Mining and Analysis
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax
