HDT: Hierarchical Document Transformer
Haoyu He, Markus Flicke, Jan Buchmann, Iryna Gurevych, Andreas Geiger

TL;DR
The paper introduces HDT, a hierarchical sparse Transformer that uses document structure to improve both task performance and computational efficiency on structured documents in domains such as science, law, and medicine.
Contribution
HDT is a novel hierarchical sparse Transformer architecture that explicitly exploits document structure through auxiliary tokens and a multi-level attention hierarchy.
Findings
Faster convergence on downstream tasks
Higher sample efficiency compared to existing models
Improved performance on structured document benchmarks
Abstract
In this paper, we propose the Hierarchical Document Transformer (HDT), a novel sparse Transformer architecture tailored for structured hierarchical documents. Such documents are extremely important in numerous domains, including science, law or medicine. However, most existing solutions are inefficient and fail to make use of the structure inherent to documents. HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy. This approach facilitates information exchange between tokens at different levels while maintaining sparsity, thereby enhancing computational and memory efficiency while exploiting the document structure as an inductive bias. We address the technical challenge of implementing HDT's sample-dependent hierarchical attention pattern by developing a novel sparse attention kernel that…
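As a rough illustration of the kind of sparse multi-level pattern the abstract describes, the sketch below builds a boolean attention mask from a tree of anchor tokens. The rule used here (a token may attend to itself, its siblings, its parent anchor, and its direct children) and the names `hierarchical_mask`, `parent`, `[DOC]`, and `[SENT]` are assumptions made for illustration; the abstract does not spell out the exact pattern, and this dense Python mask is not the paper's sparse attention kernel.

```python
from typing import List, Optional


def hierarchical_mask(parent: List[Optional[int]]) -> List[List[bool]]:
    """Build a dense boolean attention mask from a token tree.

    parent[i] is the index of the anchor token that heads the unit
    containing token i, or None for the root anchor.
    ASSUMPTION: attention is allowed only to self, siblings, parent,
    and direct children. This is an illustrative rule, not taken from
    the source text.
    """
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            is_self = i == j
            # Siblings share the same (non-root) parent anchor.
            is_sibling = parent[i] is not None and parent[i] == parent[j]
            # Parent/child links let information flow between levels.
            is_parent_or_child = parent[i] == j or parent[j] == i
            mask[i][j] = is_self or is_sibling or is_parent_or_child
    return mask


# Hypothetical toy document: one [DOC] anchor, two [SENT] anchors,
# and two word tokens per sentence.
tokens = ["[DOC]", "[SENT]", "w1", "w2", "[SENT]", "w3", "w4"]
parent = [None, 0, 1, 1, 0, 4, 4]

mask = hierarchical_mask(parent)
for tok, row in zip(tokens, mask):
    print(f"{tok:>7} " + " ".join("x" if allowed else "." for allowed in row))
```

In this toy layout, words in different sentences cannot attend to each other directly; information has to pass through the two `[SENT]` anchors, which are siblings under `[DOC]`. That is one way to read "information exchange between tokens at different levels while maintaining sparsity".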
Taxonomy
Topics: Advanced Computational Techniques and Applications · Video Analysis and Summarization · Web Data Mining and Analysis
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Multi-Head Attention · Softmax · Linear Warmup With Cosine Annealing · Residual Connection · Byte Pair Encoding
