StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training

Kaustubh Ponkshe; Venkatapathy Subramanian; Natwar Modani; Ganesh Ramakrishnan

arXiv:2411.16618·cs.CL·November 26, 2024

TL;DR

This paper investigates how incorporating document structure into transformer-based language models, through a new masked attention mechanism, affects pre-training and downstream task performance. It emphasizes the importance of structure-aware training.

Contribution

The paper introduces a structure-aware masked attention mechanism for transformers and empirically evaluates its impact on BERT pre-training and on document understanding tasks.

Findings

1. Global attention influences attention patterns during pre-training

2. Structure-aware pre-training improves document understanding performance

3. Incorporating document structure enhances model abstraction capabilities

Abstract

Most state-of-the-art techniques for Language Models (LMs) today rely on transformer-based architectures and their ubiquitous attention mechanism. However, the quadratic growth in computational requirements with longer input sequences confines Transformers to handling short passages. Recent efforts have aimed to address this limitation by introducing selective attention mechanisms, notably local and global attention. While sparse attention mechanisms have been shown theoretically to be Turing-complete, like full attention, their practical impact on pre-training remains unexplored. This study focuses on empirically assessing the influence of global attention on BERT pre-training. The primary steps involve creating an extensive corpus of structure-aware text from arXiv data, alongside a text-only counterpart. We carry out pre-training on these two datasets, investigate…
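To make the idea of combining local and global attention concrete, here is a minimal sketch of how such a sparse attention mask could be built. It is an illustration under stated assumptions, not the paper's implementation: the function names, the window size, and the choice of which token positions receive global attention (here imagined as section-header tokens) are all hypothetical.

```python
def build_attention_mask(seq_len, window, global_positions):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when token i may attend to token j.
    Assumption: 'global' tokens attend to every token and are
    attended to by every token; all other tokens see only
    neighbours within `window` positions.
    """
    global_set = set(global_positions)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            is_local = abs(i - j) <= window
            is_global = i in global_set or j in global_set
            mask[i][j] = is_local or is_global
    return mask


def allowed_pairs(mask):
    """Count the (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Hypothetical example: 8 tokens, where positions 0 and 4 stand in
# for structure tokens (e.g. section headers) given global attention.
mask = build_attention_mask(seq_len=8, window=1, global_positions=[0, 4])
full_pairs = 8 * 8  # full attention permits every pair
sparse_pairs = allowed_pairs(mask)
```

In this sketch, ordinary tokens attend only to nearby tokens, while the designated structure tokens act as hubs that connect distant parts of the sequence, so `sparse_pairs` is smaller than `full_pairs`. In practice the mask would be applied by setting disallowed attention scores to negative infinity before the softmax.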

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Linear Layer · Adam · Residual Connection · Weight Decay · Softmax · Attention Is All You Need · Multi-Head Attention · Dense Connections · Dropout