Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Logan Hallee, Jason P. Gleghorn

TL;DR
Dual Triangle Attention is a bidirectional attention mechanism that encodes positional information without explicit positional embeddings, evaluated on masked language modeling over natural-language and protein data.
Contribution
The paper proposes a bidirectional attention method that uses two complementary triangular masks within each attention head, removing the need for positional embeddings in transformers.
Findings
Dual Triangle Attention learns positional information without explicit embeddings.
Performs well in masked language modeling tasks on language and protein data.
Achieves the best context-extension performance when paired with Rotary Positional Embeddings.
Abstract
Bidirectional transformers are the foundation of many sequence modeling tasks across natural, biological, and chemical language domains, but they are permutation-invariant without explicit positional embeddings. In contrast, unidirectional attention inherently encodes positional information through its triangular mask, enabling models to operate without positional embeddings altogether. Here, we introduce Dual Triangle Attention, a novel bidirectional attention mechanism that separates the query-key subspace of each attention head into two complementary triangular masks: one that attends to past-and-self positions and one that attends to future-and-self positions. This design provides bidirectional context while maintaining the causal mask's implicit positional inductive bias in both directions. Using PyTorch's flex_attention, Dual Triangle Attention is implemented as a single compiled…
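
The mechanism described in the abstract can be illustrated with a small dense reference sketch in NumPy. This is not the authors' flex_attention kernel. The abstract only states that each head's query-key subspace is split between a past-and-self mask and a future-and-self mask; how the two halves are normalized and combined is an assumption made here for illustration (a separate softmax per half, each applied to its own half of the value dimensions, with the results concatenated). The function names `softmax_masked` and `dual_triangle_attention` are hypothetical.

```python
import numpy as np


def softmax_masked(scores, mask):
    """Row-wise softmax over the positions where mask is True."""
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def dual_triangle_attention(q, k, v):
    """Single-head sketch for inputs of shape (seq_len, head_dim).

    Assumption (not confirmed by the abstract): each half of the head
    dimension gets its own softmax and its own half of the values.
    """
    n, d = q.shape
    assert d % 2 == 0, "head_dim must be even to split into two halves"
    h = d // 2
    idx = np.arange(n)
    past_mask = idx[None, :] <= idx[:, None]    # key j <= query i (past and self)
    future_mask = idx[None, :] >= idx[:, None]  # key j >= query i (future and self)
    scale = 1.0 / np.sqrt(h)

    # First half of the query-key dimensions sees only past-and-self keys.
    w_past = softmax_masked((q[:, :h] @ k[:, :h].T) * scale, past_mask)
    # Second half sees only future-and-self keys.
    w_future = softmax_masked((q[:, h:] @ k[:, h:].T) * scale, future_mask)

    out = np.concatenate([w_past @ v[:, :h], w_future @ v[:, h:]], axis=-1)
    return out, w_past, w_future


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
out, w_past, w_future = dual_triangle_attention(q, k, v)
```

In this sketch the first token's past-facing half can attend only to itself, and the last token's future-facing half can attend only to itself, so each position sees a differently sized window in each direction. That asymmetry is the implicit positional signal the abstract attributes to triangular masks, here present in both directions.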
