Long Context Pre-Training with Lighthouse Attention
Bowen Peng, Subho Ghosh, Jeffrey Quesnelle

TL;DR
This paper introduces Lighthouse Attention, a training-only, hierarchical, selection-based attention method that speeds up training of causal transformers at long sequence lengths by avoiding the quadratic cost of full attention. It is used within a two-stage training process.
Contribution
The authors propose a hierarchical, gradient-free selection mechanism that wraps ordinary scaled dot-product attention (SDPA) to speed up long-sequence training, and that can be removed toward the end of training to recover a full-attention model.
Findings
Training is faster than with full attention under similar settings.
Preliminary experiments show a lower final loss after the recovery phase.
Sequences are adaptively compressed and decompressed during training.
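To give a feel for why a shorter attention sequence trains faster, here is a rough cost sketch. It is not taken from the paper: the FLOP formula is the usual back-of-the-envelope estimate for one attention head, the function names are hypothetical, and the compressed length used in the example is an arbitrary illustrative value rather than a reported setting. It also ignores the cost of the pre- and post-processing steps.

```python
def attention_flops(seq_len: int, head_dim: int) -> int:
    """Rough FLOPs for one attention head at sequence length `seq_len`.

    Q @ K^T costs about 2 * N * N * d and the weighted sum over V costs
    another 2 * N * N * d. Softmax and causal-mask savings are ignored.
    """
    return 4 * seq_len * seq_len * head_dim


def attention_cost_ratio(seq_len: int, compressed_len: int, head_dim: int) -> float:
    """How many times cheaper attention is on the compressed sequence.

    `compressed_len` is a hypothetical length after compression; the cost
    of producing the compressed sequence is not included.
    """
    return attention_flops(seq_len, head_dim) / attention_flops(compressed_len, head_dim)
```

For example, `attention_cost_ratio(131072, 16384, 64)` returns `64.0`: an 8x shorter sequence makes the attention step itself about 64x cheaper under this estimate, which is why the overhead of compression has to stay subquadratic for the saving to survive.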
Abstract
Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only, symmetrical, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training. Our hierarchical selection is also gradient-free, which exempts us from dealing with a complicated and potentially inefficient backward-pass kernel. Our contribution is three-fold: (i) a subquadratic hierarchical pre- and post-processing step that performs adaptive compression and decompression of the sequence; (ii) a symmetrical compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism; (iii) a two-stage training approach which we pre-train…
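The idea of pooling queries, keys, and values together around ordinary causal SDPA can be illustrated with a small single-level sketch. This is not the authors' algorithm: it has no selection step and no hierarchy, all function names are hypothetical, mean pooling is an assumed choice, and the one-block shift used to keep the result causal is an assumption made only for this example.

```python
import math


def mean_pool(rows, block):
    """Average consecutive groups of `block` vectors into one vector each."""
    pooled = []
    for start in range(0, len(rows), block):
        group = rows[start:start + block]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def causal_sdpa(q, k, v):
    """Plain causal scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Position i may only look at positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ])
    return out


def pooled_causal_attention(q, k, v, block):
    """Hypothetical wrapper: pool Q, K, V together, attend, then expand.

    Assumption for this sketch: the pooled output of block b is written to
    the tokens of block b + 1, so each token only receives information from
    strictly earlier tokens. Tokens in the first block receive zeros.
    """
    pooled_out = causal_sdpa(mean_pool(q, block), mean_pool(k, block), mean_pool(v, block))
    dim = len(v[0])
    out = []
    for t in range(len(q)):
        b = t // block
        out.append(list(pooled_out[b - 1]) if b > 0 else [0.0] * dim)
    return out
```

With a block size of 2 and 8 tokens, SDPA runs over 4 pooled positions instead of 8. Changing token 4 leaves the outputs for tokens 0 to 5 untouched, which is the left-to-right causality property the abstract requires of the compression.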
