Long Context Pre-Training with Lighthouse Attention

Bowen Peng; Subho Ghosh; Jeffrey Quesnelle

arXiv:2605.06554·cs.CL·May 8, 2026

TL;DR

This paper introduces Lighthouse Attention, a training-only, hierarchical, selection-based attention method that speeds up training of causal transformers at long sequence lengths by avoiding the quadratic cost of full attention. It wraps ordinary SDPA, is used within a two-stage training process, and can be removed toward the end of training.

Contribution

The authors propose a hierarchical attention mechanism with gradient-free selection that speeds up training of long-sequence transformers and can be removed toward the end of training to recover full attention.

Findings

1. Faster training than full attention under similar settings.

2. Lower final loss after the recovery phase in preliminary experiments.

3. Effective adaptive compression and decompression of sequences during training.

Abstract

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only, symmetrical, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training. Our hierarchical selection is also gradient-free, which exempts us from dealing with a complicated and potentially inefficient backward-pass kernel. Our contribution is three-fold: (i) A subquadratic hierarchical pre- and post-processing step that performs adaptive compression and decompression of the sequence. (ii) A symmetrical compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism. (iii) A two-stage training approach which we pre-train…
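To make the idea of symmetrical pooling concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's algorithm: the abstract shown here is truncated and does not specify the pooling operator, the selection rule, or the decompression step. The sketch assumes mean pooling over fixed blocks of a hypothetical size `p`, applies the same pooling to queries, keys, and values, and then runs ordinary causal SDPA on the shorter sequence. The function names (`causal_sdpa`, `mean_pool`, `pooled_causal_attention`) are invented for this example.

```python
import numpy as np

def causal_sdpa(q, k, v):
    """Plain causal scaled dot-product attention for one head; inputs are (n, d)."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # Mask positions j > i so each row only attends to itself and the past.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def mean_pool(x, p):
    """Average non-overlapping blocks of p consecutive tokens (assumed pooling operator)."""
    n, d = x.shape
    assert n % p == 0, "sequence length must be a multiple of the block size"
    return x.reshape(n // p, p, d).mean(axis=1)

def pooled_causal_attention(q, k, v, p):
    """Pool q, k, and v with the same block size, then attend causally over blocks.

    Assumption: this only illustrates the symmetric compression; the paper's
    adaptive selection and decompression steps are not modeled here.
    """
    return causal_sdpa(mean_pool(q, p), mean_pool(k, p), mean_pool(v, p))

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 4)) for _ in range(3))
out = pooled_causal_attention(q, k, v, p=4)
print(out.shape)  # (4, 4): 4 pooled blocks, head dimension 4
```

With 16 tokens and `p=4`, the score matrix shrinks from 16×16 to 4×4. Causality holds at block granularity in this sketch: the output for a block depends only on tokens up to the end of that block, so changing later tokens leaves earlier block outputs unchanged.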

Code & Models

Repositories

ighoshsubho/lighthouse-attention
github

Datasets

aoiandroid/papers
dataset · 28 downloads
