BFLA: Block-Filtered Long-Context Attention Mechanism

Chong Wu; Zhenan Feng; Renjie Xu; Houwang Zhang; Jiawang Cao; Maolin Che; Wenbo Zhu; Hong Yan

arXiv:2605.12193 · eess.SP · May 13, 2026

TL;DR

BFLA introduces a training-free sparse attention mechanism that accelerates long-context inference in large language models with minimal accuracy loss.

Contribution

The paper presents a two-stage block-filtered attention method that can be integrated into existing models without retraining or model modification.

Findings

01 BFLA significantly speeds up long-context prefilling.

02 Accuracy degrades only minimally relative to dense attention.

03 BFLA is compatible with multiple large language model series (Gemma 4, Llama 3.1, Qwen 3.5, and Qwen 3.6).

Abstract

This paper proposes Block-Filtered Long-Context Attention (BFLA), a training-free sparse prefill attention mechanism for long-context inference. BFLA adopts a two-stage design. In Stage 1, query and key sequences are compressed into coarse blocks, and lightweight block-level softmax mass estimation is performed to construct an input-dependent block importance mask. In Stage 2, the coarse mask is expanded to the Triton attention-tile grid. Several tile-level rescue strategies are applied to reduce information loss, where a fused sparse prefill kernel skips unimportant KV tiles while preserving exact token-level attention inside every retained tile. BFLA requires no retraining, calibration, preprocessing, or model modification and can be plugged into existing vLLM-style paged-attention workloads. Experiments on Gemma 4, Llama 3.1, Qwen 3.5, and Qwen 3.6 series models show that BFLA…
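The two-stage design described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the mean pooling, the cumulative-mass threshold `keep_mass`, the always-keep-the-diagonal rule (a stand-in for the unspecified tile-level rescue strategies), and all function names are assumptions made for illustration. The real method uses a fused Triton kernel over paged KV storage; the Python loop below only mimics its skipping behaviour for a single attention head.

```python
import numpy as np

def softmax_rows(s):
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def block_mask(q, k, block, keep_mass):
    """Stage 1 (assumed form): mean-pool Q and K into coarse blocks, softmax
    the block scores, and keep the highest-mass key blocks per query block
    until their estimated softmax mass reaches keep_mass."""
    n, d = q.shape
    nb = n // block  # assumes n is a multiple of block
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    causal = np.tril(np.ones((nb, nb), dtype=bool))
    p = softmax_rows(np.where(causal, qb @ kb.T / np.sqrt(d), -np.inf))
    order = np.argsort(-p, axis=1)                 # blocks by descending mass
    sorted_p = np.take_along_axis(p, order, axis=1)
    mass_before = np.cumsum(sorted_p, axis=1) - sorted_p
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, order, mass_before < keep_mass, axis=1)
    mask &= causal
    mask |= np.eye(nb, dtype=bool)  # assumed "rescue": always keep the diagonal
    return mask

def expand_to_tiles(mask, block, tile):
    """Stage 2a: expand the coarse mask to a finer tile grid
    (assumes block is a multiple of tile)."""
    r = block // tile
    m = np.repeat(np.repeat(mask, r, axis=0), r, axis=1)
    return m & np.tril(np.ones_like(m))  # drop tiles above the diagonal

def sparse_prefill(q, k, v, tile_mask, tile):
    """Stage 2b: skip KV tiles that are masked out; attention over the
    retained tiles is exact at token level."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n // tile):
        q_idx = np.arange(i * tile, (i + 1) * tile)
        kept = np.flatnonzero(tile_mask[i])
        kv_idx = np.concatenate([np.arange(j * tile, (j + 1) * tile) for j in kept])
        s = q[q_idx] @ k[kv_idx].T / np.sqrt(d)
        s = np.where(kv_idx[None, :] <= q_idx[:, None], s, -np.inf)  # token-level causality
        out[q_idx] = softmax_rows(s) @ v[kv_idx]
    return out

def dense_causal(q, k, v):
    """Dense causal attention, used only as a reference."""
    n, d = q.shape
    s = np.where(np.tril(np.ones((n, n), dtype=bool)), q @ k.T / np.sqrt(d), -np.inf)
    return softmax_rows(s) @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
    tiles = expand_to_tiles(block_mask(q, k, block=16, keep_mass=0.9), 16, 8)
    err = np.abs(sparse_prefill(q, k, v, tiles, 8) - dense_causal(q, k, v)).max()
    print("tiles kept:", int(tiles.sum()), "max abs error vs dense:", err)
```

In this sketch, setting `keep_mass` above 1 retains every causal block, so the sparse output matches dense causal attention; lowering it drops more KV tiles at the cost of a larger approximation error.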

Code & Models

Repositories

Alicewithrabbit/BFLA (GitHub)