Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Quantong Qiu, Zhiyi Hong, Yi Yang, Haitian Wang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang

TL;DR
Flux Attention is a dynamic, context-aware hybrid attention mechanism that chooses between full and sparse attention for each layer, speeding up long-context LLM inference by up to 2.8× in prefill and 2.0× in decoding.
Contribution
The paper proposes a lightweight Layer Router that is integrated into a frozen pretrained LLM and routes each layer to full or sparse attention based on the input context, improving efficiency without sacrificing retrieval quality.
Findings
Achieves speedups of up to 2.8× in the prefill stage and 2.0× in the decoding stage.
Requires only 12 hours of training on 8×A800 GPUs.
Demonstrates superior performance-speed trade-offs on multiple benchmarks.
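The two stage-level figures above do not directly give an end-to-end number, because the overall gain depends on how a request's time splits between prefill and decoding. A minimal sketch of that arithmetic follows; the function name and the time split are illustrative assumptions, and the 2.8× and 2.0× defaults are the reported best-case ("up to") values, so the result is an upper bound rather than a measured figure.

```python
def end_to_end_speedup(prefill_fraction: float,
                       prefill_speedup: float = 2.8,
                       decode_speedup: float = 2.0) -> float:
    """Combine per-stage speedups into one overall speedup.

    prefill_fraction is the share of baseline wall-clock time spent in
    prefill (a hypothetical input, not a number from the paper); the rest
    is assumed to be decoding.
    """
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be in [0, 1]")
    decode_fraction = 1.0 - prefill_fraction
    # New total time, in units of the baseline total time.
    new_time = prefill_fraction / prefill_speedup + decode_fraction / decode_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    for share in (0.0, 0.5, 1.0):
        print(f"prefill share {share:.1f}: {end_to_end_speedup(share):.2f}x")
```

With an even split between the two stages, the bound works out to about 2.33×, between the two per-stage figures.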
Abstract
The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context.…
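The layer-level routing idea in the abstract can be illustrated with a small cost model. This is a sketch under stated assumptions, not the paper's implementation: the router scores are supplied by hand rather than computed from the input context, the 0.5 threshold is arbitrary, and the sparse pattern is a streaming-style "sink tokens plus sliding window" mask chosen only as an example of sparse attention. All function names are hypothetical.

```python
from typing import List


def full_attention_pairs(n: int) -> int:
    """Query-key pairs scored by causal full attention over n tokens."""
    return n * (n + 1) // 2


def sparse_attention_pairs(n: int, window: int, sinks: int) -> int:
    """Pairs scored when each query sees only the first `sinks` tokens
    plus the most recent `window` tokens (itself included)."""
    total = 0
    for i in range(n):
        visible = i + 1  # causal mask: keys 0..i
        total += min(visible, sinks + window)
    return total


def route_layers(scores: List[float], threshold: float = 0.5) -> List[str]:
    """Map one score per layer to 'FA' (full) or 'SA' (sparse).

    The scores stand in for the output of a learned router; here they
    are plain inputs (an assumption made for illustration).
    """
    return ["FA" if s >= threshold else "SA" for s in scores]


def hybrid_cost(routes: List[str], n: int, window: int, sinks: int) -> int:
    """Total scored pairs across layers for a given routing decision."""
    fa = full_attention_pairs(n)
    sa = sparse_attention_pairs(n, window, sinks)
    return sum(fa if r == "FA" else sa for r in routes)


if __name__ == "__main__":
    n, window, sinks = 32_768, 1_024, 4
    scores = [0.9, 0.2, 0.7, 0.1]  # hypothetical per-layer router scores
    routes = route_layers(scores)
    dense = hybrid_cost(["FA"] * len(scores), n, window, sinks)
    mixed = hybrid_cost(routes, n, window, sinks)
    print(routes, f"cost ratio vs. all-FA: {mixed / dense:.2f}")
```

Because every head in a layer follows the same decision, the work per layer stays uniform, which is the motivation the abstract gives for routing at the layer level rather than the head level.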
Code & Models
- 🤗 QQTang1223/full_streaming_Llama-3.1-8B-Instruct (model, 5 downloads)
- 🤗 QQTang1223/full_streaming_Qwen3-4B (model, 4 downloads)
- 🤗 QQTang1223/full_streaming_Qwen3-8B (model, 6 downloads)
- 🤗 QQTang1223/full_triangle_Llama-3.1-8B-Instruct (model, 7 downloads)
- 🤗 QQTang1223/full_triangle_Qwen3-4B (model, 4 downloads)
- 🤗 QQTang1223/full_triangle_Qwen3-8B (model, 7 downloads)
- 🤗 QQTang1223/full_xattn_Llama-3.1-8B-Instruct (model, 6 downloads)
- 🤗 QQTang1223/full_xattn_Qwen3-4B (model, 6 downloads)
- 🤗 QQTang1223/full_xattn_Qwen3-8B (model, 6 downloads, 1 like)
- 🤗 QQTang1223/xattn_streaming_Qwen3-4B (model, 59 downloads)
