Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

Quantong Qiu; Zhiyi Hong; Yi Yang; Haitian Wang; Kebin Liu; Qingqing Dang; Juntao Li; Min Zhang

arXiv:2604.07394·cs.LG·April 10, 2026

1 Repo 10 Models 2 Datasets

TL;DR

Flux Attention is a context-aware hybrid attention mechanism that chooses between full and sparse attention for each layer based on the input, yielding up to 2.8× faster prefill and 2.0× faster decoding for long-context LLMs.

Contribution

The paper proposes a lightweight Layer Router, integrated into frozen pretrained LLMs, that routes each layer to full or sparse attention based on the input context, improving efficiency without sacrificing retrieval quality.
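To make the idea of layer-level routing concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the `LayerRouter` class, its mean-pooling and linear scoring, the sink-plus-sliding-window sparse pattern, and the use of the hidden states directly as queries, keys, and values are all assumptions chosen for illustration. Router training is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    """Causal full attention (FA): each token attends to all earlier tokens."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    return softmax(np.where(causal, scores, -np.inf)) @ v

def sparse_attention(q, k, v, window=4, sinks=1):
    """Causal sparse attention (SA). The pattern is an assumption:
    the first `sinks` tokens plus a sliding window of recent tokens."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    keep = (j <= i) & ((i - j < window) | (j < sinks))
    return softmax(np.where(keep, scores, -np.inf)) @ v

class LayerRouter:
    """Hypothetical router: one small linear scorer per layer."""
    def __init__(self, d_model, n_layers, seed=0):
        rng = np.random.default_rng(seed)
        # Maps a pooled context vector to two logits: (FA, SA).
        self.w = rng.normal(scale=0.02, size=(n_layers, d_model, 2))

    def route(self, hidden, layer):
        pooled = hidden.mean(axis=0)     # summarize the input context
        logits = pooled @ self.w[layer]  # shape (2,)
        return "FA" if logits[0] >= logits[1] else "SA"

def routed_attention(hidden, layer, router):
    """Apply FA or SA to the whole layer, as chosen by the router."""
    mode = router.route(hidden, layer)
    attend = full_attention if mode == "FA" else sparse_attention
    # Simplification: hidden states stand in for Q, K, and V.
    return attend(hidden, hidden, hidden), mode
```

Because the decision is made once per layer, every head in that layer performs the same kind of work, which is the property the abstract contrasts with head-level dynamic sparsity.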

Findings

1. Achieves up to 2.8× speedup in the prefill stage and 2.0× in the decoding stage.

2. Requires only 12 hours of training on 8×A800 GPUs.

3. Demonstrates superior performance-speed trade-offs on multiple benchmarks.

Abstract

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context.…
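Two claims in the abstract can be illustrated with a toy cost model: full attention scales quadratically with context length, and head-level dynamic sparsity can leave parallel hardware idle while it waits for the slowest head. The sketch below counts attended query-key pairs as a stand-in for compute. The function names and the sliding-window pattern are assumptions, and real kernel timings will differ from these counts.

```python
def causal_full_pairs(n):
    """Query-key pairs scored by causal full attention over n tokens."""
    return n * (n + 1) // 2

def causal_window_pairs(n, window):
    """Pairs scored when each token sees at most `window` recent keys."""
    return sum(min(i + 1, window) for i in range(n))

def parallel_utilization(per_head_keys):
    """Fraction of useful work when heads run in parallel and synchronize.
    The slowest head sets the latency; faster heads sit idle until it ends."""
    slowest = max(per_head_keys)
    return sum(per_head_keys) / (len(per_head_keys) * slowest)

# Quadratic versus roughly linear growth in context length.
assert causal_full_pairs(4) == 10
assert causal_window_pairs(4, 2) == 7

# Head-level sparsity: uneven per-head budgets waste parallel capacity.
assert parallel_utilization([4096, 512, 512, 1024]) == 0.375
# Layer-level routing: every head in the layer has the same budget.
assert parallel_utilization([1024, 1024, 1024, 1024]) == 1.0
```

Under this model, routing whole layers keeps utilization at 1.0 by construction, whereas uneven per-head budgets lower it. This matches the motivation in the abstract, though it is not a measurement from the paper.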

Code & Models

Repositories

qqtang-code/FluxAttention (GitHub)
