Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu

TL;DR
This paper introduces a comprehensive benchmark for evaluating attention mechanisms in long-context large language models, focusing on kernel efficiency and distributed parallelism to guide future research and deployment.
Contribution
It presents a unified, extensible benchmark that systematically compares operator-level and module-level attention strategies across various contexts and scales.
Findings
Kernel-level optimizations speed up attention on long sequences (evaluated at sequence lengths up to 512K).
Distributed context parallelism scales attention across multiple GPUs (evaluated on up to 96 GPUs).
The benchmark reveals trade-offs between efficiency, scalability, and usability.
Abstract
Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for long-context training. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, often referred to as distributed attention or context parallel training, which scale attention across multiple devices. However, systematic evaluation still remains limited: operator-level comparisons are often incomplete, while context parallel strategies are typically framework-specific, with unclear performance analysis across contexts. To address these gaps, we propose a unified benchmark that integrates representative attention kernels and context parallel…
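The quadratic cost named in the abstract can be made concrete with a little arithmetic. The sketch below is not taken from the paper: the function names are hypothetical, and it assumes a naive implementation that materializes the full score matrix in a 16-bit dtype.

```python
def attention_score_bytes(seq_len: int, num_heads: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store one seq_len x seq_len score matrix per head.

    Assumption: naive attention that keeps every score in memory;
    bytes_per_elem=2 models fp16/bf16.
    """
    return num_heads * seq_len * seq_len * bytes_per_elem


def attention_matmul_flops(seq_len: int, head_dim: int, num_heads: int = 1) -> int:
    """Forward FLOPs of the two matmuls QK^T and PV, counting 2 FLOPs per multiply-add."""
    per_matmul = 2 * seq_len * seq_len * head_dim
    return num_heads * 2 * per_matmul


if __name__ == "__main__":
    for n in (8 * 1024, 64 * 1024, 512 * 1024):
        gib = attention_score_bytes(n) / 2**30
        print(f"seq_len={n:>7}: {gib:,.1f} GiB per head for the score matrix")
```

Doubling the sequence length quadruples both quantities. At a sequence length of 512K, a single head's 16-bit score matrix would already occupy 512 GiB under these assumptions, which is the pressure that motivates both fused kernels that avoid storing the matrix and splitting the sequence across devices.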
Peer Reviews
Decision: ICLR 2026 Poster
1. LongCA-bench unifies diverse attention kernels (dense/sparse) and distributed mechanisms under a modular interface, enabling fair cross-method comparisons and addressing the fragmentation of existing evaluations.
2. It systematically explores two understudied but impactful dimensions: 14 attention mask patterns (static/dynamic, regular/heterogeneous) and extremely long sequences (up to 512K) with large-scale distributed training (up to 96 GPUs), filling gaps in prior work.
3. The benchmark uses rea…

0. Why no linear attention kernels?
1. The optimized distributed attention mechanisms only support 4 mask patterns (FULL, CAUSAL, FULL/CAUSAL DOCUMENT), excluding heterogeneous and dynamic masks, which restricts its applicability to complex long-context tasks.
2. The benchmark excludes FlexAttention from full evaluations due to severe out-of-memory issues, and most sparse kernels lack backward computation support or flexibility (e.g., fixed block sizes), limiting insights into trainable sparse attent…

(1) Extensive Method Integration: This work uses a unified interface to integrate 12 representative attention kernels and 5 distributed mechanisms.
(2) Good Scalability: The evaluation is conducted on scenarios with sequence lengths up to 512K and across 96 GPUs.
(3) Practical Insights: Through experimental evaluation, the authors obtain insightful conclusions regarding the impact of mask patterns, the trade-offs between kernel efficiency and usability, and the scalability characteristics of dif…

(1) Architectural Limitation: The study is limited to the Hopper architecture and does not discuss the generalization of experimental conclusions to other architectures.
(2) Performance Metric Limitation: The research focuses only on throughput and memory usage as performance metrics. It does not analyze how metrics such as memory bandwidth utilization and inter-node communication load vary over time across different kernels and distributed mechanisms.

1. **Comprehensive benchmarking.** The paper systematically benchmarks a wide range of attention implementations, including dense, sparse, and distributed mechanisms, under a unified framework. The experimental coverage (up to 96 GPUs and multiple mask types) is extensive and provides a clear view of current attention efficiency trends.
2. **Sound experimental methodology.** The experiments are well-organized, use realistic settings (e.g., long context lengths, different mask patterns), and repo…

1. **Limited Analysis.** The paper reports extensive throughput and memory results, but offers limited discussion on the underlying causes of observed performance trends of the benchmarked methods.
2. **Incomplete coverage of the most critical setting: distributed sparse attention.** The integration of sparse attention (particularly dynamic block-sparse attention with TopK/TopP selection criterion) into distributed contexts remains an unexplored and practically important challenge. The paper do…
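The four mask patterns the reviews name as supported by the distributed mechanisms (FULL, CAUSAL, FULL DOCUMENT, CAUSAL DOCUMENT) can be illustrated with a small sketch. It is not the benchmark's code: `document_mask` and `mask_density` are hypothetical helpers, and it assumes the common packed-sequence reading in which a "document" mask lets tokens attend only within their own document.

```python
from typing import List


def document_mask(doc_lens: List[int], causal: bool) -> List[List[bool]]:
    """Boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumption: doc_lens lists the lengths of documents packed back to back.
    A single document gives the plain FULL (causal=False) or CAUSAL mask.
    """
    total = sum(doc_lens)
    # doc_id[t] is the index of the document that token t belongs to.
    doc_id = [d for d, n in enumerate(doc_lens) for _ in range(n)]
    return [
        [doc_id[q] == doc_id[k] and (not causal or k <= q) for k in range(total)]
        for q in range(total)
    ]


def mask_density(mask: List[List[bool]]) -> float:
    """Fraction of query/key pairs that are unmasked (1.0 for FULL)."""
    n = len(mask)
    if n == 0:
        return 0.0
    return sum(sum(row) for row in mask) / (n * n)


if __name__ == "__main__":
    for name, lens, causal in [
        ("FULL", [4], False),
        ("CAUSAL", [4], True),
        ("FULL DOCUMENT", [2, 2], False),
        ("CAUSAL DOCUMENT", [2, 2], True),
    ]:
        print(name, mask_density(document_mask(lens, causal)))
```

The density differs from pattern to pattern, which is one reason a kernel or a context-parallel scheme tuned for one mask may not perform the same on another.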
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Big Data and Digital Economy · Multimodal Machine Learning Applications
