Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration

Arundhathi Dev; Justin Zhan

arXiv:2603.18417 · cs.LG · March 20, 2026

TL;DR

This paper introduces AFBS-BO, an automated hyperparameter optimization framework for sparse attention in transformers. On Llama-2-7B, it discovers hyperparameters 3.4x faster than grid search while using 8.8x fewer evaluations.

Contribution

The paper presents AFBS-BO, a hybrid Bayesian optimization and binary search method that automates layer-specific hyperparameter tuning for sparse attention, enabling plug-and-play acceleration.

Findings

01 Accelerates hyperparameter discovery on Llama-2-7B by 3.4x compared to grid search.

02 Requires 8.8x fewer evaluations than grid search.

03 Outperforms existing sparse attention baselines while matching dense attention quality.

Abstract

Sparse attention mechanisms promise to break the quadratic bottleneck of long-context transformers, yet production adoption remains limited by a critical usability gap: optimal hyperparameters vary substantially across layers and models, and current methods (e.g., SpargeAttn) rely on manual grid search to identify them. We propose AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), a fully automated framework that discovers optimal layer- and head-specific hyperparameters without human intervention. Our hybrid algorithm combines Bayesian Optimization for global exploration with binary search for local refinement, leveraging multi-fidelity evaluation across sequence lengths to reduce tuning cost. On Llama-2-7B, AFBS-BO accelerates hyperparameter discovery by 3.4x with 8.8x fewer evaluations than grid search, and identifies high-sparsity configurations that outperform…
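The hybrid idea in the abstract (Bayesian optimization for global exploration, binary search for local refinement, cheap short-sequence evaluations before an expensive long-sequence one) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the hyperparameters, the surrogate model, or the acquisition function. The names `toy_error`, `theta`, `tau`, `eps`, `low_len`, and `high_len` are hypothetical, the error function is synthetic, and the sketch assumes error grows monotonically with `tau` so that binary search applies.

```python
import numpy as np

def toy_error(theta, tau, seq_len):
    """Synthetic stand-in for attention approximation error (NOT a real model).
    Grows monotonically with tau, is smallest near theta = 0.6, and is
    slightly larger at longer (higher-fidelity) sequence lengths."""
    base = tau ** 2 * (1.0 + 4.0 * (theta - 0.6) ** 2)
    return base * (1.0 + 0.1 * (1.0 - 512.0 / seq_len))

def max_feasible_tau(theta, seq_len, eps, iters=20):
    """Binary search for the largest tau in [0, 1] with error <= eps.
    Invariant: lo is always feasible (tau = 0 gives zero error)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if toy_error(theta, mid, seq_len) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

def gp_posterior(x_train, y_train, x_query, length=0.2, noise=1e-4):
    """Zero-mean Gaussian process posterior with an RBF kernel."""
    def k(a, b):
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = k(x_train, x_query)
    mu = Ks.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def tune(eps=0.05, low_len=512, high_len=4096, n_init=3, n_iter=8, seed=0):
    """Explore theta with GP-UCB at low fidelity, scoring each theta by the
    tau that binary search finds; then refine tau once at high fidelity."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 101)  # candidate theta values
    idx = [int(i) for i in rng.choice(len(grid), size=n_init, replace=False)]
    scores = [max_feasible_tau(grid[i], low_len, eps) for i in idx]
    for _ in range(n_iter):
        y = np.array(scores)
        y = (y - y.mean()) / (y.std() + 1e-12)  # standardize for the GP prior
        mu, sigma = gp_posterior(grid[idx], y, grid)
        ucb = mu + 2.0 * sigma                  # upper confidence bound
        ucb[idx] = -np.inf                      # never re-evaluate a point
        nxt = int(np.argmax(ucb))
        idx.append(nxt)
        scores.append(max_feasible_tau(grid[nxt], low_len, eps))
    best_theta = float(grid[idx[int(np.argmax(scores))]])
    best_tau = max_feasible_tau(best_theta, high_len, eps)  # high-fidelity refine
    return best_theta, best_tau

if __name__ == "__main__":
    theta, tau = tune()
    print(f"theta={theta:.2f}, tau={tau:.4f}, "
          f"error={toy_error(theta, tau, 4096):.4f}")
```

In this sketch only the final binary search runs at the long sequence length; every exploratory evaluation uses the short one, which is where a multi-fidelity scheme would save cost. A real tuner would replace `toy_error` with a measured quality gap against dense attention and would repeat the search per layer or head.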

Taxonomy

Topics: Machine Learning in Materials Science · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques