Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration
Arundhathi Dev, Justin Zhan

TL;DR
This paper introduces AFBS-BO, an automated hyperparameter optimization framework for sparse attention in transformers. On Llama-2-7B it finds hyperparameters 3.4x faster than grid search, using 8.8x fewer evaluations.
Contribution
The paper presents AFBS-BO, a hybrid Bayesian optimization and binary search method that automates layer-specific hyperparameter tuning for sparse attention, enabling plug-and-play acceleration.
Findings
Accelerates hyperparameter discovery by 3.4x compared to grid search on Llama-2-7B.
Requires 8.8x fewer evaluations than grid search.
Outperforms existing sparse attention baselines while matching dense attention quality.
Abstract
Sparse attention mechanisms promise to break the quadratic bottleneck of long-context transformers, yet production adoption remains limited by a critical usability gap: optimal hyperparameters vary substantially across layers and models, and current methods (e.g., SpargeAttn) rely on manual grid search to identify them. We propose AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), a fully automated framework that discovers optimal layer- and head-specific hyperparameters without human intervention. Our hybrid algorithm combines Bayesian Optimization for global exploration with binary search for local refinement, leveraging multi-fidelity evaluation across sequence lengths to reduce tuning cost. On Llama-2-7B, AFBS-BO accelerates hyperparameter discovery by 3.4x with 8.8x fewer evaluations than grid search, and identifies high-sparsity configurations that outperform…
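The local-refinement idea in the abstract can be pictured as a binary search over a single sparsity threshold, run first at a short sequence length (cheap, low fidelity) and then narrowed at a longer one. The sketch below is a toy illustration under stated assumptions, not the authors' algorithm: `toy_error`, the tolerance, the sequence lengths, and the bracket-widening rule are all hypothetical, the error is assumed to grow monotonically with the threshold, and the Bayesian optimization stage is omitted entirely.

```python
import math


def toy_error(threshold: float, seq_len: int) -> float:
    """Hypothetical stand-in for the output error of one sparse-attention layer.

    Assumed monotonically increasing in `threshold`; longer sequences are
    assumed to tolerate slightly less sparsity.
    """
    return threshold ** 2 * (1.0 + 0.1 * math.log2(seq_len / 512))


def binary_search(error_fn, seq_len, lo, hi, tol, steps):
    """Shrink [lo, hi] around the largest threshold whose error stays <= tol."""
    evals = 0
    for _ in range(steps):
        mid = (lo + hi) / 2
        evals += 1
        if error_fn(mid, seq_len) <= tol:
            lo = mid  # mid is acceptable: try a more aggressive threshold
        else:
            hi = mid  # mid is too lossy: back off
    return lo, hi, evals


def multi_fidelity_search(error_fn, tol=0.05, schedule=((512, 6), (4096, 4))):
    """Run the binary search at each (seq_len, steps) fidelity in turn.

    Between fidelities the bracket is widened by twice its width on each
    side (an assumed heuristic), because the acceptable boundary can shift
    when the sequence length changes.
    """
    lo, hi = 0.0, 1.0
    total_evals = 0
    for i, (seq_len, steps) in enumerate(schedule):
        if i > 0:
            width = hi - lo
            lo, hi = max(0.0, lo - 2 * width), min(1.0, hi + 2 * width)
        lo, hi, evals = binary_search(error_fn, seq_len, lo, hi, tol, steps)
        total_evals += evals
    return lo, total_evals


if __name__ == "__main__":
    threshold, n_evals = multi_fidelity_search(toy_error)
    print(f"threshold={threshold:.4f} after {n_evals} evaluations")
```

Most evaluations in this schedule happen at the short sequence length, so only a few expensive long-sequence evaluations are needed to settle the final value. The returned threshold is verified only if at least one midpoint passed at the highest fidelity; a production version would re-check it explicitly.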
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
