Adaptive Head Budgeting for Efficient Multi-Head Attention

Bilal Faye; Abdoulaye Mbaye; Hanane Azzag; Mustapha Lebbah

arXiv:2604.22583·cs.LG·April 27, 2026


TL;DR

This paper introduces BudgetFormer, an adaptive multi-head attention Transformer that dynamically allocates attention heads based on input complexity, reducing computational costs while maintaining or improving performance.

Contribution

The paper presents an adaptive multi-head attention mechanism, paired with a training strategy, that allocates attention heads per input instead of activating every head uniformly.

Findings

01 Reduces inference FLOPs and memory usage.

02 Achieves comparable or better performance than standard multi-head attention.

03 Effectively adapts to varying input complexities in text classification.
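
To make the FLOPs finding concrete, here is a rough back-of-the-envelope sketch of how the per-head part of attention scales with the number of active heads. The function name `attention_flops`, the sequence length, head dimension, and head counts are all illustrative assumptions, not figures from the paper, and the count ignores projections, softmax, and any gating overhead.

```python
def attention_flops(seq_len: int, head_dim: int, active_heads: int) -> int:
    """Rough multiply-add count for the per-head part of attention.

    Counts only Q @ K^T and the attention-weighted sum over V.
    Projections, softmax, and gating overhead are ignored
    (a simplifying assumption for this sketch).
    """
    scores = seq_len * seq_len * head_dim    # Q @ K^T for one head
    weighted = seq_len * seq_len * head_dim  # attention weights @ V for one head
    return active_heads * (scores + weighted)


# Illustrative numbers only: 8 heads in total, 3 kept active for an "easy" input.
full = attention_flops(seq_len=128, head_dim=64, active_heads=8)
budgeted = attention_flops(seq_len=128, head_dim=64, active_heads=3)
ratio = budgeted / full
print(ratio)  # 0.375
```

Under these assumptions the per-head cost is linear in the number of active heads, so running 3 of 8 heads costs 3/8 of the full per-head computation. Real savings depend on how much of the model sits outside the skipped heads.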

Abstract

Transformers have become the dominant architecture across a wide range of domains, largely due to the effectiveness of multi-head attention in capturing diverse representation subspaces. However, standard multi-head attention activates all heads uniformly for every input, regardless of task requirements or input complexity. In many scenarios, particularly for coarse-grained tasks such as text classification, the relevant information is often global and does not require the full diversity of attention heads. As a consequence, using a fixed number of heads can introduce unnecessary computational cost or lead to suboptimal performance when the allocation does not match the input. To address this limitation, we introduce BudgetFormer, a Transformer architecture equipped with an adaptive multi-head attention mechanism that dynamically allocates computational resources. Our approach learns,…
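
The abstract is truncated before it describes the allocation mechanism, so the following is only a minimal sketch of one way per-input head selection could work, not BudgetFormer's actual method. It assumes a hypothetical learned gate (`gate_weights`, one weight vector per head) applied to a mean-pooled summary of the input, with a sigmoid threshold deciding which heads run. The names `select_heads`, `threshold`, and `min_heads` are invented for illustration.

```python
import math


def select_heads(tokens, gate_weights, threshold=0.5, min_heads=1):
    """Pick which attention heads to run for one input (illustrative only).

    tokens: list of token embeddings, each a list of floats.
    gate_weights: one weight vector per head (a hypothetical learned gate).
    Returns (active, probs): indices of heads whose gate probability exceeds
    `threshold`, falling back to the top `min_heads` heads if too few pass.
    """
    dim = len(tokens[0])
    # Mean-pool the tokens into a single summary vector for the input.
    pooled = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]

    # One sigmoid gate per head, computed from the pooled summary.
    probs = []
    for w in gate_weights:
        logit = sum(wi * pi for wi, pi in zip(w, pooled))
        probs.append(1.0 / (1.0 + math.exp(-logit)))

    active = [h for h, p in enumerate(probs) if p > threshold]
    if len(active) < min_heads:
        # Always keep at least `min_heads` heads so attention is never empty.
        ranked = sorted(range(len(probs)), key=lambda h: probs[h], reverse=True)
        active = sorted(ranked[:min_heads])
    return active, probs


# Toy input: two 2-dimensional tokens and four heads with made-up gate weights.
tokens = [[1.0, 0.0], [0.0, 1.0]]
gate_weights = [[2.0, 2.0], [-2.0, -2.0], [0.0, 4.0], [-4.0, 0.0]]
active, probs = select_heads(tokens, gate_weights)
print(active)  # [0, 2]
```

Only the heads listed in `active` would then be computed for this input. A hard threshold like this is not differentiable, so a trainable version would need some relaxation or auxiliary objective; the source only states that a training strategy exists and does not describe it here.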
