Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers

Firas Gabetni; Giuseppe Curci; Andrea Pilzer; Subhankar Roy; Elisa Ricci; Gianni Franchi

arXiv:2510.18358·cs.LG·April 22, 2026

TL;DR

Hydra Ensembles is an efficient transformer ensemble method that prunes attention heads to create diverse members and merges those members into a single compact model, achieving strong uncertainty quantification at an inference speed close to that of a single network.

Contribution

We propose Hydra Ensembles, a pruning and merging technique for transformers that improves uncertainty quantification efficiently without retraining from scratch.

Findings

01. Hydra Ensembles match or surpass Deep Ensembles in UQ performance.

02. The method achieves inference speed close to that of a single network.

03. It outperforms state-of-the-art methods in zero-shot ImageNet classification.
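
The first finding compares against Deep Ensembles on UQ. As a reminder of what that baseline computes, here is a minimal sketch of the standard ensemble prediction: average the members' softmax probabilities, then score uncertainty with the entropy of the averaged distribution. Predictive entropy is only one common score and is an assumption here; this page does not say which metrics the paper reports. The function names and logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(member_logits):
    """Average the per-member class probabilities (Deep Ensembles style)."""
    probs = [softmax(logits) for logits in member_logits]
    n_members = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n_members for k in range(n_classes)]


def entropy(p):
    """Predictive entropy in nats; higher means more uncertain."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


# Hypothetical logits from two members on a 3-class problem.
agree = ensemble_predict([[4.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
disagree = ensemble_predict([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]])

print(entropy(agree), entropy(disagree))
```

When the members disagree, the averaged distribution is flatter and its entropy is higher, which is why member diversity matters for this kind of uncertainty estimate.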

Abstract

Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep…
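
The abstract mentions merging members through grouped fully-connected layers. The sketch below illustrates only the generic idea of a grouped layer, not the paper's actual architecture: the input is split into equal chunks, and each group applies its own weight matrix to its own chunk, which is equivalent to a block-diagonal dense layer. How Hydra Ensembles assigns pruned heads to groups is not described on this page, so the member weights, sizes, and function names are all hypothetical.

```python
def linear(x, weight):
    """Dense layer without bias; weight is a list of rows (out x in)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def grouped_linear(x, group_weights):
    """Grouped fully-connected layer.

    Splits x into len(group_weights) equal chunks and applies each
    group's weight matrix to its own chunk only.
    """
    groups = len(group_weights)
    if len(x) % groups != 0:
        raise ValueError("input size must be divisible by the number of groups")
    chunk = len(x) // groups
    out = []
    for g, weight in enumerate(group_weights):
        out.extend(linear(x[g * chunk:(g + 1) * chunk], weight))
    return out


# Two hypothetical ensemble members, each with its own 2 -> 2 projection.
member_a = [[1.0, 0.0], [0.0, 2.0]]
member_b = [[0.5, 0.5], [1.0, -1.0]]

# Features belonging to each member, concatenated into one vector.
x_a, x_b = [1.0, 2.0], [3.0, 4.0]

merged = grouped_linear(x_a + x_b, [member_a, member_b])
separate = linear(x_a, member_a) + linear(x_b, member_b)

print(merged == separate)
```

The merged call produces the same outputs as running the two members one after the other. In a tensor library the grouped form can be dispatched as a single batched operation, which is one plausible reading of how a merged ensemble could approach single-network inference speed.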

Videos

Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers (SlidesLive)