Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
Giansalvo Cirrincione

TL;DR
The paper introduces the Hierarchical Kernel Transformer (HKT), a multi-scale attention mechanism that comes with theoretical guarantees and improves on standard attention on ListOps, CIFAR-10, and IMDB sentiment.
Contribution
HKT contributes a multi-scale attention framework backed by theoretical analysis, with total computational cost bounded by 4/3 times that of standard attention and accuracy gains on the benchmarks listed under Findings.
Findings
HKT gains +4.77 pp on ListOps
HKT gains +1.44 pp in accuracy on CIFAR-10
HKT gains +7.47 pp on IMDB sentiment
Abstract
The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125x for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an…
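Two of the abstract's statements can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper. It assumes each level halves the sequence length (the `stride` parameter is hypothetical), so that the quadratic score matrix at level l costs (1/4)^l of the full-resolution one. Under that assumption the stated 1.3125x for L = 3 and the 4/3 bound both follow from a geometric series. The second helper shows the standard split of a square score matrix S into (S + S^T)/2 and (S - S^T)/2, which is the generic form of the symmetric/antisymmetric decomposition the abstract refers to. Function names are made up for this example.

```python
from fractions import Fraction


def hierarchical_cost_ratio(num_levels, stride=2):
    """Total score-matrix cost relative to full-resolution attention.

    Assumption (not stated in the abstract): level l works on a sequence
    of length n / stride**l, so its quadratic cost is (1 / stride**2)**l
    times the cost of the full-resolution level.
    """
    per_level = Fraction(1, stride * stride)
    return sum((per_level ** level for level in range(num_levels)), Fraction(0))


def split_symmetric_antisymmetric(scores):
    """Split a square matrix S into (S + S^T) / 2 and (S - S^T) / 2."""
    n = len(scores)
    sym = [[(scores[i][j] + scores[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(scores[i][j] - scores[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti


# Three levels: 1 + 1/4 + 1/16 = 21/16
ratio_three_levels = hierarchical_cost_ratio(3)
print(float(ratio_three_levels))  # 1.3125

# Toy 2x2 score matrix; the two parts add back up to the original.
sym_part, anti_part = split_symmetric_antisymmetric([[1.0, 4.0], [2.0, 3.0]])
print(sym_part, anti_part)
```

With more levels the ratio approaches, but never reaches, 4/3, which matches the bound quoted in the abstract under the halving assumption.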
