Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis

Giansalvo Cirrincione

arXiv:2604.08829 · cs.LG · April 13, 2026

TL;DR

The paper introduces the Hierarchical Kernel Transformer (HKT), a multi-scale attention mechanism with theoretical guarantees that improves on standard attention on ListOps, CIFAR-10, and IMDB sentiment.

Contribution

HKT is a multi-scale attention framework backed by theoretical analysis. Its total computational cost is bounded by 4/3 times that of standard attention, and it improves results on the reported benchmarks.

Findings

01. HKT gains +4.77 pp on ListOps.

02. HKT gains +1.44 pp accuracy on CIFAR-10.

03. HKT gains +7.47 pp on IMDB sentiment.

Abstract

The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125x for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an…
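
The cost bound and the symmetric/antisymmetric split described in the abstract can be checked with a small sketch. It is illustrative only and not the author's implementation. It assumes (the abstract does not say so explicitly) that each level halves the sequence length, so level l costs 1/4^l of a full attention pass; this assumption reproduces the quoted 1.3125x for L = 3 and the 4/3 limit. The function names `hkt_cost_ratio` and `split_scores` are hypothetical.

```python
from fractions import Fraction


def hkt_cost_ratio(num_levels, factor=2):
    """Cost of all levels relative to one full-resolution attention pass.

    Assumption: level l sees a sequence shortened by factor**l, so its
    score matrix costs 1 / factor**(2 * l) of the full-resolution one.
    """
    return sum(
        (Fraction(1, factor ** (2 * level)) for level in range(num_levels)),
        Fraction(0),
    )


def split_scores(scores):
    """Split a square score matrix into symmetric and antisymmetric parts.

    sym[i][j]  = (S[i][j] + S[j][i]) / 2   (reciprocal component)
    anti[i][j] = (S[i][j] - S[j][i]) / 2   (directional component)
    The two parts add back to S, and the split is unique.
    """
    n = len(scores)
    sym = [[(scores[i][j] + scores[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(scores[i][j] - scores[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti


if __name__ == "__main__":
    print(float(hkt_cost_ratio(3)))   # 1 + 1/4 + 1/16 = 1.3125
    print(hkt_cost_ratio(20) < Fraction(4, 3))  # the partial sums stay below 4/3

    sym, anti = split_scores([[1.0, 4.0], [2.0, 3.0]])
    print(sym)   # [[1.0, 3.0], [3.0, 3.0]]
    print(anti)  # [[0.0, 1.0], [-1.0, 0.0]]
```

Exact fractions are used for the cost ratio so that the comparison with 4/3 is not affected by floating-point rounding. In HKT the split would apply once per resolution level, giving L such pairs.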
