Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi
Ella Koresh, Ronit D. Gross, Yuval Meir, Yarden Tzach, Tal Halevi, and Ido Kanter

TL;DR
This paper reveals a unified underlying learning mechanism in CNN and vision transformer (ViT) architectures, showing how individual nodes identify small clusters of output labels and how multi-head attention (MHA) heads specialize, leading to efficient pruning and head cooperation.
Contribution
It introduces a quantitative framework based on single-nodal performance (SNP) that unifies CNNs and ViTs, demonstrating label cluster sharpening, head specialization, and an effective pruning method.
Findings
SNP measures node performance and label clustering.
Pruning via ANDC maintains accuracy.
MHA heads spontaneously specialize in label subsets.
Abstract
Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers, whereas vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. Both are designed to solve complex classification tasks but from different perspectives. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism, which quantitatively measures the single-nodal performance (SNP) of each node in feedforward (FF) and multi-head attention (MHA) sub-blocks. Each node identifies small clusters of possible output labels, with additional noise represented as labels outside these clusters. These features are progressively sharpened along the transformer encoders, enhancing the signal-to-noise ratio. This unified underlying learning mechanism…
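The idea of a node that responds to a small cluster of labels, with the remaining labels acting as noise, can be illustrated with a toy calculation. The sketch below is not the paper's SNP procedure: the response vectors, the 0.5 threshold, and the helper names `label_cluster` and `signal_to_noise` are all assumptions made for illustration. It only shows how a sharper cluster in a later layer would translate into a higher signal-to-noise ratio.

```python
# Toy illustration only: hypothetical per-label responses of a single node.
# The thresholding rule and the ratio below are assumptions, not the paper's method.

def label_cluster(responses, threshold=0.5):
    """Indices of labels whose response is at least threshold * peak response."""
    peak = max(responses)
    return [i for i, r in enumerate(responses) if r >= threshold * peak]

def signal_to_noise(responses, cluster):
    """Mean response inside the cluster divided by mean response outside it."""
    members = set(cluster)
    inside = [r for i, r in enumerate(responses) if i in members]
    outside = [r for i, r in enumerate(responses) if i not in members]
    signal = sum(inside) / len(inside)
    noise = sum(outside) / len(outside) if outside else 0.0
    return float("inf") if noise == 0 else signal / noise

# Hypothetical responses of one node to six labels, in an early and a late layer.
early_layer = [0.9, 0.8, 0.3, 0.3, 0.2, 0.3]
late_layer = [0.9, 0.8, 0.05, 0.05, 0.1, 0.0]

early_cluster = label_cluster(early_layer)
late_cluster = label_cluster(late_layer)

snr_early = signal_to_noise(early_layer, early_cluster)
snr_late = signal_to_noise(late_layer, late_cluster)
print(early_cluster, late_cluster, snr_late > snr_early)
```

In this made-up example both layers select the same two-label cluster, but the late layer responds far less to labels outside it, so its ratio is larger.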
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Layer Normalization · Dense Connections · Softmax · Residual Connection · Linear Layer · Vision Transformer · Multi-Head Attention · Pruning
