Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi

Ella Koresh; Ronit D. Gross; Yuval Meir; Yarden Tzach; Tal Halevi; Ido Kanter

arXiv:2501.12900·cs.LG·April 10, 2025

Open Access

TL;DR

This paper reveals a unified underlying learning mechanism in CNNs and ViT architectures, showing how nodes identify label clusters and how MHA heads specialize, leading to efficient pruning and head cooperation.

Contribution

It introduces a quantitative SNP-based framework unifying CNNs and ViTs, demonstrating label cluster sharpening, head specialization, and an effective pruning method.

Findings

01. SNP measures node performance and label clustering.

02. Pruning via ANDC maintains accuracy.

03. MHA heads spontaneously specialize in label subsets.

Abstract

Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers, whereas vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. Both are designed to solve complex classification tasks but from different perspectives. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism, which quantitatively measures the single-nodal performance (SNP) of each node in feedforward (FF) and multi-head attention (MHA) sub-blocks. Each node identifies small clusters of possible output labels, with additional noise represented as labels outside these clusters. These features are progressively sharpened along the transformer encoders, enhancing the signal-to-noise ratio. This unified underlying learning mechanism…
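The idea that a node "identifies a small cluster of labels" with the remaining labels acting as noise can be illustrated with a minimal sketch. This is not the paper's SNP procedure, which the truncated abstract does not spell out. The function name `node_label_cluster`, the relative `threshold`, and the mean-ratio definition of signal-to-noise are all assumptions made for illustration only.

```python
import numpy as np


def node_label_cluster(activations, labels, num_labels, threshold=0.5):
    """Illustrative only (assumed procedure, not taken from the paper).

    activations: 1-D array, one node's output for each input sample.
    labels:      1-D int array, the true label of each sample.
    Returns the labels whose mean activation is at least `threshold`
    times the peak mean (the "cluster"), and a signal-to-noise ratio
    defined here as mean in-cluster response / mean out-of-cluster response.
    """
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels)

    # Mean response of this node to samples of each label.
    means = np.zeros(num_labels)
    for k in range(num_labels):
        mask = labels == k
        if mask.any():
            means[k] = activations[mask].mean()

    peak = means.max()
    if peak <= 0:
        # The node never responds positively: no cluster under this definition.
        return np.array([], dtype=int), 0.0

    cluster = np.flatnonzero(means >= threshold * peak)
    signal = means[cluster].mean()
    outside = np.delete(means, cluster)
    noise = outside.mean() if outside.size else 0.0
    snr = signal / noise if noise > 0 else float("inf")
    return cluster, snr
```

Under these assumptions, a node whose cluster shrinks and whose ratio grows from one encoder to the next would correspond to the "sharpening" described in the abstract. The actual measurement in the paper may differ.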

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Layer Normalization · Dense Connections · Softmax · Residual Connection · Linear Layer · Vision Transformer · Multi-Head Attention · Pruning