Is logical analysis performed by transformers taking place in self-attention or in the fully connected part?

Evgeniy Shin; Heinrich Matzinger

arXiv:2501.11765·cs.CL·January 22, 2025

TL;DR

This paper investigates whether logical analysis in transformers takes place in self-attention or in the fully connected layers. It shows that self-attention can itself perform logical operations, contrary to the traditional view that it only aggregates information.

Contribution

The study introduces a handcrafted single-level encoder layer that performs logical analysis within self-attention, and analyzes whether a one-level transformer trained by gradient descent uses self-attention or the fully connected layers for that analysis.

Findings

01 Self-attention can perform logical analysis, not just information aggregation.

02 Gradient descent can get stuck at undesired zeros, affecting learning.

03 Explicit methods to avoid these zeros improve model training.

Abstract

The transformer architecture applies self-attention to tokens represented as vectors, followed by a fully connected (neural network) layer. These two parts can be stacked many times. Traditionally, self-attention is seen as a mechanism for aggregating information before logical operations are performed by the fully connected layer. In this paper we show that, quite counter-intuitively, the logical analysis can also be performed within self-attention. To this end, we implement a handcrafted single-level encoder layer that performs the logical analysis within self-attention. We then study the scenario in which a one-level transformer model undergoes self-learning using gradient descent. We investigate whether the model uses the fully connected layers or the self-attention mechanism for logical analysis when it has the choice. Given that gradient descent can become stuck at undesired zeros, we…
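The two-part structure described in the abstract (self-attention over token vectors, then a fully connected layer) can be sketched in a few lines. This is only an illustrative, minimal single-head layer in plain Python, not the authors' handcrafted construction: the function names, the `params` dictionary, and the identity weights are all assumptions made for the example, and residual connections and layer normalization are left out.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention.
    queries = [matvec(Wq, t) for t in tokens]
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def feed_forward(x, W1, b1, W2, b2):
    # Fully connected part: linear, ReLU, linear.
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W1, x), b1)]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

def encoder_layer(tokens, params):
    # Self-attention first, then the fully connected layer per token.
    attended = self_attention(tokens, params["Wq"], params["Wk"], params["Wv"])
    return [feed_forward(a, params["W1"], params["b1"],
                         params["W2"], params["b2"]) for a in attended]

# Hypothetical toy setup: two one-hot tokens and identity weights everywhere.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {"Wq": identity, "Wk": identity, "Wv": identity,
          "W1": identity, "b1": [0.0, 0.0],
          "W2": identity, "b2": [0.0, 0.0]}
tokens = [[1.0, 0.0], [0.0, 1.0]]
output = encoder_layer(tokens, params)
```

With these toy weights each output row is a weighted average of the two one-hot tokens, tilted toward the token that issued the query. Whether a given computation is carried by the attention weights or by the fully connected layer depends on how the weight matrices are set, which is the choice the paper examines.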

Taxonomy

Topics: Logic, Reasoning, and Knowledge

Methods: Self-Learning