Simulating Hard Attention Using Soft Attention

Andy Yang; Lena Strobl; David Chiang; Dana Angluin

arXiv:2412.09925·cs.LG·June 27, 2025

TL;DR

This paper studies when soft-attention transformers can simulate hard attention, that is, focus effectively all attention on a subset of input positions, using either unbounded positional embeddings or temperature scaling.

Contribution

The paper shows how soft-attention transformers can simulate hard attention using unbounded positional embeddings or temperature scaling, where the temperature depends on the minimum gap between the maximum attention score and the other attention scores.

Findings

01

Soft attention can simulate hard attention with unbounded positional embeddings.

02

Temperature scaling enables softmax transformers to mimic hard-attention behavior.

03

Soft-attention transformers can compute formulas of several variants of linear temporal logic, which define subclasses of the languages recognized by hard-attention transformers.

Abstract

We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several subclasses of languages recognized by hard-attention transformers, which can be defined in variants of linear temporal logic. We demonstrate how soft-attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate general hard-attention transformers, using a temperature that depends on the minimum gap between the maximum attention scores and other attention scores.
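The role of the score gap can be illustrated numerically. The sketch below is not the paper's construction; it assumes a single attention row with a unique maximum score, compares temperature-scaled softmax against average-hard attention (uniform weight on the maximal positions), and uses the elementary bound that the weight off the maximum is at most (n - 1) * exp(-gap / tau). The function names, the example scores, and the tolerance `eps` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax_with_temperature(scores, tau):
    """Softmax over scores / tau, computed stably by subtracting the max."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.max()) / tau
    w = np.exp(z)
    return w / w.sum()

def hard_attention(scores):
    """Average-hard attention: uniform weight on the maximal positions."""
    scores = np.asarray(scores, dtype=float)
    is_max = (scores == scores.max()).astype(float)
    return is_max / is_max.sum()

def temperature_for_error(gap, n, eps):
    """Temperature at which the elementary bound (n - 1) * exp(-gap / tau)
    equals eps. Assumes n >= 2, 0 < eps < 1, and a unique maximum."""
    return gap / np.log((n - 1) / eps)

# Hypothetical attention scores for one query over four positions.
scores = np.array([1.0, 3.0, 2.5, 0.0])
n = len(scores)
# Gap between the maximum score and the runner-up.
top_two = np.sort(scores)[-2:]
gap = top_two[1] - top_two[0]
eps = 1e-3

tau = temperature_for_error(gap, n, eps)
soft = softmax_with_temperature(scores, tau)
hard = hard_attention(scores)

print("temperature:", tau)
print("soft weights:", soft)
print("max deviation from hard attention:", np.abs(soft - hard).max())
```

A smaller gap or a longer input forces a lower temperature under this bound, which matches the abstract's statement that the required temperature depends on the minimum gap between the maximum attention scores and the others.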

Taxonomy

Topics: Fault Detection and Control Systems · Brain Tumor Detection and Classification · Neural Networks and Applications

Methods: Attention Is All You Need · Focus · Softmax