Minimalist Softmax Attention Provably Learns Constrained Boolean Functions

Jerry Yao-Chieh Hu; Xiwen Zhang; Maojiang Su; Zhao Song; Han Liu

arXiv:2505.19531 · cs.LG · May 27, 2025

Open Access · 5 Reviews

TL;DR

This paper studies whether a minimalist single-head softmax-attention mechanism can learn constrained Boolean functions ($k$-bit AND/OR and their noisy variants). It shows that the mechanism fails on its own but succeeds with teacher forcing, exposing a gap between learning with and without intermediate supervision.

Contribution

It shows that a single-head softmax-attention mechanism cannot learn $k$-bit AND/OR functions on its own but can do so with teacher forcing. A minimalist architecture (no deep Transformer blocks or FFNs) and a single supervised gradient update are therefore sufficient.

Findings

01

Single-head softmax attention cannot learn AND/OR functions without intermediate supervision (teacher forcing).

02

Teacher forcing enables learning these Boolean functions with the same minimalist attention.

03

One gradient update with supervision can replace the multi-step Chain-of-Thought (CoT) reasoning scheme of Kim and Suzuki (ICLR 2025).

Abstract

We study the computational limits of learning $k$-bit Boolean functions (specifically, $AND$, $OR$, and their noisy variants), using a minimalist single-head softmax-attention mechanism, where $k = \Theta(d)$ relevant bits are selected from $d$ inputs. We show that these simple $AND$ and $OR$ functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental…
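The task described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's construction: the function name `sample_task`, the uniform input distribution, and the uniformly random choice of relevant positions are all assumptions made for the example.

```python
import numpy as np

def sample_task(d, k, n, seed=0):
    """Draw n inputs of d bits and label them with k-bit AND and OR.

    Hypothetical helper for illustration only; the paper's actual data
    distribution is not given in this listing.
    """
    rng = np.random.default_rng(seed)
    # Hidden subset of k relevant positions among the d inputs.
    relevant = np.sort(rng.choice(d, size=k, replace=False))
    # Assumption: uniform random bits.
    x = rng.integers(0, 2, size=(n, d))
    # AND is 1 only if every relevant bit is 1; OR is 1 if any is.
    y_and = x[:, relevant].all(axis=1).astype(int)
    y_or = x[:, relevant].any(axis=1).astype(int)
    return relevant, x, y_and, y_or

relevant, x, y_and, y_or = sample_task(d=16, k=8, n=1000)
print("relevant positions:", relevant)
print("fraction AND=1:", y_and.mean(), "fraction OR=1:", y_or.mean())
```

Under uniform bits the AND label equals 1 with probability $2^{-k}$, so for large $k$ the labels are heavily imbalanced. That is a property of this toy distribution, not a claim about the paper's setup.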

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The majority of the work is presented very clearly. For example, the research question is made very explicit: "can gradient descent training on input-output examples learn to attend to the correct k bits and reliably compute the AND/OR?" Overall, as an outsider to this domain, I found the paper very insightful and qualitative. Only the soundness of the contribution I could not verify. The work is well motivated, expanding our knowledge of what a softmax-attention mechanism can and cannot do, s

Weaknesses

Novelty. While certainly different, the work heavily connects to [Kim & Suzuki, 2025] who have studied the ability of the transformer architecture to learn the parity function. This work instead studies an easier Boolean problem, expanding insights of the architecture. The work certainly appears novel, but I was unable to assess this with high confidence given it is outside my expertise. The presentation of the contributions was very clear. However, minor remark, some parts of the contribution

Reviewer 02 · Rating 4 · Confidence 4

Strengths

On the originality side, I appreciated the deep analysis of hints (with vs. without) for monotone $k$-bit AND/OR, which complements the recent body of CoT research. The proof strategy is also smart and I like it. The noisy-hint extension (like a causal intervention) and a local-majority variant broaden the conceptual message beyond the exact noiseless AND/OR idealization. I believe the paper’s thesis and experimentation are clearly articulated and timely. Even with strong idealizations, the result is a minimal testbed

Weaknesses

1. I am afraid the analysis looks at the issue from a very idealized perspective. If my understanding is correct, the attention is content-independent, meaning that a free parameter $W$ replaces $K^\top Q$ and does not depend on $X$! This makes the module a fixed positional mixer, not self-attention, and it cannot adapt attention to the input content. Thus, claims like “one-head softmax attention learns high-dimensional Boolean concepts” risk being over-interpreted, meaning that what is learned here
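The distinction this reviewer draws can be illustrated with a small NumPy sketch. It is an assumption-laden toy, not the paper's model: the function names, shapes, and the row-wise $QK^\top$ convention are chosen for the example. In `fixed_mixer` the attention scores are a free parameter `W`, so the mixing weights are the same for every input; in `self_attention` the scores are computed from `X` itself.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fixed_mixer(X, W):
    """Content-independent attention: scores are the free parameter W."""
    # X: (tokens, dim), W: (tokens, tokens). The weights ignore X.
    return softmax(W, axis=-1) @ X

def content_weights(X, Wq, Wk):
    """Standard self-attention weights: scores depend on X."""
    scores = (X @ Wq) @ (X @ Wk).T
    return softmax(scores, axis=-1)

def self_attention(X, Wq, Wk):
    return content_weights(X, Wq, Wk) @ X

X1 = np.eye(3)
X2 = 2.0 * np.eye(3)
I = np.eye(3)
# With W = 0, the fixed mixer averages all tokens uniformly for any input.
print(fixed_mixer(X1, np.zeros((3, 3))))
# Content-dependent weights change when the input changes.
print(np.allclose(content_weights(X1, I, I), content_weights(X2, I, I)))
```

With `W` fixed, `softmax(W)` is identical for `X1` and `X2`, whereas the content-dependent weights differ between the two inputs.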

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. **One-step gradient derivation under oracle supervision.** The paper rigorously derives that, given perfect intermediate supervision, a single gradient step from $$ W^{(0)} = 0 \quad\Rightarrow\quad W^{(1)}_{j,m} = \frac{d^{\varepsilon/8}}{8}\mathbf{1}\{p[j]=m\} + O(d^{-\varepsilon/8}), $$ leads to an attention mask satisfying $$ \left\| 2\,\mathrm{Softmax}(W^{(1)})\mathbf{1}_t - v_b \right\|_\infty = O(d^{-\varepsilon/8}), $$ meaning the model exactly identifies the relevant subs

Weaknesses

1. **Misleading framing of contributions.** The abstract and introduction claim a “fundamental computational limit” for monotone AND/OR functions, yet the paper never proves a formal lower bound for these functions. The only lower bound provided is a generic PAC-style hardness lemma of the form $ \mathbb{E}_{b,x}\!\left[\min_j |(v_b - f_\theta(x,y))_j|\right] \ge \min\!\left\{\frac{k}{d}, 1 - \frac{k}{d}\right\} - e^{-\Theta(d)}, $ which is derived directly from parity and majorit

Reviewer 04 · Rating 6 · Confidence 4

Strengths

1. Analyzing learning k-sparse AND/OR is well motivated by previous work on learning parity with and without teacher forcing. This is an even simpler setting than that prior work.
2. The high level results and structure of the paper are largely clear. The comparison between Theorem 4.1 and 4.3 gives theoretical evidence for a supervision gap on the k-sparse AND/OR problem

Weaknesses

### Interpretation of Results

> In real-world scenarios of this form, relying on end-to-end training alone (with no auxiliary signals) may be doomed to fail

Technically you have not proven that a more complicated model with more heads and layers could not efficiently learn k-AND. Yes, expanding the model class seems like it wouldn't make learning easier, but perhaps there is some kind of complex inductive bias that emerges in larger models (or a lottery ticket effect with multiple heads) that w

Reviewer 05 · Rating 2 · Confidence 3

Strengths

Paper is clear and understandable, although prone to overstatement in places.

Weaknesses

The results in this paper are a minor variation of Kim and Suzuki (2025), who instantiate the results of Shalev-Shwartz et al. (2017) in the realm of transformer networks. In light of Shalev-Shwartz et al. (2017), one expects there are infinitely many architectural variations that support a dichotomy where end-to-end learning fails, but decomposition succeeds.

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and Algorithms · Machine Learning and Data Classification

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing