Trapped by simplicity: When Transformers fail to learn from noisy features
Evan Peters, Ando Deng, Matheus H. Zambianco, Devin Blankespoor, Achim Kempf

TL;DR
This paper investigates whether transformers trained on noisy features can learn boolean functions that still predict correctly on noiseless inputs. It finds that their bias toward simpler, lower-sensitivity functions limits them, and proposes methods to mitigate these failures.
Contribution
It demonstrates that transformers can learn some boolean functions from noisy features but struggle with others, and introduces techniques to overcome their bias toward simpler solutions.
Findings
Transformers succeed at noise-robust learning for sparse parity and majority functions.
Transformers fail at learning random k-juntas with noisy features.
Adding a sensitivity-penalizing loss improves transformers' ability to learn from noisy features.
Abstract
Noise is ubiquitous in data used to train large language models, but it is not well understood whether these models are able to correctly generalize to inputs generated without noise. Here, we study noise-robust learning: are transformers trained on data with noisy features able to find a target function that correctly predicts labels for noiseless features? We show that transformers succeed at noise-robust learning for a selection of $k$-sparse parity and majority functions, compared to LSTMs which fail at this task for even modest feature noise. However, we find that transformers typically fail at noise-robust learning of random $k$-juntas, especially when the boolean sensitivity of the optimal solution is smaller than that of the target function. We argue that this failure is due to a combination of two factors: transformers' bias toward simpler functions, combined with an…
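The noise-robust learning setup described in the abstract can be sketched as a data pipeline. This is a minimal illustration, not the authors' code: it assumes uniformly random bitstrings, labels computed from the clean bits, and independent bit flips with probability `p`. The helper names, the choice of relevant coordinates `idx`, and the sizes are all hypothetical.

```python
import numpy as np

def sparse_parity(x, idx):
    """Parity (XOR) of the bits of x at positions idx; x has entries in {0, 1}."""
    return x[:, idx].sum(axis=1) % 2

def majority(x, idx):
    """1 if more than half of the selected bits are 1 (use an odd number of bits)."""
    return (x[:, idx].sum(axis=1) * 2 > len(idx)).astype(int)

def flip_bits(x, p, rng):
    """Flip each bit independently with probability p (iid symmetric noise)."""
    flips = rng.random(x.shape) < p
    return np.where(flips, 1 - x, x)

def make_split(target, n_bits, n_samples, p, rng):
    """Assumption: labels come from the clean features; the learner sees noisy ones."""
    x_clean = rng.integers(0, 2, size=(n_samples, n_bits))
    y = target(x_clean)
    x_noisy = flip_bits(x_clean, p, rng)
    return x_clean, x_noisy, y

rng = np.random.default_rng(0)
idx = [0, 3, 7]  # hypothetical relevant coordinates, so k = 3
x_clean, x_noisy, y = make_split(
    lambda x: sparse_parity(x, idx), n_bits=20, n_samples=10_000, p=0.1, rng=rng
)

# A model would be trained on (x_noisy, y) and evaluated on (x_clean, y).
# Even the true target disagrees with the labels on noisy inputs:
# the expected agreement is (1 + (1 - 2p)^k) / 2, about 0.756 for p = 0.1, k = 3.
agree = (sparse_parity(x_noisy, idx) == y).mean()
```

The gap between `agree` and 1 shows why accuracy on noisy validation data is capped below 100% even for a model that has found the target function exactly, which is why the noiseless evaluation is the relevant test.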
Peer Reviews
Decision: ICLR 2026 Poster
- The motivation for the theoretical setup is clear (e.g. inspiration from modern-day LLM training).
- Extensive experiments are provided for demonstrating when transformers can and cannot learn boolean functions (e.g. for which $k$ learning a $k$-sparse parity with feature noise is possible, as well as majority functions and other $k$-juntas).
- The analysis of the sensitivity of the optimal predictor under feature noise versus the teacher is an interesting perspective, and a conjectur

- The analysis for parity and majority seems quite straightforward, and it would be interesting if there was more theoretical analysis on progress towards the conjecture.
- For the LSTM model, it would be useful to have some theoretical analysis of this setting for learning boolean functions too.
- Perhaps an example of a realistic setting of feature noise in training transformers would be useful, and I believe this would better motivate this paper.
* Casting training-on-noisy-features as a noisy-channel problem, with $f_N^*$ characterized by the noise operator and performance tied to $H(Y|Z)$, gives a precise target for comparison.
* Extensive hyperparameter sweeps and repeated trials (300 per condition) improve the reliability of the conclusions.
* The use of total influence $I[f]$ to quantify simplicity connects to prior theory and cleanly explains why models can perform well on noisy validation yet fail on noiseless evaluation.
* The

* Narrow noise model and data distribution. All inputs are uniformly random bitstrings with iid symmetric bit-flip noise and memoryless corruption. Real text has structured distributions and correlated, non-binary errors (insertions, deletions, paraphrases). The paper acknowledges this but leaves generality uncertain.
* Task scope. Results hinge on Boolean functions; while parity/majority and k-juntas are classic, evidence that the same mechanisms dominate in natural language or code remains in
1. The problem formulation is novel in adding a perspective on robustness. Noise-robust learning is valuable yet relatively underexplored.
2. The paper has sound theoretical analysis, and the theoretical results provide significant insights. It deploys various mathematical tools efficiently, including Boolean analysis, information theory, and learning theory.
3. The experiments, although relatively small-scale, clearly target the conjecture and provide valuable support.
4. The paper i

My main concern is the applicability and scope of this study. The investigated problems (parity and junta) are binary-input problems with rigid mathematical structures. This simplifies the analysis, but it also makes the problems restrictive, because real-world data and noise are much more complicated.
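The reviewers' remarks on total influence $I[f]$ and the noise operator can be made concrete for small functions by brute force. The sketch below is illustrative only and is not taken from the paper. It assumes uniform inputs and independent bit flips with probability `p`, and it takes the optimal predictor on noisy inputs to be the sign of the posterior mean of the target, with ties broken toward +1. The function names and the three example functions are hypothetical choices.

```python
import itertools
import numpy as np

def truth_table(f, k):
    """Evaluate f on all 2^k inputs in {0,1}^k; outputs are expected in {-1, +1}."""
    xs = np.array(list(itertools.product([0, 1], repeat=k)))
    return xs, np.array([f(x) for x in xs])

def total_influence(xs, vals):
    """I[f] = sum over i of Pr_x[f(x) != f(x with bit i flipped)], x uniform."""
    index = {tuple(x): v for x, v in zip(xs, vals)}
    total = 0.0
    for i in range(xs.shape[1]):
        flipped = xs.copy()
        flipped[:, i] ^= 1  # flip coordinate i of every input
        vals_flipped = np.array([index[tuple(x)] for x in flipped])
        total += np.mean(vals != vals_flipped)
    return total

def noisy_optimal(xs, vals, p):
    """Sign of E[f(x) | z], where z is x with each bit flipped w.p. p and x is uniform."""
    k = xs.shape[1]
    # Hamming distance between every noisy input z (rows) and clean input x (columns).
    dist = (xs[:, None, :] != xs[None, :, :]).sum(axis=2)
    weights = (p ** dist) * ((1 - p) ** (k - dist))  # posterior Pr[x | z]
    smoothed = weights @ vals
    return np.where(smoothed >= 0, 1, -1)  # assumption: ties go to +1

parity3 = lambda x: 1 - 2 * (int(x.sum()) % 2)
maj3 = lambda x: 1 if x.sum() >= 2 else -1
and3 = lambda x: -1 if x.sum() == 3 else 1  # -1 only on the all-ones input

xs, parity_vals = truth_table(parity3, 3)
_, maj_vals = truth_table(maj3, 3)
_, and_vals = truth_table(and3, 3)

inf_parity = total_influence(xs, parity_vals)  # 3.0
inf_maj = total_influence(xs, maj_vals)        # 1.5
inf_and = total_influence(xs, and_vals)        # 0.75

# Parity survives smoothing, so the noisy-optimal predictor equals the target.
parity_opt = noisy_optimal(xs, parity_vals, p=0.1)
# With p = 0.25 the AND-like function is smoothed into the constant +1 function.
and_opt = noisy_optimal(xs, and_vals, p=0.25)
inf_and_opt = total_influence(xs, and_opt)     # 0.0
```

In the last example the predictor that is optimal on noisy inputs has strictly lower total influence than the target, which is the kind of situation in which a learner biased toward simple functions could settle on the wrong solution.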
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques
