Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors
Jingquan Yan, Yuwei Miao, Peiran Yu, Junzhou Huang

TL;DR
This paper provides a theoretical analysis of the PCC plateau phenomenon in attention-based regressors, revealing fundamental limitations and proposing a new mechanism, ECA, to surpass these limits and improve correlation.
Contribution
It traces the PCC plateau to two causes, an optimization conflict between MSE and PCC and a limit on model capacity, and introduces ECA to overcome both.
Findings
ECA consistently breaks the PCC plateau across benchmarks.
Theoretical analysis reveals conflicts between MSE and PCC optimization.
Model capacity also limits PCC: softmax attention can only produce outputs inside the convex hull of its inputs, which caps the achievable PCC improvement.
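The convex-hull finding can be illustrated with a minimal sketch. It is not the paper's code: it assumes a single softmax pooling step over scalar values, and the names `attention_pool` and `values` are hypothetical. Because softmax weights are non-negative and sum to one, the pooled output is a convex combination of the inputs and cannot leave their range.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attention_pool(logits, values):
    """Softmax-weighted average of scalar values (a convex combination)."""
    return float(softmax(logits) @ values)

rng = np.random.default_rng(0)
values = np.array([0.2, 0.5, 0.9])  # hypothetical per-token scalar features

# Try many random logit settings, including very peaked ones.
outputs = np.array([
    attention_pool(rng.normal(scale=5.0, size=values.size), values)
    for _ in range(1000)
])

# Every pooled output stays inside [min(values), max(values)],
# up to a small floating-point tolerance.
inside_hull = bool(np.all((outputs >= values.min() - 1e-12) &
                          (outputs <= values.max() + 1e-12)))
print(inside_hull, outputs.min(), outputs.max())
```

Under these assumptions, a feature value outside [0.2, 0.9] cannot be produced by the pooling step alone, whatever the logits are.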
Abstract
Attention-based regression models are often trained by jointly optimizing Mean Squared Error (MSE) loss and Pearson correlation coefficient (PCC) loss, emphasizing the magnitude of errors and the order or shape of targets, respectively. A common but poorly understood phenomenon during training is the PCC plateau: PCC stops improving early in training, even as MSE continues to decrease. We provide the first rigorous theoretical analysis of this behavior, revealing fundamental limitations in both optimization dynamics and model capacity. First, in regard to the flattened PCC curve, we uncover a critical conflict where lowering MSE (magnitude matching) can paradoxically suppress the PCC gradient (shape matching). This issue is exacerbated by the softmax attention mechanism, particularly when the data to be aggregated is highly homogeneous. Second, we identify a limitation in the model…
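The following sketch shows one way to write a joint objective of this kind and why MSE and PCC can move independently. It is not the authors' implementation: the form `MSE + lam * (1 - PCC)` and the weight `lam` are assumptions. It relies on the standard identity MSE = (mean gap)^2 + (std gap)^2 + 2 * std_pred * std_target * (1 - PCC), which may differ in form from the decomposition the paper states.

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def pcc(pred, target):
    """Pearson correlation coefficient between two 1-D arrays."""
    pc = pred - pred.mean()
    tc = target - target.mean()
    return float((pc @ tc) / (np.sqrt(pc @ pc) * np.sqrt(tc @ tc)))

def joint_loss(pred, target, lam=1.0):
    """Assumed joint objective: MSE plus lam * (1 - PCC)."""
    return mse(pred, target) + lam * (1.0 - pcc(pred, target))

def mse_decomposition(pred, target):
    """Split MSE into mean, std, and correlation terms (population std)."""
    mean_term = (pred.mean() - target.mean()) ** 2
    std_term = (pred.std() - target.std()) ** 2
    corr_term = 2.0 * pred.std() * target.std() * (1.0 - pcc(pred, target))
    return float(mean_term), float(std_term), float(corr_term)

rng = np.random.default_rng(0)
target = rng.normal(size=200)
# Hypothetical predictions: partly correlated, but with wrong scale and offset.
pred = 0.1 * (0.6 * target + 0.8 * rng.normal(size=200)) + 3.0

# Matching the target mean and std is a positive affine map of pred,
# so it lowers MSE while leaving PCC exactly where it was.
rescaled = (pred - pred.mean()) / pred.std() * target.std() + target.mean()

print(mse(pred, target), pcc(pred, target))
print(mse(rescaled, target), pcc(rescaled, target))
print(sum(mse_decomposition(pred, target)))  # equals mse(pred, target)
```

The example only demonstrates that MSE can keep falling through magnitude matching while PCC stays flat. It does not reproduce the gradient-level mechanism that the paper analyzes.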
Peer Reviews
Decision: ICLR 2026 Poster
1. The writing is well structured and easy to follow. The theoretical derivations are transparent and carefully presented. 2. The work targets an interesting question: how attention-based regressors behave when trained with both magnitude-based and correlation-based losses. This is a topic that deserves rigorous study and is of broad interest to the community. 3. The authors validate their method across a diverse and well-chosen set of benchmarks, which strongly supports the generality and effectiveness
1. The theoretical analysis is based on a simplified model: a single attention aggregation layer followed by a linear head. However, the models used in practice are significantly more complex. The paper does not discuss how other architectural elements might interact with or alleviate the identified problems. 2. The experiments only validate the downstream effects rather than the mechanisms proposed by the theory. The paper would be more convincing with additional empirical evidence that isolat
Overall, the paper is well written. Here is my understanding of its strengths: 1. The paper identifies and analyzes an interesting “PCC plateau” phenomenon, an observed but under-theorized failure mode in attention-based regression models trained with joint MSE+PCC loss. In my view, because the phenomenon emerges in multiple architectures and datasets, the problem is likely to generalize and will be of interest to both theory-leaning and applied ML audiences. 2. On the theory side
While the paper provides a principled and well-motivated analysis, several aspects could be further clarified or extended: 1. Architectural depth and generality. Most theoretical and synthetic experiments use a one-layer transformer. It remains unclear whether the proposed ECA mechanism yields similar improvements in deeper architectures, where subsequent layers (especially MLPs) could partially overcome the convex-hull limitation. Without results on multi-layer settings, the contribution may a
**Novel topic and clear interpretation:** I like the topic studied in this paper, since learning PCC is an important problem and a theoretical understanding of why the PCC plateau happens is lacking. Moreover, the decomposition in Proposition 2.1 offers a clean interpretation of how matching the mean, std, and correlation affects the MSE. **Clear theoretical analysis:** In Section 2, the authors provide several propositions and theorems to argue why the PCC plateau happens by analyzing the gra
**Missing comparison of gradients between MSE and PCC:** In Section 2.3, the authors present the gradient of PCC w.r.t. the attention logits (see Theorem 2.1), then provide a bound for this gradient in Corollary 2.1, and finally discuss the implications of the bound. It is clear how each term affects the gradient for PCC. However, a corresponding discussion of the bounds for MSE is missing. Since this paper argues that *PCC tends to flatten earlier while MSE continues to decrease*, it is important to discuss the gradi
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Face recognition and analysis
