In-Context Learning in Linear vs. Quadratic Attention Models: An Empirical Study on Regression Tasks
Ayush Goel, Arjun Kohli, Sarvagya Somvanshi

TL;DR
This paper empirically compares the in-context learning (ICL) capabilities of linear and quadratic attention models on linear regression tasks, examining prediction error (MSE), convergence, generalization, and the effect of model depth.
Contribution
It provides a comparative empirical analysis of linear versus quadratic attention mechanisms in in-context learning for regression tasks, highlighting their similarities and limitations.
Findings
Linear and quadratic attention models show similar ICL performance on regression.
Increasing model depth impacts ICL performance differently across architectures.
Linear attention models have limitations compared to quadratic attention in this setting.
Abstract
Recent work has demonstrated that transformers and linear attention models can perform in-context learning (ICL) on simple function classes, such as linear regression. In this paper, we empirically study how these two attention mechanisms differ in their ICL behavior on the canonical linear-regression task of Garg et al. We evaluate learning quality (MSE), convergence, and generalization behavior of each architecture. We also analyze how increasing model depth affects ICL performance. Our results illustrate both the similarities and limitations of linear attention relative to quadratic attention in this setting.
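To make the setting concrete, here is a minimal NumPy sketch of the kind of task and attention layers the abstract refers to. It is an illustration under stated assumptions, not the authors' code: the function names (`sample_prompt`, `softmax_attention`, `linear_attention`, `least_squares_mse`) are hypothetical, the prompt is assumed to be noiseless with Gaussian inputs and weights, and the `1/n` scaling in the linear layer is one common choice among several. The least-squares fit is included only as a reference point against which an in-context learner's MSE is often compared.

```python
import numpy as np

def sample_prompt(n_points, dim, rng):
    """Sample one in-context regression prompt with y_i = w . x_i.

    Assumption: Gaussian inputs and weights, no label noise.
    """
    w = rng.standard_normal(dim)
    xs = rng.standard_normal((n_points, dim))
    ys = xs @ w
    return xs, ys, w

def softmax_attention(q, k, v):
    """Standard (quadratic) attention: builds the full n x n score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    """Attention without the softmax.

    Because (q k^T) v == q (k^T v), the n x n matrix is never formed and
    the cost grows linearly with sequence length. The 1/n scaling is an
    assumption made here for illustration.
    """
    return q @ (k.T @ v) / k.shape[0]

def least_squares_mse(xs, ys, n_context):
    """Fit least squares on the first n_context pairs, score on the rest."""
    w_hat, *_ = np.linalg.lstsq(xs[:n_context], ys[:n_context], rcond=None)
    pred = xs[n_context:] @ w_hat
    return float(np.mean((pred - ys[n_context:]) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs, ys, w = sample_prompt(n_points=40, dim=5, rng=rng)
    print("least-squares query MSE:", least_squares_mse(xs, ys, n_context=20))
    out_soft = softmax_attention(xs, xs, xs)
    out_lin = linear_attention(xs, xs, xs)
    print("output shapes:", out_soft.shape, out_lin.shape)
```

In a study like this one, a trained model's prediction for a query input would be scored with the same mean squared error, typically as a function of the number of in-context examples.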
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition · Face Recognition and Analysis
