Can In-context Learning Really Generalize to Out-of-distribution Tasks?
Qixun Wang, Yifei Wang, Yisen Wang, Xianghua Ying

TL;DR
This paper investigates whether in-context learning (ICL) can generalize to out-of-distribution tasks, revealing limitations and the underlying mechanisms through synthetic experiments and theoretical analysis.
Contribution
The study provides new insights into ICL's capabilities and limitations on OOD tasks, including the low-test-error preference and the impact of distributional shifts.
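The low-test-error preference named above can be illustrated with a toy selection rule. This is a minimal sketch, not the paper's method: it assumes the preference amounts to choosing, from a small set of pretraining functions, the one with the lowest error on the in-context examples. The candidate weights, the OOD weight vector `w_ood`, and the helper `select_pretraining_function` are all hypothetical.

```python
import numpy as np

# Hypothetical pretraining family: three fixed linear tasks f_k(x) = w_k . x.
candidates = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

def select_pretraining_function(xs, ys, candidates):
    """Return the index of the candidate with the lowest mean squared
    error on the in-context examples, plus all candidate errors."""
    errors = [float(np.mean((xs @ w - ys) ** 2)) for w in candidates]
    return int(np.argmin(errors)), errors

# Assumed OOD task: a weight vector outside the family but close to candidate 1.
w_ood = np.array([0.1, 0.9, 0.0, 0.1])

rng = np.random.default_rng(0)
xs = rng.normal(size=(32, 4))   # in-context inputs
ys = xs @ w_ood                 # in-context labels from the OOD task

best, errors = select_pretraining_function(xs, ys, candidates)
print("selected candidate:", best)
print("context errors:", errors)
```

Under these assumptions the rule picks the candidate nearest to the OOD task instead of recovering the OOD task itself, which is the kind of behaviour the summary describes.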
Findings
Transformers struggle to learn OOD functions via ICL.
ICL performance aligns with pretraining hypothesis space optimization.
ICL's ability to learn unseen labels is limited by distributional shifts.
Abstract
In this work, we explore the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To achieve this, we conduct synthetic experiments where the objective is to learn OOD mathematical functions through ICL using a GPT-2 model. We reveal that Transformers may struggle to learn OOD task functions through ICL. Specifically, ICL performance resembles implementing a function within the pretraining hypothesis space and optimizing it with gradient descent based on the in-context examples. Additionally, we investigate ICL's well-documented ability to learn unseen abstract labels in context. We demonstrate that such ability only manifests in the scenarios without distributional shifts and, therefore, may not serve as evidence of new-task-learning ability. Furthermore, we assess ICL's performance on OOD tasks when the model is…
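The abstract's description of ICL as "implementing a function within the pretraining hypothesis space and optimizing it with gradient descent" can be made concrete with a small numerical sketch. It is an illustration only: the linear hypothesis class, the quadratic OOD target, and the helper `fit_in_hypothesis_space` are assumptions chosen for simplicity, and no Transformer is involved.

```python
import numpy as np

def fit_in_hypothesis_space(xs, ys, steps=500, lr=0.05):
    """Gradient descent on mean squared error, restricted to the
    linear class f(x) = w . x (the assumed pretraining hypothesis space)."""
    w = np.zeros(xs.shape[1])
    for _ in range(steps):
        grad = 2.0 * xs.T @ (xs @ w - ys) / len(xs)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
dim = 5
w_true = np.ones(dim)                    # hypothetical task parameter
xs_ctx = rng.normal(size=(40, dim))      # in-context inputs
xs_qry = rng.normal(size=(200, dim))     # held-out query inputs

# In-distribution task: the target is linear, so it lies inside the class.
w_lin = fit_in_hypothesis_space(xs_ctx, xs_ctx @ w_true)
err_in = float(np.mean((xs_qry @ w_lin - xs_qry @ w_true) ** 2))

# OOD task: the target is quadratic, so no linear function can match it.
w_quad = fit_in_hypothesis_space(xs_ctx, (xs_ctx @ w_true) ** 2)
err_ood = float(np.mean((xs_qry @ w_quad - (xs_qry @ w_true) ** 2) ** 2))

print(f"query error, linear target:    {err_in:.2e}")
print(f"query error, quadratic target: {err_ood:.2e}")
```

In this sketch the fitted function stays linear whatever the context labels are, so the query error is near zero for the linear target and stays large for the quadratic one.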
Peer Reviews
Decision: ICLR 2025 Poster
1. Novel theoretical insights: The paper provides theoretical analysis of the "low-test-error preference" mechanism in ICL, explaining how models select which pretraining function to implement when handling OOD tasks.
2. I found the experiments on the abstract label classification interesting. The paper provides a fresh perspective on how models handle abstract label classification, demonstrating it's more about retrieval capabilities than true OOD learning.
3. Besides GPT-2, they use Llama2-7b
1. Presentation issues.
(a) In Eq.1, why is the prediction supervised by f(x_i)? Shouldn't it be f(x_{I+1})?
(b) In Fig.2, what is y_i id? Is it the same as I_{y_i}? If yes, why not keep using the same notation? If no, please clarify.
(c) Please correct the way you use quotation marks. See examples in the caption of Fig.3.
2. Questions on the training set. The retrieval design in Sec.3.1 is interesting. When the training range is larger (say yi Id in [50, 455]), the test-time performance is
**Originality**: This paper adds to the literature on understanding ICL in transformers on synthetic tasks by empirically and theoretically characterizing their behavior on OOD tasks. To my knowledge, this behavior on OOD tasks has not been carefully demonstrated yet.
**Quality**: The paper does thorough, well-thought-out experiments and provides new theoretical insight.
**Clarity**: The paper is clearly written, well structured and easy to understand.
**Significance**: The question asked by
- In Section 2, it would be helpful to further clarify what is existing knowledge and what is a new contribution. In fact, I think the presentation would be cleaner if Section 2 and Section 4 were combined into a unified presentation.
- The explanation of the retrieval version of the Llama-2-7B task was hard to understand; it would be helpful to clarify that section.
- The conclusion that LLMs cannot learn OOD tasks because their ability to learn abstract concepts from context can be attributed to
- Characterizing the out-of-distribution generalization ability of in-context learning is an important problem with potential impact.
- The experiments are well-designed to investigate the respective hypotheses.
- The paper does a good job at referencing and contextualizing existing related work.
- The presentation of the paper could be improved. In particular, I found Section 4.2 hard to follow (see questions), and with the many different experiments often following prior work, it was difficult to identify which findings were novel and specific to this paper.
- The theoretical results borrow heavily from [1] (but the paper makes this very transparent).
[1] Lin, Ziqian, and Kangwook Lee. "Dual operating modes of in-context learning." arXiv preprint arXiv:2402.18819 (2024).
Taxonomy
Topics: Data Stream Mining Techniques · Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Multi-Head Attention · Dense Connections · Residual Connection · Dropout · Layer Normalization · Linear Warmup With Cosine Annealing
