Rethinking Invariance in In-context Learning
Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang

TL;DR
This paper introduces InvICL, a new method for in-context learning that achieves permutation invariance while maintaining high performance, leading to better generalization across different input lengths.
Contribution
The paper identifies key elements for invariant ICL and proposes InvICL, which outperforms existing methods in benchmarks by balancing invariance and information retention.
Findings
InvICL surpasses previous models on benchmark datasets.
InvICL demonstrates superior generalization across input lengths.
Existing invariant methods often trade off performance for invariance.
Abstract
In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous…
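The order sensitivity described in the abstract can be illustrated with a toy model. The following is a minimal sketch, not the InvICL algorithm: it assumes each context example is a single token, uses no positional encodings, and runs two randomly initialized attention layers. The helper names (`causal_mask`, `bag_mask`, `full_context_mask`, `order_gap`) are hypothetical and do not come from the paper. It compares how much the query's output changes when the context examples are reversed under a causal mask versus two order-agnostic masks.

```python
import numpy as np

def attention_layer(x, mask, w_q, w_k, w_v):
    """Single-head self-attention with a residual connection.
    mask[i, j] is True when token i may attend to token j."""
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ (x @ w_v)

def predict(context, query, mask_fn, params):
    """Stack context tokens and the query, apply every layer, return the query row."""
    x = np.vstack([context, query])
    mask = mask_fn(len(context))
    for w_q, w_k, w_v in params:
        x = attention_layer(x, mask, w_q, w_k, w_v)
    return x[-1]

def causal_mask(n):
    # Auto-regressive: token i sees tokens 0..i, so each example sees only earlier ones.
    return np.tril(np.ones((n + 1, n + 1), dtype=bool))

def bag_mask(n):
    # Each context example sees only itself; the query (last row) sees everything.
    mask = np.eye(n + 1, dtype=bool)
    mask[-1, :] = True
    return mask

def full_context_mask(n):
    # Every context example sees every context example; only the query sees the query.
    mask = np.ones((n + 1, n + 1), dtype=bool)
    mask[:n, -1] = False
    return mask

def order_gap(mask_fn, seed=0, n=5, d=4):
    """Largest absolute change in the query output when the context order is reversed."""
    rng = np.random.default_rng(seed)
    context = rng.normal(size=(n, d))
    query = rng.normal(size=(1, d))
    params = [
        tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
        for _ in range(2)  # two layers: order effects need at least two
    ]
    out = predict(context, query, mask_fn, params)
    out_reversed = predict(context[::-1], query, mask_fn, params)
    return float(np.max(np.abs(out - out_reversed)))

for name, fn in [("causal", causal_mask), ("bag", bag_mask), ("full", full_context_mask)]:
    print(f"{name:>6}: order gap = {order_gap(fn):.3e}")
```

In this toy setting the causal mask should give a clearly nonzero gap, while the other two masks should agree up to floating-point rounding. The bag-style mask gets its invariance by isolating the context examples from one another, whereas the full mask lets them interact. The sketch does not model the information non-leakage property named in the abstract.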
Peer Reviews
Decision: ICLR 2025 Poster
The idea is novel and interesting. The motivation is clear, as InvICL is derived from the three desired properties.
1. The experimental results need more analysis and interpretation. The authors find that InvICL shows better length-generalization capabilities, but the mechanism behind this is unclear to me. 2. More experimental results are needed. Most results do not have a reported std. Besides, it would be beneficial to include a figure of squared-error curves with training epochs on the x-axis. Currently there are only results from 50k and 200k epochs. 3. The results of Prefix ICL for linear regression look a bit odd to me.
- The paper identifies and formalizes three important properties for ICL that weren't previously unified. The authors demonstrate why these properties matter. - The theoretical analysis is simple but straightforward. The authors prove that InvICL approximates standard gradient descent (Theorem 4.1) and show how this leads to better convergence properties compared to other ICL variants. - The experimental results look interesting. The method shows strong performance across multiple settings - sy
- The practical applicability of the method raises some concerns. The paper relies on MetaICL finetuning, which is computationally expensive for modern large language models. I wonder if there are any training-free methods. - The efficiency implications are concerning. Doubling the input sequence length (as shown in Figure 2d) increases memory usage. In Section 5.2, “We find that when the inputs size of the GPT-2 Large model increases from 512 to 1024, the GPU memory overhead increases by 14% (
- The paper presents a clear step up in terms of theoretical ideas as well as empirical evidence of improvement compared to prior works attempting to optimize ICL. - Paper is very well written, clearly presents and distinguishes its contribution. - Although the method requires more computation in theory, the authors achieve parallelism and same order of computation as standard ICL with a smart trick. The authors also talk about the additional memory requirements. [Point being that the paper addr
**The significance of section 4 and section 5.1 is unclear.** Please see this [ICML 2024 paper](https://arxiv.org/abs/2310.08540) that talks about how training transformers with an ICL objective may be incompatible with real ICL in LLMs that do not train explicitly for ICL with a fixed ICL prompt format. - Theorem 4.1 shows that if we put weight matrices in a particular format, we can simulate InvICL with transformers. But, is there reason to believe that trained transformers end up with similar wei
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Machine Learning in Healthcare
