Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions
Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

TL;DR
This paper introduces the Learning-to-Context Slope (LCS), a new metric for evaluating in-context learning effectiveness in large language models that overcomes limitations of performance-based metrics by capturing continuous learning signals and attributing failures.
Contribution
The paper proposes LCS, a novel, reliable, and data-efficient metric that better assesses ICL effectiveness by modeling the relationship between learning gain and contextual relevance.
Findings
LCS correlates strongly with actual performance improvements.
LCS reliably indicates ICL effectiveness in data-scarce scenarios.
LCS helps identify model capabilities critical for successful ICL.
Abstract
In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures…
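As a rough illustration of the idea in the abstract, the sketch below computes a per-example learning gain (zero-shot loss minus loss with demonstrations) and fits a slope against a contextual-relevance score. The ordinary least-squares fit, the function names, and the toy numbers are assumptions made for illustration; the paper's actual estimator may differ.

```python
from statistics import fmean


def learning_gain(zero_shot_loss, icl_loss):
    """Loss decrease attributable to demonstrations, one value per example."""
    return [z - c for z, c in zip(zero_shot_loss, icl_loss)]


def fit_slope(relevance, gain):
    """Ordinary least-squares slope of gain regressed on relevance.

    Assumption: a simple linear fit stands in for the paper's estimator.
    """
    if len(relevance) != len(gain) or len(relevance) < 2:
        raise ValueError("need two or more paired observations")
    mean_r, mean_g = fmean(relevance), fmean(gain)
    var_r = sum((r - mean_r) ** 2 for r in relevance)
    if var_r == 0:
        raise ValueError("relevance scores must not all be identical")
    cov = sum((r - mean_r) * (g - mean_g) for r, g in zip(relevance, gain))
    return cov / var_r


# Hypothetical toy values: loss falls faster as demonstrations become more relevant.
relevance = [0.0, 1.0, 2.0, 3.0]
zero_shot = [2.0, 2.0, 2.0, 2.0]
with_demos = [2.0, 1.5, 1.0, 0.5]

slope = fit_slope(relevance, learning_gain(zero_shot, with_demos))
print(f"slope = {slope:.2f}")
```

Because the slope is computed from losses rather than exact-match outcomes, it can move even when every output is still wrong, which is the property the abstract highlights.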
Peer Reviews
Decision · ICLR 2026 Conference Desk Rejected Submission
The proposed LCS metric is a novel and potentially useful idea. Because it goes beyond performance-based evaluation and offers a continuous, loss-based measure that remains informative even when model outputs are incorrect, the metric allows for a more refined analysis of ICL capabilities. Moreover, LCS works without labeled data through synthetic evaluation, making it applicable when data is limited. The experiments across multiple datasets and models validate its robust correlation with actua
LCS primarily measures correlation, not causation: a high slope indicates an association between loss reduction and context relevance but does not directly prove that demonstrations cause better learning. In addition, the theory and the metric's interpretability depend on strong modeling assumptions, and the experiments could be expanded to a broader range of model families.
- Clear motivation. Performance-based evaluation can be noisy and hard to attribute. LCS targets the underlying loss dynamics.
- Simple mathematical core. The link between learning gain and contextual relevance is expressed as a slope that practitioners can estimate with standard scoring. Because it uses token-level loss, LCS can detect progress that accuracy metrics miss.
- Although interactions within D cannot be measured precisely and a microscopic view is not provided as noted in weakness on
- LCS is a set-level, first-order summary. It does not model higher-order interactions between multiple demonstrations, such as redundancy, synergy, order, and position effects. The paper treats k-shot by splitting into k points, which ignores interactions.
- Assumptions in theory. Theorems rely on modeling choices and oracle versus empirical probabilities. They are valid under stated conditions, but the practical gap due to estimation error and prompt format choices remains.
- Computational cost
- Originality: Framing ICL effectiveness as a slope between two “information-like” quantities is novel and produces a continuous signal where EM/Pass@1 are binary/noisy.
- Clarity: The empirical estimator (Eq. 3) and plotting protocol are easy to reproduce conceptually, and the paper is upfront that $r_{\hat{p}} = \hat{p}(D \mid Q) / \hat{p}(X \mid Q)$ is errorful. (However, see weaknesses and questions for multiple points of unclear presentation.)
- Quality: Broad evaluation across 8 datasets on several
1. **Foundational mismatch between Eq. (2) and autoregressive LLMs**: The proof underlying Eq. (2) treats conditionals like $p(D\mid Q;X)$ and $p(D\mid Q)$ as if the order of variables in the conditioning can be freely rearranged; this yields a neat decomposition of loss into a “zero-shot” term and a “demo-dependent” term (Eq. 2). Yet in an autoregressive LLM, $\log p(x_t \mid Q, D, x_{<t})$ depends on the exact **prompt order** $[Q, D, x_{<t}]$. Specifically, for an AR LM, $p(\cdot\mid\cdot)$ i
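The concern in point 1 can be made concrete with a standard Bayes-rule identity. This is a sketch consistent with the reviewer's description of a "zero-shot" plus "demo-dependent" split, not a reproduction of the paper's Eq. (2), whose exact form is not shown here. The subscript $\mathrm{AR}$ and the reading of $X$ as the target output are assumptions for illustration.

```latex
% Assumption: X is the target output, Q the input, D the demonstrations.
% Line 1 is Bayes' rule and holds for any joint distribution over (Q, D, X).
% Line 2 is the chain-rule factorization an autoregressive LM actually computes
% for the fixed prompt order [Q, D, X].
\begin{align}
-\log p(X \mid Q, D)
  &= \underbrace{-\log p(X \mid Q)}_{\text{zero-shot term}}
     \;-\; \underbrace{\log \frac{p(D \mid Q, X)}{p(D \mid Q)}}_{\text{demo-dependent term}} \\
\log p_{\mathrm{AR}}(X \mid [Q, D])
  &= \sum_{t} \log p_{\mathrm{AR}}(x_t \mid Q, D, x_{<t})
\end{align}
```

Scoring $p(D \mid Q, X)$ with an autoregressive model means feeding the prompt in the order $[Q, X, D]$, which is a different computation from the $[Q, D, X]$ order used at inference. Nothing guarantees that these order-specific estimates satisfy the first identity exactly, which appears to be the gap the reviewer is pointing at.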
- The authors provide a mathematical formulation of LCS, along with an intuitive analysis of its components.
- LCS shows promising empirical results compared to existing metrics.
- Although the authors provide a mathematical formulation of LCS, the motivation for choosing LCS to overcome the problems of existing evaluation metrics is not clear.
- The purpose of Section 2.3 is not clear. Since they empirically show that LCS can be effective even with synthetic data, I do not think this section is necessary. To me, the discussion of real large language models, which they placed in the appendix, might have been more informative.
- Organization of Section 3 is not optimal. The limitation of
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
