Memorization in In-Context Learning

Shahriar Golchin; Mihai Surdeanu; Steven Bethard; Eduardo Blanco; Ellen Riloff

arXiv:2408.11546·cs.CL·April 7, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates how in-context learning (ICL) causes large language models to surface memorized training data, revealing a strong correlation between memorization and improved performance across different ICL regimes.

Contribution

First to analyze the role of memorization in ICL, showing its impact on performance and highlighting the importance of memorized data in few-shot learning.

Findings

01

ICL significantly surfaces memorization compared to zero-shot learning.

02

Demonstrations without labels effectively surface memorization.

03

High levels of memorization correlate with improved ICL performance.

Abstract

In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind this performance improvement remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance on downstream tasks across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when…
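The two quantities the abstract relates, a memorization level and its correlation with task performance, can be illustrated with a minimal sketch. The helper names `memorization_rate` and `pearson` are hypothetical, and exact string match after whitespace normalization is an assumed stand-in: the abstract does not state the paper's actual matching criterion. No numbers from the paper are used here.

```python
from math import sqrt


def memorization_rate(completions, references):
    """Fraction of model completions that reproduce their reference text.

    Assumption: "memorized" means an exact match after collapsing whitespace.
    This is an illustrative criterion, not necessarily the paper's.
    """
    if len(completions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")

    def normalize(text):
        return " ".join(text.split())

    hits = sum(normalize(c) == normalize(r) for c, r in zip(completions, references))
    return hits / len(references)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

In use, one would compute one memorization rate and one performance score per ICL regime (zero-shot, few-shot, many-shot) and pass the two resulting lists to `pearson`. The threshold of about 40% mentioned in the abstract would then be a value of `memorization_rate` around 0.4.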

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

This paper proposes a novel approach to quantify the level of memorization within ICL, contributing to a study in understanding the behavior of LLMs.

Weaknesses

1. The paper presents findings that are already well-known in the field of machine learning: models perform better when they memorize training data, which is often referred to as "data leakage." It is unclear whether the primary motivation of this study is to investigate the relationship between ICL and memorization or to propose a method for quantifying memorization.
2. The paper claims to explore the "correlation between memorization and performance on downstream tasks" (lines 13-14). However

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. The paper is the first to systematically examine the relationship between ICL and memorization in LLMs, providing new insights into how memorized knowledge influences ICL performance.
2. The study uses a detailed approach to measure memorization across multiple settings (full information, segment pairs and labels, and only segment pairs), allowing for a granular analysis of which prompt elements drive memorization in ICL.
3. The paper demonstrates a robust correlation between memorization and

Weaknesses

1. While the experiments are thorough, they are conducted in relatively simple datasets, limiting the paper’s ability to generalize findings to more complex, real-world tasks (e.g., legal, medical datasets).
2. The study does not address potential challenges in handling longer contexts, which are often needed in real-world applications and may limit the practicality of the proposed memorization detection method in large-scale LLMs.
3. While the paper successfully identifies memorization as a fac

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper is very well written. It is clear and easy to follow, and the experimental setup is also very intuitive. The mechanism by which ICL works is a very relevant and important question.

Weaknesses

The paper only experiments on GPT-4. The authors claim that it is the only LLM that fulfills their criteria, but this is somewhat hard to believe, especially given the existence of long-context open-source models. The authors claim that they do not have the resources to run these experiments on, e.g., Llama 3 or some other long-context open-source model that fulfills their criteria, but I believe it would strengthen the paper considerably to have more than a single model for testing. It is not clear wh

Taxonomy

Topics: Educational Tools and Methods · Multimodal Machine Learning Applications · Speech and dialogue systems