On Many-Shot In-Context Learning for Long-Context Evaluation

Kaijian Zou; Muhammad Khalifa; Lu Wang

arXiv:2411.07130 · cs.CL · June 13, 2025

TL;DR

This paper investigates many-shot in-context learning as a way to evaluate long-context language models, and proposes new metrics and a benchmark that measure models' retrieval and comprehension capabilities across different tasks.

Contribution

The paper introduces a new benchmark, MANYICLBENCH, along with metrics that categorize ICL tasks as retrieval-based or comprehension-based, and uses them to evaluate 12 models on contexts of up to 64k tokens.

Findings

01

Similar-sample learning (SSL) tasks benefit from retrieving the most similar demonstrations.

02

All-sample learning (ASL) tasks require understanding all demonstrations, and performance on them drops at 16k tokens.

03

State-of-the-art models perform well on SSL tasks but struggle with ASL tasks at long contexts.

Abstract

Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs? We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding. We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (SSL): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (ASL): tasks that…
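To make the SSL/ASL distinction concrete, here is a minimal Python sketch of the two ways a many-shot prompt can be assembled: from only the demonstrations most similar to the query (the retrieval view behind SSL), or from every demonstration (the global view behind ASL). This is an illustration under stated assumptions, not the paper's metric or code: the bag-of-words cosine similarity, the toy demonstrations, and the helper names `select_similar` and `build_prompt` are all hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words token counts (a hypothetical stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_similar(demos, query, k):
    """Return the k demonstrations whose inputs are most similar to the query."""
    q = bow(query)
    ranked = sorted(demos, key=lambda d: cosine(bow(d[0]), q), reverse=True)
    return ranked[:k]


def build_prompt(demos, query):
    """Format (input, label) demonstrations followed by the unanswered query."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


# Toy demonstrations, invented for illustration only.
demos = [
    ("the movie was wonderful", "positive"),
    ("the movie was terrible", "negative"),
    ("book a flight to paris", "travel"),
    ("cancel my hotel reservation", "travel"),
]
query = "the movie was great"

# Retrieval-style prompt: only the two nearest demonstrations.
retrieved_prompt = build_prompt(select_similar(demos, query, k=2), query)

# Full-context prompt: every demonstration is included.
full_prompt = build_prompt(demos, query)
```

If a task scores about as well with `retrieved_prompt` as with `full_prompt`, that suggests retrieval-like behavior is enough for it; a task that needs `full_prompt` is closer to the all-sample setting described above.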

Code & Models

Repositories

launchnlp/ManyICLBench
Official

Datasets

launch/ManyICLBench
dataset · 501 downloads

Videos

On Many-Shot In-Context Learning for Long-Context Evaluation · Underline

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning