Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack

Xiaoyue Xu; Qinyuan Ye; Xiang Ren

arXiv:2407.16695·cs.CL·December 4, 2024

TL;DR

This paper introduces Lifelong ICL and Task Haystack to evaluate long-context language models' ability to learn and utilize multiple tasks over time, revealing significant challenges and vulnerabilities in current models like GPT-4o.

Contribution

The paper proposes a new evaluation framework, Task Haystack, for diagnosing long-context LMs in Lifelong ICL settings, highlighting their limitations and failure modes.

Findings

01

GPT-4o fails in 15% of Task Haystack cases

02

Open-weight models fail up to 61% of the time

03

Distraction and recency bias affect model performance

Abstract

We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline. Task Haystack draws inspiration from the widely adopted "needle-in-a-haystack" (NIAH) evaluation, but presents distinct new challenges. It requires models (1) to utilize the contexts at a deeper level, rather than resorting to simple copying and pasting; (2) to navigate…
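The pass criterion described above (Lifelong ICL accuracy "not significantly worse" than the Single-task ICL baseline) can be illustrated with a small sketch. The excerpt does not say which significance test the authors use, so this example substitutes an exact one-sided paired sign-flip test. The function names, the `alpha` threshold, and the accuracy numbers are all hypothetical and are not taken from the paper or its repository.

```python
from itertools import product


def significantly_worse(lifelong, single, alpha=0.05):
    """Exact one-sided paired sign-flip test (an assumed stand-in test).

    `lifelong` and `single` are per-run accuracies for the same task,
    paired by run. Returns True if Lifelong ICL is significantly worse.
    """
    diffs = [s - l for s, l in zip(single, lifelong)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Under the null hypothesis each paired difference is equally likely
    # to carry either sign, so enumerate all 2**n sign assignments.
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        total += 1
        flipped_mean = sum(sign * d for sign, d in zip(signs, diffs)) / n
        if flipped_mean >= observed - 1e-12:  # tolerance for float noise
            at_least_as_extreme += 1

    p_value = at_least_as_extreme / total
    return p_value < alpha


def pass_rate(results, alpha=0.05):
    """Fraction of tasks whose Lifelong ICL accuracy is not significantly worse."""
    passed = [
        not significantly_worse(r["lifelong"], r["single"], alpha)
        for r in results.values()
    ]
    return sum(passed) / len(passed)


# Made-up accuracies over five paired runs per task, for illustration only.
example_results = {
    "task_a": {
        "single": [0.80, 0.82, 0.81, 0.79, 0.83],
        "lifelong": [0.70, 0.72, 0.69, 0.71, 0.73],
    },
    "task_b": {
        "single": [0.60, 0.62, 0.61, 0.59, 0.60],
        "lifelong": [0.61, 0.60, 0.62, 0.59, 0.61],
    },
}

print(pass_rate(example_results))
```

With five paired runs the smallest achievable p-value is 1/32, so a consistent drop on every run is enough to fail a task at `alpha=0.05`. A different test or more runs would change which tasks count as failures.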

Code & Models

Repositories

ink-usc/lifelong-icl · Official

Videos

Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack · SlidesLive