TL;DR
This paper introduces Lifelong ICL and Task Haystack to evaluate long-context language models' ability to learn and utilize multiple tasks over time, revealing significant challenges and vulnerabilities in current models like GPT-4o.
Contribution
The paper proposes a new evaluation framework, Task Haystack, for diagnosing long-context LMs in Lifelong ICL settings, highlighting their limitations and failure modes.
Findings
GPT-4o fails on 15% of Task Haystack cases
Open-weight models fail up to 61% of the time
Distraction and recency bias degrade model performance
Abstract
We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline. Task Haystack draws inspiration from the widely-adopted "needle-in-a-haystack" (NIAH) evaluation, but presents distinct new challenges. It requires models (1) to utilize the contexts at a deeper level, rather than resorting to simple copying and pasting; (2) to navigate…
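The pass criterion in the abstract (Lifelong ICL accuracy "not significantly worse" than the Single-task ICL baseline) can be sketched as a significance check over repeated runs. The snippet below is an illustrative sketch only. The abstract as shown here does not name the statistical test, so an exact one-sided permutation test is swapped in as an assumption. The function names `not_significantly_worse` and `pass_rate`, the threshold `alpha=0.05`, and the accuracy values in the usage example are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def not_significantly_worse(lifelong_acc, single_acc, alpha=0.05):
    """Return True if Lifelong ICL accuracies are not significantly below
    the Single-task ICL accuracies.

    Assumption: an exact one-sided permutation test on the difference of
    means stands in for whatever test the evaluation suite actually uses.
    """
    pooled = list(single_acc) + list(lifelong_acc)
    n_single = len(single_acc)
    observed = mean(single_acc) - mean(lifelong_acc)

    total = 0
    at_least_as_extreme = 0
    # Enumerate every way to relabel the pooled runs as "single-task".
    for idx in combinations(range(len(pooled)), n_single):
        chosen = set(idx)
        group_a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = mean(group_a) - mean(group_b)
        total += 1
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1

    p_value = at_least_as_extreme / total
    # A small p-value means the drop is unlikely to be chance: a failure.
    return p_value >= alpha


def pass_rate(results, alpha=0.05):
    """Fraction of (lifelong_acc, single_acc) pairs that pass the check."""
    passed = [not_significantly_worse(l, s, alpha) for l, s in results]
    return sum(passed) / len(passed)


# Hypothetical accuracies over five runs each (not data from the paper).
single = [0.90, 0.91, 0.89, 0.92, 0.90]
degraded = [0.60, 0.62, 0.58, 0.61, 0.59]
matched = [0.91, 0.90, 0.92, 0.89, 0.90]
print(pass_rate([(degraded, single), (matched, single)]))
```

Under this sketch, a failure rate such as those listed in the findings would correspond to one minus the pass rate, aggregated over whatever task and setting combinations are evaluated.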
