CL-bench: A Benchmark for Context Learning

Shihan Dou; Ming Zhang; Zhangyue Yin; Chenhao Huang; Yujiong Shen; Junzhe Wang; Jiayi Chen; Yuchen Ni; Junjie Ye; Cheng Zhang; Huaibing Xie; Jianglu Hu; Shaolei Wang; Weichao Wang; Yanling Xiao; Yiting Liu; Zenan Xu; Zhen Guo; Pluto Zhou; Tao Gui; Zuxuan Wu; Xipeng Qiu; Qi Zhang; Xuanjing Huang; Yu-Gang Jiang; Di Wang; Shunyu Yao

arXiv:2602.03587·cs.CL·February 4, 2026

TL;DR

CL-bench is a comprehensive benchmark designed to evaluate language models' ability to learn from complex, task-specific contexts, highlighting current limitations and guiding future improvements for real-world applications.

Contribution

Introduces CL-bench, a large-scale benchmark with complex contexts and tasks to assess models' context learning capabilities beyond pre-training knowledge.

Findings

1. Current models solve only 17.2% of tasks on average.

2. Even GPT-5.1 solves only 23.7% of tasks.

3. Models struggle with learning from complex, real-world contexts.

Abstract

Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but that has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived…
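
To give a sense of the benchmark's scale, the counts quoted in the abstract can be turned into rough per-context and per-task averages. The following is a minimal sketch using only those three figures; the variable and function names are illustrative, and the averages are simple ratios that say nothing about how tasks or rubrics are actually distributed across contexts.

```python
# Counts as stated in the abstract.
NUM_CONTEXTS = 500
NUM_TASKS = 1_899
NUM_RUBRICS = 31_607


def average(total: int, units: int) -> float:
    """Return the mean number of `total` items per unit (hypothetical helper)."""
    return total / units


# Simple ratios; the real per-context and per-task counts may vary widely.
tasks_per_context = average(NUM_TASKS, NUM_CONTEXTS)
rubrics_per_task = average(NUM_RUBRICS, NUM_TASKS)
rubrics_per_context = average(NUM_RUBRICS, NUM_CONTEXTS)

print(f"tasks per context:   {tasks_per_context:.2f}")    # 3.80
print(f"rubrics per task:    {rubrics_per_task:.2f}")     # 16.64
print(f"rubrics per context: {rubrics_per_context:.2f}")  # 63.21
```

On these figures, a context carries roughly four tasks on average, and each task is checked against roughly seventeen rubrics.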

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications