CL-bench Life: Can Language Models Learn from Real-Life Context?

Shihan Dou; Yujiong Shen; Chenhao Huang; Junjie Ye; Jiayi Chen; Junzhe Wang; Qianyu He; Shichun Liu; Changze Lv; Jiahang Lin; Jiazheng Zhang; Ming Zhang; Shaofan Liu; Tao Ji; Zhangyue Yin; Cheng Zhang; Huaibing Xie; Jianglu Hu; Jingcheng Deng; Lincheng Li; Minda Hu; Shaolei Wang; Syrus Zhao; Weichao Wang; Yan Lei; Yang Liu; Yanling Xiao; Yiting Liu; Zenan Xu; Zhen Guo; Ziliang Zhao; Pluto Zhou; Tao Gui; Qi Zhang; Xuanjing Huang; Yu-Gang Jiang; Di Wang; Shunyu Yao

arXiv:2604.27043 · cs.CL · May 1, 2026


TL;DR

CL-bench Life is a new benchmark that tests whether current language models can learn from complex, messy real-life contexts. The results show that they still struggle, leaving substantial room for improvement.

Contribution

The paper introduces CL-bench Life, a fully human-curated benchmark of 405 context-task pairs and 5,348 verification rubrics for evaluating how well models learn from real-life contexts.

Findings

01

The top-performing model reaches a task-solving rate of only 19.3%.

02

The average task-solving rate across models is 13.8%.

03

Models struggle with messy, fragmented real-life contexts.

Abstract

Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them. To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for strong real-life…
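
To make the relationship between context-task pairs, verification rubrics, and the reported task-solving rate concrete, here is a minimal Python sketch. The visible part of the abstract does not say how rubric outcomes are aggregated into a solved/unsolved verdict, so the strict "all rubrics must pass" rule below is an assumption, and the names `RubricResult`, `TaskResult`, and `task_solving_rate` are hypothetical illustrations rather than the authors' code. Only the counts 405 and 5,348 come from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Outcome of checking one verification rubric (hypothetical structure)."""
    description: str
    passed: bool


@dataclass
class TaskResult:
    """All rubric outcomes for one context-task pair (hypothetical structure)."""
    task_id: str
    rubrics: List[RubricResult]

    def solved(self) -> bool:
        # ASSUMPTION: a task counts as solved only if every rubric passes.
        # The abstract excerpt does not state the aggregation rule.
        return bool(self.rubrics) and all(r.passed for r in self.rubrics)


def task_solving_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks solved under the assumed strict rule."""
    if not results:
        return 0.0
    return sum(t.solved() for t in results) / len(results)


# Counts taken from the abstract.
NUM_TASKS = 405
NUM_RUBRICS = 5348
avg_rubrics_per_task = NUM_RUBRICS / NUM_TASKS  # roughly 13.2 rubrics per task

# Toy data, invented purely for illustration.
toy_results = [
    TaskResult("t1", [RubricResult("names the right person", True),
                      RubricResult("cites the right message", True)]),
    TaskResult("t2", [RubricResult("gets the date right", True),
                      RubricResult("respects stated preference", False)]),
    TaskResult("t3", [RubricResult("finds the shared file", False)]),
]
toy_rate = task_solving_rate(toy_results)  # one of three toy tasks is solved
```

Under a strict rule like this, a task with about 13 rubrics on average fails if the model misses any single one of them, which is one way low solving rates can coexist with many individual rubrics being satisfied. A benchmark could equally use partial credit; this sketch does not claim either choice is the one made in the paper.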
