EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Shihan Dou; Ming Zhang; Chenhao Huang; Jiayi Chen; Feng Chen; Shichun Liu; Yan Liu; Chenxiao Liu; Cheng Zhong; Zongzhang Zhang; Tao Gui; Chao Xin; Chengzhi Wei; Lin Yan; Yonghui Wu; Qi Zhang; Xuanjing Huang

arXiv:2506.02672 · cs.CL · October 22, 2025

TL;DR

EvaLearn is a new benchmark that assesses large language models' ability to learn and improve through sequential problem solving across diverse tasks, revealing different learning profiles and capabilities.

Contribution

It introduces EvaLearn, a benchmark of 648 problems across six task types, grouped into 182 sequences that models solve in order, and proposes five automated metrics for evaluating LLMs' learning capability and efficiency, adding a new dimension to model evaluation.

Findings

01

Some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability.

02

Certain models struggle to benefit from prior experience and may even exhibit negative transfer.

03

Model performance varies significantly across tasks and settings.

Abstract

We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while…
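The contrast between parallel and sequential evaluation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `run_sequence` and `run_parallel` helpers, the exact-match check standing in for a judge, the shape of the history passed to the solver, and the `toy_solver` that stands in for an LLM. It is not EvaLearn's implementation and does not reproduce its five metrics.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
Problem = Tuple[str, str]            # (question, reference answer)
Attempt = Tuple[str, str, bool]      # (question, model answer, was it correct)
Solver = Callable[[str, List[Attempt]], str]


def is_correct(answer: str, reference: str) -> bool:
    """Exact-match check; a stand-in for whatever judge a benchmark uses."""
    return answer.strip() == reference.strip()


def run_sequence(problems: List[Problem], solver: Solver) -> List[bool]:
    """Sequential setting: each problem sees all earlier attempts and verdicts."""
    history: List[Attempt] = []
    results: List[bool] = []
    for question, reference in problems:
        answer = solver(question, history)
        correct = is_correct(answer, reference)
        history.append((question, answer, correct))
        results.append(correct)
    return results


def run_parallel(problems: List[Problem], solver: Solver) -> List[bool]:
    """Parallel setting: every problem is solved with no prior experience."""
    return [is_correct(solver(q, []), ref) for q, ref in problems]


def toy_solver(question: str, history: List[Attempt]) -> str:
    """Toy stand-in for an LLM on a 'double the number' task.

    With no experience it echoes the input (wrong); once it has any
    prior attempt to learn from, it applies the doubling rule.
    """
    n = int(question)
    return str(n * 2) if history else str(n)


problems = [("2", "4"), ("3", "6"), ("5", "10")]
print(run_sequence(problems, toy_solver))  # [False, True, True]
print(run_parallel(problems, toy_solver))  # [False, False, False]
```

In this toy setup the same solver scores differently in the two settings only because the sequential loop passes accumulated experience forward, which is the property the benchmark is designed to measure.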

Videos

EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving · SlidesLive

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques