ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

Shulin Huang; Linyi Yang; Yan Song; Shuang Chen; Leyang Cui; Ziyu Wan; Qingcheng Zeng; Ying Wen; Kun Shao; Weinan Zhang; Jun Wang; Yue Zhang

arXiv:2502.16268·cs.CL·February 25, 2025

Open Access

TL;DR

ThinkBench introduces a dynamic evaluation framework for large language models that effectively assesses reasoning capabilities by generating out-of-distribution datasets, addressing issues of data contamination and answer leakage.

Contribution

This work presents a novel dynamic data generation method and a comprehensive evaluation framework for robustly assessing LLM reasoning, unifying reasoning and non-reasoning model evaluation.

Findings

01. Most LLMs lack robustness in reasoning tasks.

02. Dynamic OOD datasets reduce data leakage effects.

03. Performance varies significantly across models.

Abstract

Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset of 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning and non-reasoning models. We evaluate 16 LLMs and 4 process reward models (PRMs) under identical experimental conditions and show that the performance of most LLMs is far from robust and that they exhibit some degree of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination.

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning and Algorithms · Business Process Modeling and Analysis