TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

Qingchen Yu, Shichao Song, Ke Fang, Yunfeng Shi, Zifan Zheng, Hanyu Wang, Simin Niu, Zhiyu Li

arXiv:2410.05262 · cs.CL · October 8, 2024

TL;DR

TurtleBench is a dynamic evaluation benchmark for large language models built from real user guesses collected on an online puzzle platform. It aims to assess reasoning capabilities more reliably than static datasets allow.

Contribution

The paper presents TurtleBench, a novel evaluation framework that leverages real user data to assess LLMs' reasoning in a more realistic and dynamic setting.

Findings

01. OpenAI o1 models did not outperform others in this evaluation.

02. Increasing Chain-of-Thought length may add reasoning benefits but also noise.

03. The TurtleBench dataset includes 1,532 user guesses with correctness annotations.

Abstract

As the application of Large Language Models (LLMs) expands, the demand for reliable evaluations increases. Existing LLM evaluation benchmarks primarily rely on static datasets, making it challenging to assess model performance in dynamic interactions with users. Moreover, these benchmarks often depend on specific background knowledge, complicating the measurement of a model's logical reasoning capabilities. Other dynamic evaluation methods based on strong models or manual efforts may introduce biases and incur high costs and time demands, hindering large-scale application. To address these issues, we propose TurtleBench. TurtleBench collects real user guesses from our online Turtle Soup Puzzle platform that we developed. This approach allows for the relatively dynamic generation of evaluation datasets, mitigating the risk of model cheating while aligning assessments more closely with…
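The evaluation idea described above, scoring a model's verdict on each user guess against its correctness annotation, can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the `JudgedGuess` record, its field names, and the sample guesses are all hypothetical and are not drawn from the TurtleBench dataset.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JudgedGuess:
    """One user guess with its annotation and a model verdict (hypothetical schema)."""
    guess: str        # text of the user's guess
    gold: bool        # annotated correctness of the guess
    predicted: bool   # the model's judgment of the same guess


def accuracy(records: Sequence[JudgedGuess]) -> float:
    """Fraction of guesses where the model's verdict matches the annotation."""
    if not records:
        raise ValueError("no records to score")
    hits = sum(r.gold == r.predicted for r in records)
    return hits / len(records)


# Made-up records for illustration only; not taken from the real dataset.
sample = [
    JudgedGuess("The man was a lighthouse keeper.", True, True),
    JudgedGuess("The soup was poisoned.", False, True),
    JudgedGuess("He had been shipwrecked.", True, True),
    JudgedGuess("It happened in winter.", False, False),
]

print(f"accuracy = {accuracy(sample):.2f}")
```

Plain accuracy is used here because each guess carries a binary correctness label; other metrics could be substituted in the same place.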

Code & Models

Repositories

mazzzystar/TurtleBench
Official

Datasets

Duguce/TurtleBench1.5k
dataset · 26 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques