TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Qingchen Yu, Shichao Song, Ke Fang, Yunfeng Shi, Zifan Zheng, Hanyu Wang, Simin Niu, Zhiyu Li

TL;DR
TurtleBench is a dynamic evaluation benchmark for large language models, built from real user guesses collected on an online Turtle Soup puzzle platform. It aims to assess reasoning capabilities more reliably than static datasets do.
Contribution
The paper presents TurtleBench, a novel evaluation framework that leverages real user data to assess LLMs' reasoning in a more realistic and dynamic setting.
Findings
OpenAI o1 models did not outperform others in this evaluation.
Increasing Chain-of-Thought length may add reasoning benefits but also noise.
The TurtleBench dataset includes 1,532 user guesses, each annotated for correctness.
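To make the evaluation setup concrete, here is a minimal sketch of how a model's judgments on annotated guesses could be scored. It is not the authors' code: the label strings "correct" and "incorrect", the helper names `normalize` and `accuracy`, and the rule that unparseable model output counts as a miss are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def normalize(raw: str) -> Optional[str]:
    """Map a raw model reply to 'correct' or 'incorrect' (assumed labels).

    Returns None when the reply starts with neither label.
    """
    text = raw.strip().lower()
    # Check "incorrect" first for clarity; "incorrect" does not start with "correct".
    if text.startswith("incorrect"):
        return "incorrect"
    if text.startswith("correct"):
        return "correct"
    return None


def accuracy(replies: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of replies whose normalized judgment matches the annotation.

    Unparseable replies count as misses (an assumption of this sketch).
    """
    if len(replies) != len(labels) or not labels:
        raise ValueError("replies and labels must be non-empty and equal in length")
    hits = sum(
        1 for reply, label in zip(replies, labels)
        if normalize(reply) == label.strip().lower()
    )
    return hits / len(labels)


if __name__ == "__main__":
    # Toy data, not drawn from the TurtleBench dataset.
    model_replies = ["Correct", "incorrect.", "I am not sure"]
    annotations = ["correct", "incorrect", "correct"]
    print(f"accuracy = {accuracy(model_replies, annotations):.3f}")
```

Because each guess has a fixed annotated judgment, scoring reduces to a comparison like this one and needs no stronger model or human grader at evaluation time.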
Abstract
As the application of Large Language Models (LLMs) expands, the demand for reliable evaluations increases. Existing LLM evaluation benchmarks primarily rely on static datasets, making it challenging to assess model performance in dynamic interactions with users. Moreover, these benchmarks often depend on specific background knowledge, complicating the measurement of a model's logical reasoning capabilities. Other dynamic evaluation methods based on strong models or manual efforts may introduce biases and incur high costs and time demands, hindering large-scale application. To address these issues, we propose TurtleBench. TurtleBench collects real user guesses from our online Turtle Soup Puzzle platform that we developed. This approach allows for the relatively dynamic generation of evaluation datasets, mitigating the risk of model cheating while aligning assessments more closely with…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
