TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation

Jialin Ouyang

arXiv:2502.13442 · cs.CL · May 21, 2025

TL;DR

TreeCut is a synthetic dataset designed to evaluate large language models' ability to recognize unanswerable math problems, revealing their tendency to hallucinate confidently on such questions.

Contribution

We introduce TreeCut, a novel dataset that systematically generates unanswerable math problems to evaluate LLM hallucinations, highlighting persistent challenges in model reasoning.

Findings

01

GPT-4o and o3-mini hallucinate on 64% and 44% of unanswerable problems, respectively, in their worst-case zero-shot scenarios

02

Deeper and more complex trees increase hallucination likelihood

03

Removing necessary conditions near the middle of a question path raises hallucination rates

Abstract

Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that can systematically generate an unlimited number of unanswerable math word problems and their answerable counterparts, by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under the zero-shot setting. Further analysis highlights that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of…
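
The tree-and-cut idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' generator: the `Condition` class, the `solve` and `cut` helpers, the linear form `child = factor * parent + offset`, and the menu-item names are all assumptions made up for illustration. Each condition is one edge of the tree, the question asks for the value of a target variable, and removing any condition on the path from the root to that target leaves the question unanswerable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Condition:
    """Hypothetical edge of the question tree: child = factor * parent + offset.

    parent is None when the condition states the child's value directly,
    in which case the value is simply `offset`.
    """
    child: str
    parent: Optional[str]
    factor: int
    offset: int


def solve(conditions: List[Condition], target: str) -> Optional[int]:
    """Return the target's value, or None if the conditions cannot determine it."""
    by_child = {c.child: c for c in conditions}
    chain = []
    name: Optional[str] = target
    # Walk from the target up toward the root, collecting the needed conditions.
    while name is not None:
        cond = by_child.get(name)
        if cond is None:
            return None  # a necessary condition is missing: unanswerable
        chain.append(cond)
        name = cond.parent
    # Evaluate from the root back down to the target.
    value = 0
    for cond in reversed(chain):
        value = cond.factor * value + cond.offset
    return value


def cut(conditions: List[Condition], child: str) -> List[Condition]:
    """Remove the condition that defines `child`, leaving the others intact."""
    return [c for c in conditions if c.child != child]


# Illustrative three-condition path (all names and numbers are invented).
conditions = [
    Condition("burger", None, 0, 4),      # a burger costs 4
    Condition("salad", "burger", 2, 1),   # a salad costs 2 * burger + 1
    Condition("pizza", "salad", 1, 3),    # a pizza costs salad + 3
]

answerable = solve(conditions, "pizza")
unanswerable = solve(cut(conditions, "salad"), "pizza")
print(answerable, unanswerable)
```

In this sketch, cutting the `salad` condition corresponds to removing a necessary condition from the middle of the path; a model that still reports a price for the pizza would be hallucinating.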

Code & Models

Repositories

j-bagel/treecut-math
Official

Datasets

jouyang/treecut-math
Dataset · 187 downloads

Videos

TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation

Taxonomy

Topics: Artificial Intelligence in Education · Intelligent Tutoring Systems and Adaptive Learning · Mathematics, Computing, and Information Processing