TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation
Jialin Ouyang

TL;DR
TreeCut is a synthetic dataset designed to evaluate large language models' ability to recognize unanswerable math problems, revealing their tendency to hallucinate confidently on such questions.
Contribution
We introduce TreeCut, a novel dataset that systematically generates unanswerable math problems to evaluate LLM hallucinations, highlighting persistent challenges in model reasoning.
Findings
In their worst-case zero-shot settings, GPT-4o hallucinates on 64% of unanswerable problems and o3-mini on 44%
Deeper and more complex trees increase hallucination likelihood
Removing necessary conditions near the middle of a question path raises hallucination rates
Abstract
Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident yet unfounded answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that systematically generates an unlimited number of unanswerable math word problems and their answerable counterparts by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under the zero-shot setting. Further analysis highlights that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of…
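The tree-and-cut idea in the abstract can be illustrated with a minimal sketch. This is not the authors' generator: the condition encoding, the function names (make_chain, solve, cut), and the use of additive offsets are all assumptions made for illustration. The sketch builds a single path from a root variable to a target variable (a degenerate tree), adds one off-path branch, and checks that removing any condition on the path leaves the target undetermined.

```python
import random

def make_chain(depth, rng):
    """Build a path of conditions from variable 0 (root) to variable `depth`.

    Hypothetical encoding: ("value", var, number) fixes a variable, and
    ("offset", child, parent, delta) states child = parent + delta.
    """
    conditions = [("value", 0, rng.randint(2, 9))]
    for i in range(1, depth + 1):
        conditions.append(("offset", i, i - 1, rng.randint(1, 5)))
    return conditions

def solve(conditions, target):
    """Propagate known values; return the target's value, or None if it
    cannot be determined from the given conditions (unanswerable)."""
    known = {}
    changed = True
    while changed:
        changed = False
        for cond in conditions:
            if cond[0] == "value":
                _, var, number = cond
                if var not in known:
                    known[var] = number
                    changed = True
            else:
                _, child, parent, delta = cond
                if parent in known and child not in known:
                    known[child] = known[parent] + delta
                    changed = True
                elif child in known and parent not in known:
                    known[parent] = known[child] - delta
                    changed = True
    return known.get(target)

def cut(conditions, index):
    """Return a copy of the conditions with the one at `index` removed."""
    return conditions[:index] + conditions[index + 1:]

rng = random.Random(0)
path = make_chain(3, rng)                # conditions 0..3 lie on the path
problem = path + [("offset", 99, 1, 2)]  # condition 4 is an off-path branch
target = 3

answerable = solve(problem, target)
# Cutting any on-path condition makes the target unanswerable.
cut_results = [solve(cut(problem, i), target) for i in range(len(path))]
# Cutting the off-path branch leaves the answer unchanged.
off_path_result = solve(cut(problem, len(path)), target)
```

In this toy setting, the position of the removed condition along the path (near the root, the middle, or the target) is simply the index passed to cut, which is the kind of knob the analysis in the abstract varies.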
Taxonomy
Topics: Artificial Intelligence in Education · Intelligent Tutoring Systems and Adaptive Learning · Mathematics, Computing, and Information Processing
