Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks

Wenhan Dong; Tianyi Hu; Jingyi Zheng; Zhen Sun; Yuemeng Zhao; Yule Liu; Xinlei He; Xinyi Huang

arXiv:2505.23843 · cs.CL · June 2, 2025

TL;DR

This paper critically examines the limitations of current evaluation methods for large language models in multi-round incomplete information tasks, revealing issues like shortcut-taking and rigid patterns, and proposes improved evaluation standards.

Contribution

It introduces a refined evaluation framework that includes reasoning path inspection, diversified metrics, and human comparison to better assess LLM reasoning capabilities.

Findings

01. Existing benchmarks often produce misleading results.

02. Current evaluation metrics fail to detect shortcut behaviors.

03. Proposed standards improve assessment reliability.

Abstract

Multi-round incomplete information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Currently, research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals novel insights into the limitations of existing methods, as they often yield misleading results that fail to uncover key issues, such as shortcut-taking behaviors, rigid patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
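
To make the idea of diversified assessment metrics concrete, here is a minimal sketch of how accuracy could be reported alongside simple signals for premature termination and rigid questioning patterns. It is an illustration under stated assumptions, not the authors' method: the `Episode` record, its fields, and the two heuristics (`ended_prematurely`, `repeated_question_rate`) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """One multi-round game log (hypothetical schema, not from the paper)."""
    questions: List[str]  # questions the model asked, one per round
    solved: bool          # whether the final answer matched the reference
    max_rounds: int       # round budget allowed for the task


def repeated_question_rate(ep: Episode) -> float:
    """Fraction of rounds that repeat an earlier question (a crude rigidity proxy)."""
    if not ep.questions:
        return 0.0
    normalized = [q.strip().lower() for q in ep.questions]
    return 1.0 - len(set(normalized)) / len(normalized)


def ended_prematurely(ep: Episode) -> bool:
    """Assumed heuristic: the model stopped before its budget without solving."""
    return (not ep.solved) and len(ep.questions) < ep.max_rounds


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Report accuracy together with the two behavioral signals."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        "accuracy": sum(ep.solved for ep in episodes) / n,
        "premature_termination_rate": sum(ended_prematurely(ep) for ep in episodes) / n,
        "mean_repeated_question_rate": sum(repeated_question_rate(ep) for ep in episodes) / n,
    }


if __name__ == "__main__":
    logs = [
        Episode(["Is it a man?", "is it a man?", "Was it night?"], True, 10),
        Episode(["Is it alive?"], False, 10),
    ]
    print(summarize(logs))
```

Reporting these values side by side is one way a high accuracy score could be checked against the behaviors the abstract names; a fuller treatment would also inspect the reasoning path itself and compare against human play, which this sketch does not attempt.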

Taxonomy

Topics: EEG and Brain-Computer Interfaces

Methods: Sparse Evolutionary Training