LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

Shulin Huang; Shirong Ma; Yinghui Li; Mengzuo Huang; Wuhe Zou; Weidong Zhang; Hai-Tao Zheng

arXiv:2308.10855·cs.CL·March 19, 2024·1 cites

TL;DR

LatEval introduces an interactive benchmark to evaluate LLMs' lateral thinking abilities through puzzles, revealing that even advanced models like GPT-4 still lag behind humans in this challenging aspect.

Contribution

This paper presents LatEval, a novel benchmark for assessing lateral thinking in LLMs within an interactive setting, highlighting their current limitations.

Findings

01 LLMs struggle with lateral thinking during interactions.
02 GPT-4 shows some advantage but still lags behind humans.
03 LatEval provides a challenging task for improving AI assistants.

Abstract

With the continuous evolution and refinement of LLMs, they are endowed with impressive logical reasoning, or vertical thinking, capabilities. But can they think outside the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses a model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs on two aspects: the quality of the questions posed by the model and the model's capability to integrate information for problem-solving. We find that nearly all LLMs struggle to employ lateral thinking during interactions. For example, even the most advanced model, GPT-4, shows some advantage, yet a noticeable gap remains when compared to humans. This evaluation benchmark provides LLMs with a highly challenging and distinctive…
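The interactive setup described in the abstract can be pictured with a minimal sketch. This is not the benchmark's actual code: the names `Puzzle`, `Transcript`, `play`, the callback signatures, and the fixed turn budget are all hypothetical, chosen only to illustrate a player that asks questions about an incomplete story, a host that answers from the hidden truth, and a final guess that has to integrate what was learned. The sample puzzle is a well-known riddle used for illustration, not an item from LatEval.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# One (question, answer) exchange between player and host.
Turn = Tuple[str, str]


@dataclass
class Puzzle:
    story: str  # incomplete scenario shown to the player (hypothetical field)
    truth: str  # hidden explanation known only to the host (hypothetical field)


@dataclass
class Transcript:
    turns: List[Turn] = field(default_factory=list)
    final_guess: str = ""


def play(
    puzzle: Puzzle,
    ask: Callable[[str, List[Turn]], str],    # player: story + history -> question
    answer: Callable[[str, str], str],        # host: truth + question -> reply
    guess: Callable[[str, List[Turn]], str],  # player: story + history -> solution
    max_turns: int = 5,                       # assumed turn budget
) -> Transcript:
    """Run one puzzle: question rounds first, then a single final guess."""
    transcript = Transcript()
    for _ in range(max_turns):
        question = ask(puzzle.story, transcript.turns)
        reply = answer(puzzle.truth, question)
        transcript.turns.append((question, reply))
    transcript.final_guess = guess(puzzle.story, transcript.turns)
    return transcript


if __name__ == "__main__":
    # Toy stand-ins for the player and host; a real run would call LLMs here.
    puzzle = Puzzle(
        story="A man asks a bartender for water. The bartender pulls out a gun. "
              "The man says thank you and leaves.",
        truth="The man had hiccups; the scare cured them.",
    )
    result = play(
        puzzle,
        ask=lambda story, history: f"Question {len(history) + 1}: was he thirsty?",
        answer=lambda truth, question: "No",
        guess=lambda story, history: "He wanted to cure his hiccups.",
        max_turns=2,
    )
    for question, reply in result.turns:
        print(question, "->", reply)
    print("Final guess:", result.final_guess)
```

Under this framing, the two aspects named in the abstract map onto two parts of the transcript: the questions in `turns` reflect question quality, and `final_guess` reflects how well the collected answers were integrated. How either is scored is not specified here and is left out of the sketch.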

Code & Models

Repositories

thukelab/lateval (PyTorch, official)

Taxonomy

Topics: Topic Modeling · AI in Service Interactions

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections