Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

Tian Liang; Zhiwei He; Jen-tse Huang; Wenxuan Wang; Wenxiang Jiao; Rui Wang; Yujiu Yang; Zhaopeng Tu; Shuming Shi; Xing Wang

arXiv:2310.20499·cs.CL·November 7, 2023·1 cites

TL;DR

This paper introduces a novel, game-based framework using word guessing games to evaluate large language models' intelligence, emphasizing strategic communication, adaptability, and multi-agent interaction, offering a cost-effective alternative to traditional datasets.

Contribution

It proposes two frameworks, DEEP and SpyGame, which assess LLMs' expression, disguising, and strategic skills through interactive, multi-agent language games.

Findings

01. DEEP effectively measures LLMs' descriptive and disguising abilities.

02. SpyGame captures LLMs' strategic thinking and adaptability.

03. The framework is easy to implement across multiple languages and domains.

Abstract

The automatic evaluation of LLM-based agent intelligence is critical in developing advanced LLM-based agents. Although considerable effort has been devoted to developing human-annotated evaluation datasets, such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game "Who is Spy", we propose to use the word guessing game to assess the intelligence performance of LLMs. Given a word, the LLM is asked to describe the word and determine its identity (spy or not) based on its and other players' descriptions. Ideally, an advanced agent should possess the ability to accurately describe a given word using an aggressive description while concurrently maximizing confusion in the conservative description, enhancing its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and…
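
The game setup described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`assign_words`, `tally_votes`), the example word pair, and the majority-vote rule are all assumptions taken from the commonly played rules of the game, not from the paper. The step where an LLM generates a description is left out.

```python
from collections import Counter
from typing import Dict, List, Optional


def assign_words(players: List[str], common_word: str,
                 spy_word: str, spy: str) -> Dict[str, str]:
    """Give every player the common word, except the spy, who gets a related word.

    Hypothetical helper: names and signature are illustrative only.
    """
    if spy not in players:
        raise ValueError("spy must be one of the players")
    return {p: (spy_word if p == spy else common_word) for p in players}


def tally_votes(votes: Dict[str, str]) -> Optional[str]:
    """Return the most-voted player, or None if first place is tied.

    `votes` maps each voter to the player they suspect of being the spy.
    The majority-vote rule is an assumption about how a round is resolved.
    """
    ranked = Counter(votes.values()).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place: nobody is singled out
    return ranked[0][0]


# Example round with four players; "C" is the spy (assumed word pair).
players = ["A", "B", "C", "D"]
words = assign_words(players, common_word="apple", spy_word="pear", spy="C")
votes = {"A": "C", "B": "C", "C": "A", "D": "C"}
suspect = tally_votes(votes)
```

In a full evaluation, each player's description would come from an LLM prompted with its assigned word, and the votes would be derived from the descriptions seen so far.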

Code & Models

Repositories

skytliang/spygame (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification