UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models

Zhanyue Qin; Haochuan Wang; Deyuan Liu; Ziyang Song; Cunhang Fan; Zhao Lv; Jinlin Wu; Zhen Lei; Zhiying Tu; Dianhui Chu; Xiaoyan Yu; Dianbo Sui

arXiv:2406.16382·cs.CL·June 25, 2024

TL;DR

This paper introduces UNO Arena, a novel evaluation framework using the UNO card game to assess the sequential decision-making abilities of large language models, and proposes the TUTRI player to enhance these capabilities.

Contribution

It presents UNO Arena as a new benchmark for evaluating LLMs' decision-making, and introduces the TUTRI player to improve LLM performance in sequential tasks.

Findings

01 TUTRI player significantly outperforms vanilla LLM players.

02 UNO Arena provides a dynamic and effective evaluation environment.

03 LLMs can be enhanced to better handle sequential decision-making.

Abstract

Sequential decision-making refers to algorithms that take into account the dynamics of the environment, where early decisions affect subsequent decisions. With large language models (LLMs) demonstrating powerful capabilities across tasks, we can't help but ask: Can Current LLMs Effectively Make Sequential Decisions? To answer this question, we propose the UNO Arena, based on the card game UNO, to evaluate the sequential decision-making capability of LLMs, and we explain in detail why we chose UNO. In UNO Arena, we evaluate the sequential decision-making capability of LLMs dynamically with novel metrics based on Monte Carlo methods. We set up random players, DQN-based reinforcement learning players, and LLM players (e.g. GPT-4, Gemini-pro) for comparison testing. Furthermore, in order to improve the sequential decision-making capability of LLMs, we propose the TUTRI player, which can…
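
The abstract does not spell out the Monte Carlo metrics, so the following is only a minimal sketch of the general idea of estimating a player's strength by simulating many games. It is not the paper's metric or game engine. Everything in it is an assumption made for illustration: the toy rules (number cards only, no action or wild cards, draw one card when no move is legal) and the hypothetical names `make_deck`, `legal_moves`, `random_policy`, `color_heavy_policy`, `play_game`, and `estimate_win_rate`.

```python
import random

COLORS = ["red", "yellow", "green", "blue"]


def make_deck():
    # Toy deck (assumption): number cards 0-9 in four colors, two copies each.
    return [(color, number) for color in COLORS for number in range(10)] * 2


def legal_moves(hand, top):
    # A card is playable if it matches the top card's color or number.
    return [card for card in hand if card[0] == top[0] or card[1] == top[1]]


def random_policy(hand, top, rng):
    # Baseline player: pick uniformly among the legal cards.
    return rng.choice(legal_moves(hand, top))


def color_heavy_policy(hand, top, rng):
    # Simple heuristic player: play the legal card whose color is most
    # common in the current hand, so later turns are more likely to match.
    moves = legal_moves(hand, top)
    return max(moves, key=lambda card: sum(1 for h in hand if h[0] == card[0]))


def play_game(policies, rng, hand_size=7, max_turns=1000):
    # Simulate one toy game; return the winner's index, or None if nobody
    # empties their hand within max_turns.
    deck = make_deck()
    rng.shuffle(deck)
    hands = [[deck.pop() for _ in range(hand_size)] for _ in policies]
    discard = [deck.pop()]
    for turn in range(max_turns):
        player = turn % len(policies)
        top = discard[-1]
        if legal_moves(hands[player], top):
            card = policies[player](hands[player], top, rng)
            hands[player].remove(card)
            discard.append(card)
            if not hands[player]:
                return player
        else:
            if not deck:
                # Recycle the discard pile (except the top card) as a new deck.
                deck = discard[:-1]
                discard = [top]
                rng.shuffle(deck)
            if deck:
                hands[player].append(deck.pop())
    return None


def estimate_win_rate(policies, player, n_games=2000, seed=0):
    # Monte Carlo estimate: fraction of simulated games won by `player`.
    rng = random.Random(seed)
    wins = sum(play_game(policies, rng) == player for _ in range(n_games))
    return wins / n_games


if __name__ == "__main__":
    rate = estimate_win_rate([color_heavy_policy, random_policy], player=0)
    print(f"Estimated win rate of the heuristic player: {rate:.3f}")
```

Because each early card choice changes which cards remain playable later, even this toy setting shows why a single-turn score is not enough and why outcomes are averaged over many simulated games. An LLM player would take the place of one of the policy functions.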

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam