Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

Oguzhan Topsakal; Colby Jacob Edell; Jackson Bailey Harper

arXiv:2407.07796·cs.AI·July 12, 2024

TL;DR

This paper introduces an extensible benchmark that uses grid-based games to evaluate the strategic ability and rule comprehension of large language models, and it provides detailed game data and a leaderboard for comparative analysis.

Contribution

It presents a novel open-source framework for benchmarking LLMs with grid-based games, including a comprehensive dataset and leaderboard for assessing performance across multiple models and prompt types.

Findings

01  Performance varies significantly across LLMs and game types.

02  The analysis covers win rates, disqualifications, and invalid moves.

03  The leaderboard and detailed game data are openly accessible.

Abstract

We introduce a novel and extensible benchmark for large language models (LLMs) through grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku. The open-source game simulation code, available on GitHub, allows LLMs to compete and generates detailed data files in JSON, CSV, TXT, and PNG formats for leaderboard rankings and further analysis. We present the results of games among leading LLMs, including Claude 3.5 Sonnet and Claude 3 Sonnet by Anthropic, Gemini 1.5 Pro and Gemini 1.5 Flash by Google, GPT-4 Turbo and GPT-4o by OpenAI, and Llama3-70B by Meta. We also encourage submissions of results from other LLMs. In total, we simulated 2,310 matches (5 sessions for each pair among 7 LLMs and a random player) across three types of games, using three distinct prompt types: list, illustration, and image. The results revealed significant variations in LLM performance across different…
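
The abstract gives a total of 2,310 matches but the excerpt above does not spell out how that total breaks down. The sketch below shows one decomposition that reproduces it. It rests on two assumptions that the excerpt does not confirm: each pairing is played in both orders (so pairs are counted as ordered), and one of the seven LLMs is left out of the image prompt type. The function and variable names are illustrative only and are not taken from the benchmark's code.

```python
from itertools import permutations

SESSIONS_PER_PAIRING = 5                             # stated in the abstract
GAMES = ["Tic-Tac-Toe", "Connect Four", "Gomoku"]    # stated in the abstract


def ordered_pairings(n_players: int) -> int:
    """Count (first mover, second mover) pairings among n distinct players."""
    return len(list(permutations(range(n_players), 2)))


def total_matches(players_per_prompt: dict[str, int]) -> int:
    """Sum matches over prompt types: pairings x sessions x games."""
    return sum(
        ordered_pairings(n) * SESSIONS_PER_PAIRING * len(GAMES)
        for n in players_per_prompt.values()
    )


# ASSUMPTION: 7 LLMs + 1 random player for the text-based prompt types,
# and one LLM fewer for the image prompt type.
players_per_prompt = {"list": 8, "illustration": 8, "image": 7}

print(total_matches(players_per_prompt))
```

Under these assumptions the list and illustration prompts contribute 840 matches each and the image prompt contributes 630, which sums to the reported 2,310. Other breakdowns may be possible; the full paper is the authority on the actual experimental design.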

Code & Models

Repositories

research-outcome/llm-game-benchmark (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax