SimulBench: Evaluating Language Models with Creative Simulation Tasks

Qi Jia; Xiang Yue; Tianyu Zheng; Jie Huang; Bill Yuchen Lin

arXiv:2409.07641·cs.CL·September 13, 2024

SimulBench: Evaluating Language Models with Creative Simulation Tasks

Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin

PDF

Open Access

TL;DR

SimulBench is a new benchmark for evaluating large language models on creative simulation tasks, using multi-turn dialogues and GPT-4 for automatic assessment, revealing significant performance gaps among models.

Contribution

Introduces SimulBench, a novel benchmark with a fair evaluation framework for creative simulation tasks involving multi-round interactions and GPT-4-based automatic scoring.

Findings

01

Simulation tasks remain challenging for LLMs.

02

GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.

03

Open LLMs lag behind proprietary models in simulation tasks.

Abstract

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to collect dialogues first under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on \DataName{}, GPT-4 is employed as the evaluator, tasked with reviewing the quality of…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Educational Games and Gamification · Human Motion and Animation

MethodsAttention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer