Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Axel Backlund; Lukas Petersson

arXiv:2502.15840·cs.AI·February 25, 2025

Open Access

TL;DR

Vending-Bench is a new benchmark environment that tests the long-term decision-making coherence of LLM-based agents in a simulated vending machine business, revealing high variance and failure modes over extended interactions.

Contribution

This paper introduces Vending-Bench, a simulated long-horizon test environment for LLM-based agents, and uses it to show where these agents struggle to sustain coherent decision-making over extended tasks.

Findings

01. High variance in LLM performance over long horizons

02. Failures not correlated with context window limits

03. Models can acquire capital but often derail in long-term tasks

Abstract

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed specifically to test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees. Each of these tasks is simple, but collectively, over long horizons (>20M tokens per run), they stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or…
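To make the loop described in the abstract concrete, here is a minimal Python sketch of a vending-machine simulation with inventory, orders, prices, and a daily fee. It is an illustration only, not the Vending-Bench implementation: the class name `ToyVendingSim`, the starting cash, the fee, the unit cost, the three-day delivery delay, and the linear demand curve are all assumptions chosen for readability.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVendingSim:
    """Hypothetical toy environment; NOT the authors' Vending-Bench code."""

    cash: float = 500.0       # assumed starting balance
    daily_fee: float = 2.0    # assumed fixed fee charged every day
    unit_cost: float = 1.0    # assumed wholesale cost per unit
    delivery_days: int = 3    # assumed delay between order and arrival
    price: float = 2.0        # current selling price per unit
    stock: int = 0            # units currently in the machine
    day: int = 0
    pending: list = field(default_factory=list)  # (arrival_day, units)

    def place_order(self, units: int) -> bool:
        """Pay for an order now; the goods arrive after delivery_days."""
        cost = units * self.unit_cost
        if units <= 0 or cost > self.cash:
            return False
        self.cash -= cost
        self.pending.append((self.day + self.delivery_days, units))
        return True

    def set_price(self, price: float) -> None:
        self.price = max(0.0, price)

    def demand(self) -> int:
        # Assumed linear demand: 20 units at price 0, 4 fewer per dollar.
        return max(0, int(20 - 4 * self.price))

    def step(self) -> int:
        """Advance one day: receive deliveries, sell, charge the fee."""
        self.day += 1
        arrived = sum(u for d, u in self.pending if d <= self.day)
        self.pending = [(d, u) for d, u in self.pending if d > self.day]
        self.stock += arrived
        sold = min(self.stock, self.demand())
        self.stock -= sold
        self.cash += sold * self.price - self.daily_fee
        return sold


sim = ToyVendingSim()
sim.place_order(100)
sales = [sim.step() for _ in range(3)]
print(sales, sim.stock, sim.cash)
```

In this toy version, an agent that orders on day 0 sees no sales until the delivery arrives, while the daily fee keeps draining cash. An agent that loses track of that delay could misread the situation, which loosely mirrors the delivery-schedule failures the abstract mentions.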

Taxonomy

Topics: Multi-Agent Systems and Negotiation

Methods: Sparse Evolutionary Training