Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning
Panayiotis Danassis, Naman Goel

TL;DR
This paper benchmarks LLM-generated code against human-coded solutions in a complex logistics optimization game, revealing that humans outperform current LLMs in strategic planning and optimization tasks.
Contribution
It introduces a new reasoning-driven benchmark for real-world logistics problems and compares LLMs with human-coded agents, highlighting current limitations of LLMs in strategic coding tasks.
Findings
Humans outperform LLM-coded agents in the benchmark.
Most LLM agents are beaten by simple baselines.
LLMs struggle to improve upon human solutions when prompted.
Abstract
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including…
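To make the two requirements concrete, here is a minimal sketch of one common baseline idea: bidding the marginal cost of a task, that is, the extra travel cost of adding it to the tasks already won, plus a small margin. This is not the paper's method or any evaluated agent. The task format `(pickup_xy, delivery_xy, weight)`, the function names `route_cost` and `marginal_cost_bid`, the Euclidean distance model, and the `margin` parameter are all assumptions made for illustration, and the brute-force planner is only practical for a handful of tasks.

```python
from itertools import permutations
from math import dist, inf

# Hypothetical task format (an assumption): (pickup_xy, delivery_xy, weight)

def route_cost(start, tasks, capacity, cost_per_km=1.0):
    """Cheapest cost to pick up and deliver every task, found by brute force.

    Returns inf when no ordering respects the vehicle capacity.
    Only suitable for very small task sets (factorial search).
    """
    if not tasks:
        return 0.0
    events = [(i, kind) for i in range(len(tasks)) for kind in ("pickup", "deliver")]
    best = inf
    for order in permutations(events):
        pos, load, carried, km, feasible = start, 0, set(), 0.0, True
        for i, kind in order:
            pickup, delivery, weight = tasks[i]
            if kind == "pickup":
                load += weight
                if load > capacity:      # capacity constraint violated
                    feasible = False
                    break
                carried.add(i)
                target = pickup
            else:
                if i not in carried:     # cannot deliver before picking up
                    feasible = False
                    break
                load -= weight
                target = delivery
            km += dist(pos, target)
            pos = target
        if feasible:
            best = min(best, km)
    return best * cost_per_km

def marginal_cost_bid(start, won_tasks, new_task, capacity, margin=0.1):
    """Bid the extra cost of serving new_task, scaled by (1 + margin).

    Returns None when the task cannot be served at all.
    """
    base = route_cost(start, won_tasks, capacity)
    with_new = route_cost(start, won_tasks + [new_task], capacity)
    if with_new == inf:
        return None
    return (with_new - base) * (1.0 + margin)
```

For example, `marginal_cost_bid((0, 0), [], ((0, 0), (3, 4), 1), capacity=1, margin=0.0)` evaluates the cost of a single 5-unit delivery. A strategic bidder would go further, for instance by modelling opponents' bids or the value of future tasks, which this sketch deliberately leaves out.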
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Language and cultural evolution
