DeliveryBench: Can Agents Earn Profit in Real World?
Lingjun Mao, Jiawei Ren, Kun Zhou, Jixuan Chen, Ziqiao Ma, Lianhui Qin

TL;DR
DeliveryBench is a realistic, city-scale benchmark that evaluates embodied agents on food delivery, emphasizing long-term profit maximization and constraint management. It shows that current agents still fall short of human performance.
Contribution
We propose DeliveryBench, a novel benchmark for assessing long-horizon, constraint-aware planning of embodied agents in realistic, city-scale environments based on food delivery.
Findings
VLM-based agents underperform compared to humans.
Current agents are short-sighted and violate constraints.
Distinct agent personalities emerge, showing diversity and brittleness.
Abstract
LLMs and VLMs are increasingly deployed as embodied agents, yet existing benchmarks largely revolve around simple short-term tasks and struggle to capture rich realistic constraints that shape real-world decision making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark grounded in the real-world profession of food delivery. Food couriers naturally operate under long-horizon objectives (maximizing net profit over hours) while managing diverse constraints, e.g., delivery deadline, transportation expense, vehicle battery, and necessary interactions with other couriers and customers. DeliveryBench instantiates this setting in procedurally generated 3D cities with diverse road networks, buildings, functional locations, transportation modes, and realistic resource dynamics, enabling systematic evaluation of constraint-aware, long-horizon planning. We benchmark a…
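As a rough illustration of the objective the abstract describes (net profit over a shift, subject to deadlines, transportation expense, and battery limits), the Python sketch below scores a sequence of deliveries. Every name, field, and rule here, including the `Delivery` class, the `net_profit` function, and the late-delivery penalty, is a hypothetical assumption for illustration and is not taken from DeliveryBench itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Delivery:
    """Hypothetical record of one completed order (not the benchmark's schema)."""
    payout: float          # fee earned if delivered on time
    deadline_min: float    # deadline, in minutes from shift start
    finish_min: float      # actual completion time, in minutes from shift start
    transport_cost: float  # money spent on transport for this order
    battery_used: float    # fraction of a full charge consumed


def net_profit(deliveries: Iterable[Delivery],
               battery_capacity: float = 1.0,
               late_penalty: float = 0.5) -> float:
    """Sum payouts minus costs, stopping once the battery cannot cover an order.

    Assumed rules: a late delivery earns only (1 - late_penalty) of its payout,
    and the courier never recharges during the shift.
    """
    profit = 0.0
    battery = battery_capacity
    for d in deliveries:
        if d.battery_used > battery:
            break  # not enough charge left; the remaining orders are skipped
        battery -= d.battery_used
        on_time = d.finish_min <= d.deadline_min
        earned = d.payout if on_time else d.payout * (1.0 - late_penalty)
        profit += earned - d.transport_cost
    return profit


shift = [
    Delivery(payout=10.0, deadline_min=30, finish_min=25, transport_cost=2.0, battery_used=0.4),
    Delivery(payout=8.0, deadline_min=60, finish_min=70, transport_cost=1.0, battery_used=0.4),
    Delivery(payout=12.0, deadline_min=90, finish_min=80, transport_cost=3.0, battery_used=0.4),
]
print(net_profit(shift))
```

In this toy shift the second order is late and the third is never attempted because the battery runs out, which is the kind of short-sighted outcome a long-horizon planner would try to avoid (for example, by recharging or by declining the late order).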
Taxonomy
Topics: Transportation and Mobility Innovations · Multimodal Machine Learning Applications · Mobile Crowdsensing and Crowdsourcing
