RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

TL;DR
RetailBench provides a comprehensive benchmark for assessing long-term decision-making and strategy stability of LLM agents in realistic retail scenarios, highlighting current limitations and guiding future improvements.
Contribution
The paper introduces RetailBench, a new benchmark for long-horizon decision-making in retail environments, and proposes the Evolving Strategy & Execution framework for adaptive, interpretable strategies.
Findings
The Evolving Strategy & Execution framework improves the stability and efficiency of LLM agents.
Agent performance declines as task complexity increases.
Current LLMs face fundamental challenges in long-horizon, multi-factor decision tasks.
Abstract
Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than…
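The separation the abstract describes, with strategy revised on a slower clock than action execution, can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `Strategy`, `revise_strategy`, `execute`, and `run`, the 1.5x restocking rule, the uniform demand model, and the 7-day revision interval are all hypothetical, and no LLM is involved.

```python
import random
from dataclasses import dataclass


@dataclass
class Strategy:
    # Hypothetical high-level plan: the stock level the executor restocks toward.
    target_stock: int


def revise_strategy(strategy, recent_demand):
    """Slow loop (assumed rule): set the target to 1.5x the mean recent demand."""
    if not recent_demand:
        return strategy
    mean = sum(recent_demand) / len(recent_demand)
    return Strategy(target_stock=max(1, round(1.5 * mean)))


def execute(strategy, stock):
    """Fast loop: order just enough units to reach the current target."""
    return max(0, strategy.target_stock - stock)


def run(days=30, revise_every=7, seed=0):
    rng = random.Random(seed)  # seeded so the toy demand is reproducible
    strategy = Strategy(target_stock=10)
    stock, history, revisions = 0, [], 0
    for day in range(days):
        # Strategy changes only every `revise_every` days ...
        if day > 0 and day % revise_every == 0:
            strategy = revise_strategy(strategy, history[-revise_every:])
            revisions += 1
        # ... while an action is executed every day.
        stock += execute(strategy, stock)
        demand = rng.randint(5, 15)  # stand-in for stochastic demand
        history.append(demand)
        stock -= min(stock, demand)  # sales cannot exceed stock on hand
    return strategy, revisions, stock
```

Calling `run(days=30, revise_every=7)` executes an order on each of the 30 days but revises the strategy only four times (days 7, 14, 21, and 28), which is the two-timescale structure in miniature. In an agent setting, both functions would presumably be replaced by model calls; that mapping is an assumption here.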
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
