RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Linghua Zhang; Jun Wang; Jingtong Wu; Zhisong Zhang

arXiv:2603.16453 · cs.AI · March 18, 2026

TL;DR

RetailBench is a benchmark for assessing the long-horizon decision-making and strategy stability of LLM agents in realistic retail scenarios. It exposes the current limitations of these agents and points to where they need to improve.

Contribution

The paper introduces RetailBench, a new benchmark for long-horizon decision-making in retail environments, and proposes the Evolving Strategy & Execution framework for adaptive, interpretable strategies.

Findings

01. The proposed Evolving Strategy & Execution framework improves the stability and efficiency of LLM agents.

02. Performance declines as task complexity increases.

03. Current LLMs face fundamental challenges in long-horizon, multi-factor decision tasks.

Abstract

Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than…
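The separation of slow strategic revision from fast action execution described in the abstract can be illustrated with a toy two-timescale loop. This is a minimal sketch under stated assumptions, not the authors' implementation or the RetailBench environment: the `Strategy` fields, the demand model, the unit cost, the weekly review interval, and the rule-based `revise_strategy` (standing in for an LLM strategist) are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Strategy:
    """High-level plan. Fields are hypothetical, not taken from the paper."""
    price: float
    reorder_level: int


def revise_strategy(strategy, window_sales, window_stockouts):
    """Slow loop: adjust the plan from a summary of the last review window."""
    price, reorder = strategy.price, strategy.reorder_level
    if window_stockouts > 0:
        # Demand outran stock at least once: hold more inventory, raise price.
        reorder += 5
        price = round(price * 1.05, 2)
    elif window_sales < reorder:
        # Very weak sales over the window: cut price and hold less stock.
        price = round(price * 0.95, 2)
        reorder = max(5, reorder - 5)
    return Strategy(price, reorder)


def execute_step(strategy, inventory):
    """Fast loop: turn the current plan into a concrete daily order quantity."""
    return max(0, strategy.reorder_level - inventory)


def simulate(days=30, review_every=7, seed=0):
    """Run the toy store; the strategy changes only on review days."""
    rng = random.Random(seed)
    strategy = Strategy(price=10.0, reorder_level=20)
    inventory, profit, revisions = 20, 0.0, 0
    window_sales = window_stockouts = 0
    for day in range(1, days + 1):
        # Execution happens every day under the current strategy.
        order = execute_step(strategy, inventory)
        inventory += order
        profit -= 6.0 * order  # assumed unit cost
        # Assumed stochastic demand that falls as the price rises.
        demand = max(0, int(rng.gauss(30 - 1.5 * strategy.price, 4)))
        sold = min(demand, inventory)
        inventory -= sold
        profit += strategy.price * sold
        window_sales += sold
        window_stockouts += int(demand > sold)
        # Strategy revision happens on a slower timescale.
        if day % review_every == 0:
            strategy = revise_strategy(strategy, window_sales, window_stockouts)
            revisions += 1
            window_sales = window_stockouts = 0
    return strategy, round(profit, 2), revisions
```

Calling `simulate(days=30, review_every=7)` runs 30 execution steps but only four strategy revisions, which is the point of the split: the plan stays fixed and inspectable between reviews, while daily actions follow it mechanically.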

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling