DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
Hsuvas Borkakoty, Sebastian Pohl, Cheng Wang, Bei Chen, Yufang Hou

TL;DR
DRIP-R is a benchmark designed to evaluate LLM-based agents' decision-making in real-world retail scenarios with inherent policy ambiguities, highlighting the challenges posed by such ambiguities.
Contribution
The paper introduces a benchmark built on real-world retail policy ambiguities, together with a multi-judge evaluation framework and a realistic conversational simulation.
Findings
Frontier models often disagree on policy-ambiguous scenarios.
Ambiguity significantly challenges LLM decision-making.
DRIP-R provides a systematic way to evaluate handling of policy ambiguity.
Abstract
LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities, and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that…
