DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

Hsuvas Borkakoty; Sebastian Pohl; Cheng Wang; Bei Chen; Yufang Hou

arXiv:2605.07699·cs.CL·May 11, 2026

TL;DR

DRIP-R is a benchmark for evaluating how LLM-based agents make decisions in real-world retail return scenarios where the governing policy is ambiguous and admits more than one valid interpretation.

Contribution

The paper introduces a benchmark built on real-world retail policy ambiguities, together with a multi-judge evaluation framework and a realistic conversational simulation.

Findings

01

Frontier models often disagree on policy-ambiguous scenarios.

02

Ambiguity significantly challenges LLM decision-making.

03

DRIP-R provides a systematic way to evaluate handling of policy ambiguity.

Abstract

LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities, and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that…
