When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State
Peiying Zhu, Sidi Chang

TL;DR
This paper introduces a trace-based evaluation paradigm that tests whether reinforcement learning agents preserve behavioral discipline under hidden competitor state, rather than only whether they hit business KPIs.
Contribution
The paper proposes discipline stability, an evaluation framework that separates behavioral fidelity from outcome metrics through trace diagnostics, ablations, and transfer and deployment tests. It is demonstrated on a two-hotel pricing benchmark and a compact hidden-budget bidding task.
Findings
Reward-only PPO variants fail to maintain trace alignment.
Revealing hidden state reduces label uncertainty.
Trace-prior or corrected-history policies better preserve price and bid distributions.
Abstract
Outcome-only evaluation can certify economically unsafe agents: a policy can hit a business KPI while violating deployable behavioral discipline. In hotel pricing with hidden competitor state, a learner can achieve plausible revenue per available room while failing to preserve the rate discipline of a rule-based revenue-management competitor. We introduce discipline stability, a trace-based evaluation paradigm: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, separate mechanisms with ablations, and test transfer and deployment. Across a two-hotel benchmark and a compact hidden-budget bidding task, reward-only PPO variants miss trace alignment; revealing hidden state reduces label uncertainty; deterministic copy collapses uncertainty; and trace-prior or corrected history policies better preserve price or bid…
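The gap between outcome and discipline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the eight-night traces are invented, and total variation distance between price distributions stands in for a trace diagnostic; the paper's actual diagnostics are not specified in this summary.

```python
from collections import Counter


def revpar(prices, booked):
    """Revenue per available room: booked revenue divided by room-nights offered."""
    return sum(p for p, b in zip(prices, booked) if b) / len(prices)


def total_variation(trace_a, trace_b):
    """Total variation distance between the empirical distributions of two traces.

    Illustrative stand-in for a trace-alignment diagnostic (an assumption,
    not the paper's metric). 0.0 means identical distributions, 1.0 disjoint.
    """
    count_a, count_b = Counter(trace_a), Counter(trace_b)
    support = set(count_a) | set(count_b)
    return 0.5 * sum(
        abs(count_a[x] / len(trace_a) - count_b[x] / len(trace_b))
        for x in support
    )


# Hypothetical one-room, eight-night traces (invented data).
benchmark_prices = [100, 100, 120, 120, 100, 100, 120, 120]  # rule-based competitor
learner_prices = [80, 160, 80, 160, 80, 160, 80, 160]        # learned policy
booked = [1, 1, 1, 0, 1, 1, 1, 0]                            # same occupancy for both

benchmark_kpi = revpar(benchmark_prices, booked)
learner_kpi = revpar(learner_prices, booked)
price_gap = total_variation(benchmark_prices, learner_prices)

print(f"benchmark RevPAR: {benchmark_kpi}")
print(f"learner RevPAR:   {learner_kpi}")
print(f"price-trace TV:   {price_gap}")
```

In this toy case both traces yield the same revenue per available room, yet the learner never posts a price the benchmark uses, so an outcome-only check would pass a policy that a trace-level check flags.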
