Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
Peiying Zhu, Sidi Chang

TL;DR
This paper identifies failure modes of revenue-management RL agents under partial observability, shows that scalar outcome metrics such as RevPAR can mask them, and proposes a trace-based diagnostic and repair method called Trace-Prior RL.
Contribution
The paper introduces a trace-level diagnostic protocol and Trace-Prior RL, a method that learns a market prior to improve distributional pricing policies when the competitor's state is hidden.
Findings
Deterministic RL collapses uncertainty into shortcut behaviors.
Trace-Prior RL matches market metrics within seed-level uncertainty.
Higher action accuracy can worsen trace alignment when optimizing distributionally.
Abstract
Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability. Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions,…
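The abstract's central claim, that matching RevPAR does not imply matching market behavior, can be illustrated with a minimal sketch. The code below uses the standard industry definitions of RevPAR, occupancy, and ADR; the function names, the toy traces, and the choice of total variation distance to compare price-bucket distributions are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def trace_metrics(prices, sold, rooms_available):
    """Scalar outcome metrics for one trace.

    prices[i] is the posted price on night i, sold[i] the rooms sold that
    night, and rooms_available the fixed nightly capacity (hypothetical setup).
    """
    revenue = sum(p * s for p, s in zip(prices, sold))
    total_sold = sum(sold)
    capacity = rooms_available * len(prices)
    return {
        "revpar": revenue / capacity,          # revenue per available room
        "occupancy": total_sold / capacity,    # share of room-nights sold
        "adr": revenue / total_sold if total_sold else 0.0,  # average daily rate
    }


def bucket_distribution(bucket_ids, n_buckets):
    """Empirical distribution over discrete price buckets 0..n_buckets-1."""
    counts = Counter(bucket_ids)
    total = len(bucket_ids)
    return [counts.get(b, 0) / total for b in range(n_buckets)]


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    An assumed choice of divergence, used here only as an example.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy traces: a market-like reference and an agent that always posts the low price.
market = trace_metrics(prices=[100, 200], sold=[10, 5], rooms_available=10)
agent = trace_metrics(prices=[100, 100], sold=[10, 10], rooms_available=10)

market_dist = bucket_distribution([0, 1], n_buckets=2)
agent_dist = bucket_distribution([0, 0], n_buckets=2)
bucket_gap = total_variation(market_dist, agent_dist)

print(market["revpar"], agent["revpar"])        # identical RevPAR
print(market["occupancy"], agent["occupancy"])  # different occupancy
print(bucket_gap)                               # non-zero bucket mismatch
```

In this toy case both traces earn the same RevPAR, yet the agent sells out every night at the low price and never uses the higher bucket, which is the kind of shortcut that a scalar metric alone would not reveal.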
