$\epsilon$-Policy Gradient for Online Pricing
Lukasz Szpruch, Tanut Treetanthiploet, Yufei Zhang

TL;DR
This paper introduces an $\epsilon$-policy gradient algorithm that combines model-based and model-free reinforcement learning for online pricing, achieving near-optimal regret bounds by balancing exploration and exploitation.
Contribution
It proposes a novel $\epsilon$-policy gradient method that extends the $\epsilon$-greedy algorithm with gradient-based learning and analyzes its regret in online pricing.
Findings
Achieves an expected regret of order $\mathcal{O}(\sqrt{T})$, up to a logarithmic factor, over $T$ trials.
Balances exploration and exploitation effectively in online pricing.
Provides theoretical analysis of regret bounds for the proposed algorithm.
Abstract
Combining model-based and model-free reinforcement learning approaches, this paper proposes and analyzes an $\epsilon$-policy gradient algorithm for the online pricing learning task. The algorithm extends the $\epsilon$-greedy algorithm by replacing greedy exploitation with a gradient descent step and facilitates learning via model inference. We optimize the regret of the proposed algorithm by quantifying the exploration cost in terms of the exploration probability $\epsilon$ and the exploitation cost in terms of the gradient descent optimization and gradient estimation errors. The algorithm achieves an expected regret of order $\mathcal{O}(\sqrt{T})$ (up to a logarithmic factor) over $T$ trials.
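The loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or its tuning: the linear demand model, the exploration schedule $\epsilon_t = 1/\sqrt{t}$, the fixed step size, the price bounds, and the function name `epsilon_policy_gradient` are all assumptions made for illustration. With probability $\epsilon_t$ the seller posts a random price (exploration); otherwise it refits a demand model by least squares (model inference) and moves the current price one gradient step along the estimated revenue, which is the same as a gradient descent step on the negative revenue.

```python
import numpy as np

def epsilon_policy_gradient(T=2000, a=10.0, b=2.0, noise=0.5,
                            p_min=0.5, p_max=4.5, step=0.05, seed=0):
    """Toy epsilon-policy gradient pricing loop (illustrative assumptions only).

    True demand is assumed linear: demand = a - b * price + noise,
    so the revenue-maximizing price is a / (2 * b).
    """
    rng = np.random.default_rng(seed)
    prices, demands = [], []
    p = p_min            # current exploitation price
    revenue = 0.0
    for t in range(1, T + 1):
        eps = min(1.0, 1.0 / np.sqrt(t))   # assumed exploration schedule
        # Explore with probability eps, or while there is too little data to fit.
        if rng.random() < eps or len(prices) < 2:
            price = rng.uniform(p_min, p_max)
        else:
            # Model inference: least-squares fit of demand ~ a_hat - b_hat * price.
            X = np.column_stack([np.ones(len(prices)), -np.asarray(prices)])
            a_hat, b_hat = np.linalg.lstsq(X, np.asarray(demands), rcond=None)[0]
            # Gradient of estimated revenue p * (a_hat - b_hat * p) at the current price.
            grad = a_hat - 2.0 * b_hat * p
            # One gradient step instead of jumping to the greedy optimum.
            p = float(np.clip(p + step * grad, p_min, p_max))
            price = p
        demand = a - b * price + noise * rng.normal()
        prices.append(price)
        demands.append(demand)
        revenue += price * demand
    return p, revenue

if __name__ == "__main__":
    final_price, total_revenue = epsilon_policy_gradient()
    print(f"final price: {final_price:.3f}, optimal price: {10.0 / (2 * 2.0):.3f}")
```

In this toy setting the exploration rounds cost revenue at a rate governed by $\epsilon_t$, while the exploitation rounds lose revenue through the finite step size and the error in the fitted gradient, which mirrors the two cost terms the abstract says the regret analysis balances.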
Taxonomy
Topics: Auction Theory and Applications · Consumer Market Behavior and Pricing · Advanced Bandit Algorithms Research
