RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
Andrew Choi, Wei Xu

TL;DR
RankQ introduces a self-supervised ranking loss to improve offline-to-online reinforcement learning, enabling better policy refinement and transfer in sparse reward and vision-based robotic tasks.
Contribution
RankQ proposes a ranking-based Q-learning objective that improves value estimation by learning relative action preferences, and it is reported to outperform prior methods on several benchmarks.
Findings
RankQ achieves state-of-the-art performance on D4RL benchmarks.
In robot learning, RankQ significantly improves simulation success rates.
RankQ enables effective sim-to-real transfer in robotic manipulation.
Abstract
Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this pessimism essentially acts as a behavior-cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that…
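To make the idea of augmenting temporal-difference learning with a ranking term concrete, here is a minimal NumPy sketch. It is not the authors' objective: the abstract is cut off before it specifies which actions are ranked above which, so the hinge-style pairwise loss, the `margin`, the `rank_weight`, and all function names below are illustrative assumptions. The caller supplies the pairs of Q-values that should be ordered.

```python
import numpy as np


def td_loss(q_sa, reward, q_next, done, gamma=0.99):
    """Mean squared one-step TD error.

    q_next is the bootstrapped value of the next state; in practice it
    would come from a target network and be treated as a constant.
    """
    target = reward + gamma * (1.0 - done) * q_next
    return float(np.mean((q_sa - target) ** 2))


def pairwise_ranking_loss(q_preferred, q_other, margin=0.1):
    """Hinge penalty whenever a preferred action does not outscore the
    other action by at least `margin` (an assumed form, not from the paper)."""
    return float(np.mean(np.maximum(0.0, margin - (q_preferred - q_other))))


def combined_loss(q_sa, reward, q_next, done, ranked_pairs, rank_weight=1.0):
    """TD loss plus a weighted sum of ranking terms.

    ranked_pairs is a list of (q_preferred, q_other) arrays; how those
    pairs are chosen is left to the caller and is hypothetical here.
    """
    rank = sum(pairwise_ranking_loss(hi, lo) for hi, lo in ranked_pairs)
    return td_loss(q_sa, reward, q_next, done) + rank_weight * rank
```

Unlike a uniform penalty on unseen actions, a ranking term of this kind only constrains the relative order of Q-values, so it contributes nothing to the loss once the preferred action leads by the margin.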
