TL;DR
DROL is a dynamic routing method for offline reinforcement learning that lets a one-step actor improve actions locally while keeping inference cheap. It performs competitively on the OGBench and D4RL benchmarks.
Contribution
The paper proposes DROL, a novel latent-conditioned one-step actor with top-1 dynamic routing, allowing local action improvements without sacrificing inference efficiency.
Findings
DROL performs competitively on OGBench and D4RL benchmarks.
It improves on the baseline methods in many OGBench task groups.
DROL remains effective on AntMaze and Adroit environments.
Abstract
One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one-step extraction pipelines, a strong iterative teacher provides one target action for each latent draw, and the same student output is asked to do both jobs: move toward higher Q and stay near that paired endpoint. If those two directions disagree, the loss resolves them as a compromise on that same sample, even when a nearby better action remains locally supported by the data. We propose DROL, a latent-conditioned one-step actor trained with top-1 dynamic routing. For each state, the actor samples candidate actions from a bounded latent prior, assigns each dataset action to its nearest candidate, and updates only that winner…
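The routing step described in the abstract (sample candidates, match the dataset action to its nearest candidate, update only the winner) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's implementation: the uniform prior on [-1, 1], the `toy_actor` function, the squared-distance metric, and the number of candidates. The sketch covers only the assignment step and does not model the critic term or anything beyond where the abstract is cut off.

```python
import math
import random


def sample_latents(num_candidates, latent_dim, rng):
    """Draw latents from a bounded prior (assumed here: uniform on [-1, 1])."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]
            for _ in range(num_candidates)]


def toy_actor(state, latent, weight=0.5):
    """Hypothetical one-step actor: a single pass maps (state, latent) to an action."""
    return [math.tanh(s + weight * z) for s, z in zip(state, latent)]


def top1_route(candidates, dataset_action):
    """Return (winner index, squared distance) for the candidate nearest the dataset action."""
    dists = [sum((c - a) ** 2 for c, a in zip(cand, dataset_action))
             for cand in candidates]
    winner = min(range(len(dists)), key=dists.__getitem__)
    return winner, dists[winner]


def routed_loss(candidates, dataset_action):
    """Per-candidate loss terms: only the routed winner can receive a non-zero term."""
    winner, dist = top1_route(candidates, dataset_action)
    return [dist if k == winner else 0.0 for k in range(len(candidates))]


rng = random.Random(0)
state = [0.1, -0.3]  # toy state; same length as the latent in this sketch
latents = sample_latents(num_candidates=4, latent_dim=2, rng=rng)
candidates = [toy_actor(state, z) for z in latents]
losses = routed_loss(candidates, dataset_action=[0.2, -0.1])
```

In this toy version, the non-winning candidates contribute zero loss, so a gradient step on `losses` would move only the candidate that was already closest to the dataset action. That leaves the other candidates free of the behavior constraint for this sample.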
