Discrete Flow Matching for Offline-to-Online Reinforcement Learning
Fairoz Nower Khan, Nabuat Zaman Nahim, Peizhong Ju

TL;DR
This paper introduces DRIFT, an offline-to-online reinforcement learning method for discrete action spaces. It fine-tunes a continuous-time Markov chain (CTMC) policy with advantage-weighted flow matching and a candidate-set approximation.
Contribution
The paper proposes a fine-tuning approach for discrete-action RL that preserves pretrained knowledge through a path-space penalty and adapts efficiently to new data through candidate-set sampling.
Findings
DRIFT achieves stable offline-to-online improvement across multiple discrete RL tasks.
The path-space penalty remains bounded during fine-tuning, aiding stability.
Candidate-set approximation improves efficiency and convergence in large action spaces.
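As a rough illustration of the candidate-set idea, the sketch below normalizes the policy over a small subset of actions instead of the full action space. It assumes the subset contains the observed action plus uniformly sampled alternatives; the summary does not say how DRIFT builds its candidate sets, so the function name `candidate_set_log_prob`, its parameters, and the uniform sampling are all hypothetical.

```python
import numpy as np

def candidate_set_log_prob(logits, action, num_candidates, rng):
    """Log-probability of `action` under a softmax restricted to a candidate set.

    Hypothetical sketch: candidates are the given action plus distinct
    actions drawn uniformly at random from the remaining ones.
    """
    num_actions = logits.shape[0]
    # All actions except the observed one are eligible alternatives.
    others = np.delete(np.arange(num_actions), action)
    sampled = rng.choice(others, size=num_candidates - 1, replace=False)
    # The observed action always sits at index 0 of the candidate set.
    candidates = np.concatenate(([action], sampled))
    cand_logits = logits[candidates]
    # Numerically stable log-softmax over the candidates only.
    shifted = cand_logits - cand_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return log_probs[0], candidates
```

The cost per update scales with `num_candidates` rather than with the number of actions. Because the restricted normalizer omits terms, the restricted log-probability is never smaller than the full one, and the two coincide when the candidate set covers every action.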
Abstract
Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets, and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address these challenges, we introduce DRIFT, an online fine-tuning method that updates an offline-pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small…
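The two loss ingredients named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the advantage weight is assumed to be a clipped exponential `exp(A / temperature)`, the flow matching term is assumed to be a cross-entropy on the model's prediction of the clean action (sampling of the noised intermediate action is not shown), and the path-space term uses the standard per-state integrand of the KL divergence between two CTMC path measures, sum over y != x of u log(u/v) - u + v. All function names and parameters are hypothetical.

```python
import numpy as np

def advantage_weights(advantages, temperature=1.0, max_weight=20.0):
    """Clipped exponential advantage weights (assumed form)."""
    return np.minimum(np.exp(advantages / temperature), max_weight)

def weighted_dfm_loss(pred_logits, target_actions, advantages, temperature=1.0):
    """Advantage-weighted cross-entropy on the predicted clean action.

    pred_logits: (batch, num_actions) model outputs for noised inputs.
    target_actions: (batch,) integer actions from the data.
    """
    shifted = pred_logits - pred_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each target action.
    nll = -log_probs[np.arange(len(target_actions)), target_actions]
    weights = advantage_weights(advantages, temperature)
    return float((weights * nll).mean())

def ctmc_kl_rate(rates_new, rates_ref, eps=1e-12):
    """Instantaneous KL rate between two CTMCs at one state and time.

    rates_new, rates_ref: jump rates to every other state, with the entry
    for the current state set to zero. Integrating this quantity along
    trajectories of the new chain gives the path-space KL divergence.
    """
    log_ratio = np.log(rates_new + eps) - np.log(rates_ref + eps)
    return float(np.sum(rates_new * log_ratio - rates_new + rates_ref))
```

In this sketch, the rate term is zero when the fine-tuned and reference chains have identical jump rates and positive otherwise, which is why a penalty of this kind constrains the whole trajectory distribution and not only the final action distribution.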
