What Does Flow Matching Bring To TD Learning?
Bhavya Agrawalla, Michal Nauman, Aviral Kumar

TL;DR
Flow matching enhances TD learning by enabling test-time recovery and promoting plasticity in critics, leading to significant performance and sample efficiency gains in challenging RL settings.
Contribution
This work reveals that flow matching improves TD learning through integration-based value readout and dense velocity supervision, distinct from distributional RL explanations.
Findings
Flow-matching critics outperform monolithic critics by 2x in final performance.
Flow-matching critics are about 5x more sample-efficient.
Flow matching maintains stability in high-UTD online RL problems.
Abstract
Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that their success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through test-time recovery, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant…
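The two ingredients named in the abstract, integration-based value readout and velocity supervision at several interpolation times, can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear interpolant between a noise sample and the target, the Euler integrator, the function names, and the idealized closed-form velocity field for a single fixed scalar target (`td_target`, a hypothetical stand-in for a TD target). It only shows why an error injected early in the integration can be damped by later steps under such a field.

```python
import numpy as np

def interpolant(z0, y, t):
    """Linear interpolant between noise z0 (t=0) and target value y (t=1)."""
    return (1.0 - t) * z0 + t * y

def velocity_target(z0, y):
    """Velocity regression target under the linear interpolant (assumed form)."""
    return y - z0

def flow_matching_loss(velocity_fn, z0, y, ts):
    """Mean squared velocity error over several times t (dense supervision)."""
    errors = [
        (velocity_fn(interpolant(z0, y, t), t) - velocity_target(z0, y)) ** 2
        for t in ts
    ]
    return float(np.mean(errors))

def read_out_value(velocity_fn, z0, num_steps, perturb_at=None, perturb=0.0):
    """Euler-integrate the velocity field from t=0 to t=1.

    The endpoint is the scalar value estimate. If perturb_at is set, an error
    is added to the intermediate estimate just before that step.
    """
    z, dt = z0, 1.0 / num_steps
    for k in range(num_steps):
        if k == perturb_at:
            z = z + perturb  # inject an error into the intermediate estimate
        z = z + dt * velocity_fn(z, k * dt)
    return z

# Hypothetical scalar target; in TD learning this would come from a bootstrapped
# backup, which is not modeled here.
td_target = 3.0

def ideal_velocity(z, t):
    """Idealized velocity field that always points at the fixed target."""
    return (td_target - z) / (1.0 - t)

loss = flow_matching_loss(ideal_velocity, z0=-1.0, y=td_target, ts=[0.0, 0.25, 0.5, 0.75])
clean = read_out_value(ideal_velocity, z0=-1.0, num_steps=8)
recovered = read_out_value(ideal_velocity, z0=-1.0, num_steps=8, perturb_at=1, perturb=5.0)
print(loss, clean, recovered)
```

In this idealized setting the perturbed and unperturbed readouts end at the same value, because each remaining Euler step pulls the estimate back toward the target. A learned velocity field would only approximate this behavior, and a critic that outputs its value in a single forward pass has no comparable later step in which to correct an early error.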
