Generative Actor Critic
Aoyang Qin, Deqian Kong, Wei Wang, Ying Nian Wu, Song-Chun Zhu, Sirui Xie

TL;DR
Generative Actor Critic (GAC) introduces a generative modeling approach to reinforcement learning, enabling effective offline policy evaluation and improvement through trajectory and return modeling, with strong empirical results.
Contribution
GAC reframes RL policy evaluation and improvement as generative modeling and inference, introducing a latent variable model for offline RL that enhances offline-to-online transfer.
Findings
Strong offline performance on Gym-MuJoCo and Maze2D benchmarks.
Significantly improved offline-to-online transfer compared to state-of-the-art methods.
Effective even without step-wise rewards.
Abstract
Conventional Reinforcement Learning (RL) algorithms, typically focused on estimating or maximizing expected returns, face challenges when refining offline pretrained models with online experiences. This paper introduces Generative Actor Critic (GAC), a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns, and policy improvement as performing versatile inference on this learned model. To operationalize GAC, we introduce a specific instantiation based on a latent variable model that features continuous latent plan vectors. We develop novel inference strategies for both exploitation, by optimizing latent plans to maximize expected returns, and exploration, by sampling latent plans conditioned on dynamically adjusted…
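The two inference modes named in the abstract can be illustrated with a toy sketch. This is not the paper's model: it assumes a hypothetical linear-Gaussian stand-in in which a latent plan z has prior N(0, I) and the return satisfies y | z ~ N(w·z + b, sigma^2). The names `exploit_latent`, `explore_latent`, `w`, `b`, `sigma`, `lam`, `y_best`, and `delta_y` are all invented for illustration. Exploitation is shown as gradient ascent on z against the expected return with a prior penalty, and exploration as posterior sampling of z given a raised target return y+ = y + Δy, the form quoted in the peer reviews.

```python
import numpy as np

def exploit_latent(w, b, lam=1.0, lr=0.1, steps=200):
    """Gradient ascent on z for the toy objective w.z + b - (lam/2)*||z||^2.

    The quadratic penalty plays the role of the log-prior on z; without it
    the linear expected return would be unbounded.
    """
    z = np.zeros_like(w)
    for _ in range(steps):
        grad = w - lam * z          # gradient of the objective w.r.t. z
        z = z + lr * grad
    return z

def explore_latent(w, b, sigma, y_target, rng, n_samples=1):
    """Sample z from the exact posterior p(z | y = y_target) of the toy model."""
    d = w.shape[0]
    precision = np.eye(d) + np.outer(w, w) / sigma**2   # prior + likelihood precision
    cov = np.linalg.inv(precision)
    mean = cov @ w * (y_target - b) / sigma**2
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    return samples, mean

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])   # hypothetical return-head weights
b = 0.3                          # hypothetical return-head bias
sigma = 0.5                      # hypothetical return noise scale

# Exploitation: move the latent plan toward higher expected return.
z_star = exploit_latent(w, b)

# Exploration: condition on a target return above the best one seen so far.
y_best = 4.0                     # hypothetical best observed return
delta_y = 1.0                    # hypothetical optimism increment
y_plus = y_best + delta_y
samples, post_mean = explore_latent(w, b, sigma, y_plus, rng, n_samples=5)
```

In this toy setting the ascent converges to w / lam, and the posterior mean pulls the predicted return toward y+ while the posterior covariance keeps some diversity among the sampled plans. A learned nonlinear model would need approximate inference in place of the closed-form posterior used here.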
Peer Reviews
Decision · Submitted to ICLR 2026
1. Principled formulation: Viewing RL as generative inference offers a clear conceptual bridge between diffusion-style generative models and reinforcement learning.
2. Trajectory-level modeling: Operates on trajectory distributions rather than step-wise transitions, enabling sparse-reward settings.
3. Solid performance on standard benchmarks: Demonstrates strong offline-to-online adaptation even without step-level reward signals.
1. Outdated baselines: The experimental comparison primarily uses older O2O RL methods (e.g., ODT, Cal-QL, LPT). Comparisons with more recent approaches are missing. This makes it difficult to judge whether improvements stem from the GAC formulation or simply from implementation differences.
2. Limited novelty / incremental over Decision Diffuser: Conceptually, GAC can be seen as a latent-variable extension of Decision Diffuser with a minor modification of the target formulation (the optimistic $y^
The paper is well written and analyzes the proposed approach holistically, with convincing evidence. It also motivates the problem well, and the motivation is easy to follow all the way through to the proposed solutions. I also liked the analysis in Figure 3, which intuitively shows the effect of online fine-tuning, as well as Figure 4, which analyzes the quality of the actor and the critic. Figure 5 also motivates a natural extension and future work by
- As pointed out by the authors, neither GAC nor LPT can incorporate step-wise rewards, which should be used whenever available; Table 8 reflects the consequence of not incorporating them. That being said, it would be interesting to see a heuristic for autonomous re-planning that could, for example, incorporate those step-wise rewards to "mine" for interesting states and use them.
- On one hand, the GAC paper builds on top of the LPT paper and proposes some modifications to
* The first strength of this manuscript is the estimation of the joint distribution over trajectories and returns, which differs from traditional RL methods and may be particularly suitable for long-horizon planning.
* The online fine-tuning step in GAC is critical. The exploration idea of $y^{+} = y+\Delta y$ seems to be an efficient way to improve exploration.
* The idea of applying a latent variable in generative modeling is appealing. It combines the merits of generative modeling (dim
1. Some key hyper-parameters rely on manual tuning, and the robustness of the proposed method needs to be improved. For example, $\Delta y$ is important for online fine-tuning; however, the authors have not proposed an auto-tuning method. This also applies to the hyper-parameters in latent variable modeling.
2. The reviewer appreciates the idea of modeling the joint distribution of trajectories and returns. However, it would be better if additional theoretical analysis were provided. Why t
Taxonomy
Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Advanced Bandit Algorithms Research
