Beyond the Proxy: Trajectory-Distilled Guidance for Offline GFlowNet Training
Ruishuo Chen, Xun Wang, Rui Hu, Zhuoran Li, Longbo Huang

TL;DR
This paper introduces TD-GFN, a proxy-free offline GFlowNet training method that uses inverse reinforcement learning to learn dense, transition-level edge rewards from offline trajectories, improving exploration and robustness while avoiding the error propagation that affects proxy-based methods.
Contribution
TD-GFN is a novel proxy-free framework that learns dense transition rewards from offline data, enhancing exploration and robustness in offline GFlowNet training.
Findings
Outperforms existing methods in convergence speed
Achieves higher sample quality
Demonstrates robustness in offline settings
Abstract
Generative Flow Networks (GFlowNets) are effective at sampling diverse, high-reward objects, but in many real-world settings where new reward queries are infeasible, they must be trained from offline datasets. The prevailing proxy-based training methods are susceptible to error propagation, while existing proxy-free approaches often use coarse constraints that limit exploration. To address these issues, we propose Trajectory-Distilled GFlowNet (TD-GFN), a novel proxy-free training framework. TD-GFN learns dense, transition-level edge rewards from offline trajectories via inverse reinforcement learning to provide rich structural guidance for efficient exploration. Crucially, to ensure robustness, these rewards are used indirectly to guide the policy through DAG pruning and prioritized backward sampling of training trajectories. This ensures that final gradient updates depend only on…
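The two guidance mechanisms named in the abstract, DAG pruning and prioritized backward sampling, can be pictured with a small toy example. This is only a minimal sketch under stated assumptions, not the paper's implementation: the toy graph, the reward values, the fixed pruning threshold, and the choice to sample parents in proportion to edge reward are all hypothetical, and the names `prune_edges` and `sample_backward` are invented for illustration. In TD-GFN the edge rewards would come from inverse reinforcement learning; here they are hard-coded.

```python
import random

# Hypothetical toy DAG: (parent, child) -> estimated edge reward.
# In TD-GFN these values would be learned via IRL; here they are made up.
edge_reward = {
    ("s0", "a"): 0.9,
    ("s0", "b"): 0.4,
    ("a", "x"): 0.8,
    ("b", "x"): 0.05,
    ("a", "y"): 0.3,
    ("b", "y"): 0.6,
}


def prune_edges(rewards, threshold):
    """Keep only edges whose estimated reward is at least `threshold`.

    The fixed threshold is an assumption made for this sketch.
    """
    return {edge: r for edge, r in rewards.items() if r >= threshold}


def sample_backward(rewards, terminal, source, rng):
    """Walk from `terminal` back to `source` over the surviving edges.

    At each step a parent is chosen with probability proportional to the
    reward of the incoming edge (an assumed prioritization rule).
    Returns the trajectory in forward order.
    """
    path = [terminal]
    state = terminal
    while state != source:
        incoming = [(p, r) for (p, c), r in rewards.items() if c == state]
        if not incoming:
            raise ValueError(f"{state!r} has no surviving parent")
        parents, weights = zip(*incoming)
        state = rng.choices(parents, weights=weights, k=1)[0]
        path.append(state)
    return path[::-1]


pruned = prune_edges(edge_reward, threshold=0.2)
trajectory = sample_backward(pruned, "y", "s0", random.Random(0))
```

In this sketch the edge rewards only decide which edges survive and which backward trajectories are drawn. They never enter a loss term, which mirrors the abstract's point that the rewards guide the policy indirectly. Note that an aggressive threshold can leave a state with no surviving parent, which the sketch reports as an error rather than handling.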
Peer Reviews
Decision · Submitted to ICLR 2026
**Novel Proxy-Free Paradigm**: Departing from existing paradigms, this work pioneers a proxy-free approach that leverages estimated edge rewards. This novel framework effectively circumvents the key limitations inherent in both proxy-based methods, such as error propagation, and prior proxy-free methods, which often rely on coarse-grained constraints.
**High Efficiency**: By integrating DAG pruning and prioritized backward sampling, TD-GFN sets a new state of the art for offline GFlowNet traini
**Insufficient Analysis for Design Choices**: Although this paper does a pretty good ablation study in the appendix, it lacks analysis on the detailed design choices and what problems this design might lead to. For instance, pruning seems like a more 'extreme' version of weighted sampling, so would it also work to remove this part while somewhat adjusting the weighted sampling method to make it 'harsher' for those low-reward edges? In practice, purely clean datasets are often difficult to obtai
* Turning trajectory data into dense, structural guidance via IRL-derived edge rewards is new in the GFlowNet offline setting.
* The pruning criterion and prioritized backward sampler are simple and interpretable.
* Pipeline and training objectives are well explained.
1. The claimed advantages of TD-GFN rest primarily on HyperGrid benchmarks, but the chosen grid size ($8^4$) is too small to offer convincing evidence. Moreover, HyperGrids are homogeneous across dimensions, so grid height (also equal to the trajectory length) matters far more than dimensionality. With such a short trajectory length ($=8$), environment-DAG pruning becomes less meaningful. Compared to grids with large trajectory lengths (e.g. $256$ and $512$), just making random transitions ca
The authors study an interesting setting of learning GFlowNets from offline data without relying on proxy reward models, which has high practical potential in my opinion. The paper is generally well-written and has a good structure. The presented experimental evaluation is thorough, presents a large number of baselines for comparison, studies the performance of the proposed method on data of different quality, provides various ablation studies, and uses a broad range of metrics for comparison of
I believe that this is a solid paper, however, there is one crucial weakness that prevents me from recommending acceptance in its current form. The setting studied in the paper is learning GFlowNets from trajectory-level data, i.e. data consisting of trajectories $s_0 \to s_1 \to \dots \to s_n$ and rewards $R(s_n)$ given for terminal states. However, the paper has no experiments on non-synthetic trajectory-level data, thus I believe it is hard to make conclusions about the potential utility of t
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Generative Adversarial Networks and Image Synthesis
Methods: ADaptive gradient method with the OPTimal convergence rate · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
