Evaluating GFlowNet from partial episodes for stable and flexible policy-based training
Puhua Niu, Shili Wu, Xiaoning Qian

TL;DR
This paper introduces an evaluation balance method for GFlowNets that makes policy-based training more reliable and flexible by estimating policy divergence from partial episodes. The method is reported to be effective on both synthetic and real-world tasks.
Contribution
It presents a novel evaluation balance objective that enhances divergence estimation and supports parameterized backward policies and offline data integration in GFlowNets.
Findings
Evaluation balance improves training stability.
Method supports parameterized backward policies.
Effective on both synthetic and real-world tasks.
Abstract
Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating the policy divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As…
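To make the phrase "balance over partial episodes" concrete, here is a minimal sketch of the standard sub-trajectory balance residual from the GFlowNet literature. It is not the paper's evaluation balance objective, whose exact form is not given in this summary. The function names, the list-based inputs, and the uniform weighting over partial episodes are all assumptions made for illustration.

```python
import math

def subtb_residual(log_flow, log_pf, log_pb, m, n):
    """Balance residual on the partial episode s_m -> ... -> s_n (m < n).

    log_flow[i]: log F(s_i), the log state flow, for i = 0..T
    log_pf[i]:   log P_F(s_{i+1} | s_i), forward policy, for i = 0..T-1
    log_pb[i]:   log P_B(s_i | s_{i+1}), backward policy, for i = 0..T-1
    """
    forward = log_flow[m] + sum(log_pf[m:n])
    backward = log_flow[n] + sum(log_pb[m:n])
    return forward - backward

def subtb_loss(log_flow, log_pf, log_pb):
    """Mean squared residual over all partial episodes of one trajectory.

    Uniform weighting is an assumption; weighted variants also exist.
    """
    T = len(log_pf)
    residuals = [
        subtb_residual(log_flow, log_pf, log_pb, m, n)
        for m in range(T)
        for n in range(m + 1, T + 1)
    ]
    return sum(r * r for r in residuals) / len(residuals)

# Hypothetical toy chain s_0 -> s_1 -> s_2 whose flows are exactly balanced.
balanced_flow = [math.log(4.0), math.log(2.0), math.log(1.0)]
toy_log_pf = [math.log(0.5), math.log(0.5)]
toy_log_pb = [math.log(1.0), math.log(1.0)]
balanced_loss = subtb_loss(balanced_flow, toy_log_pf, toy_log_pb)

# Perturbing the intermediate flow breaks balance on the partial episodes
# that start or end at s_1.
perturbed_flow = [math.log(4.0), math.log(3.0), math.log(1.0)]
perturbed_loss = subtb_loss(perturbed_flow, toy_log_pf, toy_log_pb)
```

In this toy case the loss is near zero when the flows satisfy the balance condition on every partial episode and becomes positive once the intermediate flow is perturbed. That is the signal a value-based objective of this kind minimizes.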
Peer Reviews
Decision: ICLR 2026 Poster
1. Sub-EB offers an alternative way to optimize value-based and policy-based GFlowNets; 2. Both on-policy and off-policy training regimes are considered, which broadens the method's applicability.
1. Results analogous to Theorems 1-2 in the current submission are well known from the perspective of entropy-regularized (soft) RL; see [Tiapkin et al, 2024]. In particular, Proposition 1 in [Tiapkin et al, 2024] is precisely the result of Theorem 1 for complete trajectories. In light of this result, the novelty and technical originality of Theorems 1 and 2 are limited. 2. It is unclear how much faster the algorithm is in wall-clock time, rather than in number of steps; 3. The paper d
- Strong theoretical analysis and results - I appreciated the theoretical connections established between the evaluation function and the flow function.
N/A
- The first (to my knowledge) policy-iteration-style approach for GFlowNet training, which shows consistent performance on various benchmarks, including high-dimensional ones such as 10-vertex BNs and SEH;
- The theoretical results seem to follow automatically from the GFlowNet-RL equivalence described in prior work (Deleu et al. 2024, Tiapkin et al. 2024). In particular, thanks to the graded DAG structure, Theorems 3.1 and 3.2 follow from Proposition 1 of Tiapkin et al. (2024), applied to a sub-DAG rooted at the vertex $s_h$, together with the interpretation of a value function as a negative KL-divergence up to a log-normalizing constant. Also, the paper does not cite a related work by Tiapkin
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning
