Aligning Latent Spaces with Flow Priors
Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo

TL;DR
This paper introduces a flow-based prior method for aligning learnable latent spaces with arbitrary target distributions, improving efficiency and flexibility in generative modeling.
Contribution
It proposes a novel alignment loss that simplifies latent space regularization by leveraging flow models, eliminating the need for likelihood evaluations and ODE solving.
Findings
Effective alignment of latent spaces demonstrated on ImageNet
Alignment loss closely approximates negative log-likelihood
Theoretical proof of the surrogate objective's validity
Abstract
This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape…
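The alignment loss described in the abstract can be illustrated with a small NumPy sketch. It rests on several assumptions that are not confirmed by the text above: the common flow matching convention with a linear path `x_t = (1 - t) * x0 + t * y` and target velocity `y - x0`, a standard normal base distribution, and a Gaussian prior whose marginal velocity field is known in closed form. That closed-form field stands in for the pretrained, frozen flow network; the names `gaussian_prior_velocity` and `alignment_loss` are hypothetical and do not come from the paper.

```python
import numpy as np

def gaussian_prior_velocity(x, t, mu, sigma):
    """Closed-form marginal velocity E[x1 - x0 | x_t = x] for
    x0 ~ N(0, I), x1 ~ N(mu, sigma^2 I), x_t = (1 - t) * x0 + t * x1.
    ASSUMPTION: used here as a stand-in for a pretrained, frozen v_theta."""
    var_t = (1.0 - t) ** 2 + (t * sigma) ** 2   # per-dimension Var[x_t]
    cov = t * sigma ** 2 - (1.0 - t)            # per-dimension Cov[x1 - x0, x_t]
    return mu + (cov / var_t) * (x - t * mu)

def alignment_loss(y, velocity_fn, n_samples=4096, seed=0):
    """Monte Carlo estimate of E_{t, x0} || v(x_t, t) - (y - x0) ||^2,
    where the latent y takes the place of the data sample in the
    flow matching objective (ASSUMED convention, see lead-in)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, size=(n_samples, 1))       # random times in [0, 1)
    x0 = rng.standard_normal((n_samples, y.shape[-1]))   # base noise samples
    x_t = (1.0 - t) * x0 + t * y                         # point on the linear path
    residual = velocity_fn(x_t, t) - (y - x0)            # predicted minus target velocity
    return float(np.mean(np.sum(residual ** 2, axis=-1)))

# Toy prior: a 2D isotropic Gaussian (hypothetical values).
mu = np.array([2.0, -1.0])
sigma = 0.5

def velocity(x, t):
    return gaussian_prior_velocity(x, t, mu, sigma)

# No ODE is solved and no likelihood is evaluated: only forward calls to the field.
loss_at_mode = alignment_loss(mu, velocity)
loss_near = alignment_loss(mu + 0.5, velocity)
loss_far = alignment_loss(mu + 3.0, velocity)
print(loss_at_mode, loss_near, loss_far)
```

In this toy setting the estimated loss grows as the latent moves away from the prior mean, which is the same ordering the Gaussian negative log-likelihood gives. With a learned flow model one would presumably replace `velocity` with the frozen network and backpropagate the loss into the encoder that produces `y`.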
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper handles an interesting problem using a novel mechanism.
- The paper provides theoretical and empirical analyses to support the proposed technique.
- The proposed method has high practical value and can be applied in a wide range of problems. In the experiments, the paper demonstrates it by aligning simple latents formed by ViT-L-based encoder-decoder systems to 4 different prior distributions, including low-level visual features from a VAE, continuous semantic visual features from Dino
- The method requires two training steps to align the space; thus, it aggregates errors from both steps. First, it depends heavily on the quality of the flow-matching model, which cannot capture the prior distribution perfectly. Second, the latent optimization process is also not guaranteed to converge. The authors should analyze error accumulation and the system's failure modes.
- The method requires two training steps, which are expensive. Computation cost should be reported.
- While the propo
- Simple (heuristic) objective that doesn't require ODE evaluation of flow models.
- Did an experiment on ImageNet-1K.
- "$v_\theta$ encapsulates the dynamics that transport probability mass from the base distribution $p_{\mathrm{init}}$ to the prior distribution $p_{\mathrm{prior}}$ along linear path": I don't think $v_\theta$ captures movement along linear paths. It is trained with linear paths, but the velocity field itself does not produce a linear path. Only optimal transport maps would produce linear paths, and that is not what flow matching solves.
- Figure 2, while intuitive in this case, will break down when the init and prior are overl
* The toy example with a mixture of five isotropic 2D Gaussians (section 5.1) nicely illustrates the proposed approach and helps the reader understand its principle. The paper also demonstrates a genuine desire for clarity by providing an "intuitive explanation" of the method in section 4.2.
* For image generation, the experiments are conducted with four very different priors, namely low-level and semantic embeddings (visual features), quantified visual features and even textual features.
- th
* The alignment results in section 5.2 (lines 397-412) rest on the *observation* of two curves and a comment on their (assumed) correlation, without reporting the correlation itself. There is no comparison to any baseline.
* The image generation results (section 5.2 and Table 1) are not really convincing. If one considers the results with classifier-free guidance (that is, the best ones), the performance is close to that of the basic priors (AE, KL, SoftVQ) and sometimes worse, in particular in terms of Precision
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · 3D Shape Modeling and Analysis
