FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision–Language Generation
Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann

TL;DR
FullFlow is a parameter-efficient method that upgrades pretrained text-to-image diffusion models into bidirectional vision–language generators, supporting text-to-image, image-to-text, and joint generation without extensive retraining.
Contribution
The paper introduces a lightweight adaptation recipe that trains only LoRA adapters and lightweight text heads, adding bidirectional capabilities to an existing text-to-image model without full retraining.
Findings
Significantly improves bidirectional generation metrics over previous state-of-the-art.
Reduces VRAM usage and increases training throughput substantially.
Supports downstream tasks like VQA with partial-text generation.
Abstract
Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision–language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text→image, image→text, joint…
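The idea of trajectory selection in a two-dimensional time space can be illustrated with a small sketch. This is not the authors' implementation: the function name `trajectory`, the mode strings, the uniform step grid, and the convention that 0 means "fully noised image / empty text" and 1 means "clean image / complete text" are all assumptions made for illustration. Only the three modes named before the abstract is cut off are shown.

```python
from typing import List, Tuple

# A point in the assumed 2D generative space: (t_image, t_text), each in [0, 1].
# Assumed convention: 0 = fully noised image / empty text, 1 = clean / complete.
Point = Tuple[float, float]


def trajectory(mode: str, steps: int) -> List[Point]:
    """Return steps + 1 points tracing a path through (t_image, t_text) space.

    Hypothetical helper: each generation mode is just a different path.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    grid = [i / steps for i in range(steps + 1)]
    if mode == "text_to_image":
        # Text is given (held at 1); only the image time advances.
        return [(t, 1.0) for t in grid]
    if mode == "image_to_text":
        # Image is given (held at 1); only the text time advances.
        return [(1.0, t) for t in grid]
    if mode == "joint":
        # Both modalities are generated together along the diagonal.
        return [(t, t) for t in grid]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("text_to_image", "image_to_text", "joint"):
        print(mode, trajectory(mode, 4))
```

Under these assumptions, all three modes end at the same corner (1, 1) and differ only in where they start and which axis they move along, which is one way to read "trajectory selection" as described in the abstract.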
