UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
Jiayun Wang, Yu Wang, Weijie Gan, Zhenting Wang, Wei Wei

TL;DR
UniVL introduces a unified vision-language embedding for spatially grounded image generation, enabling controllable, efficient image synthesis based on spatial instructions without a separate text encoder.
Contribution
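The paper presents a framework that binds semantics to spatial locations directly from a single unified visual input, in which the textual instruction is rendered onto the spatial mask. This removes the separate text encoder, reducing computation while improving image quality.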
Findings
Improves image quality, lowering FID from 14 to 11
Eliminates the need for a standalone text encoder
Reduces inference TFLOPs by up to 52%
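To put the reported numbers on a common scale, the short sketch below converts the FID change into a relative reduction and the TFLOPs saving into remaining compute. It uses only the rounded figures listed above; the helper name `relative_reduction` is illustrative and not from the paper. The 52% figure is a stated upper bound ("up to"), so the remaining-compute value is a best case.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when a metric moves from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Rounded values as listed in the findings (lower FID is better).
fid_before, fid_after = 14.0, 11.0
fid_drop = relative_reduction(fid_before, fid_after)

# "Up to 52%" is a best case, so this is the minimum remaining compute.
max_tflops_saving = 0.52
min_remaining_compute = 1.0 - max_tflops_saving

print(f"FID relative reduction: {fid_drop:.1%}")                    # 21.4%
print(f"Remaining inference compute (best case): {min_remaining_compute:.0%}")  # 48%
```

In other words, the absolute FID drop of 3 points corresponds to roughly a one-fifth relative improvement, while inference in the best case needs about half the original compute.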
Abstract
We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL,…
