UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

Jiayun Wang; Yu Wang; Weijie Gan; Zhenting Wang; Wei Wei

arXiv:2605.21611 · cs.CV · May 22, 2026

TL;DR

UniVL introduces a unified vision-language embedding for spatially grounded image generation, enabling controllable, efficient image synthesis based on spatial instructions without a separate text encoder.

Contribution

The paper presents a novel framework that binds semantics to spatial locations directly from a unified visual input, reducing computation and improving image quality.

Findings

01  Improves image quality, reducing FID from 14 to 11

02  Eliminates the need for a standalone text encoder at inference time

03  Reduces inference TFLOPs by up to 52%
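
To put the reported numbers on a common scale, the short sketch below converts them into relative terms. It assumes the rounded figures quoted above (FID 14 and 11, a 52% TFLOPs saving), so the results are approximate; the helper name `relative_reduction` is illustrative and does not come from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity dropped, relative to its starting value."""
    return (before - after) / before


# FID is a lower-is-better metric, so a drop is an improvement.
fid_drop = relative_reduction(14.0, 11.0)

# A saving of "up to 52%" leaves at least 48% of the baseline inference cost.
remaining_compute = 1.0 - 0.52

print(f"FID relative reduction: {fid_drop:.1%}")
print(f"Inference compute remaining at the best case: {remaining_compute:.0%}")
```

Under these rounded inputs, the FID improvement is roughly a fifth of the baseline value, while the best-case compute cost is a little under half of the baseline.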

Abstract

We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL,…
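
The idea of a single unified visual input, with the instruction rendered onto the spatial mask, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the 3x5 bitmap glyphs, the 0/1/2 pixel coding, and the function names `make_mask` and `render_text_on_mask` are hypothetical and are not taken from the paper, which would presumably render text with a real font at image resolution.

```python
# Hypothetical 3x5 bitmap glyphs; a real pipeline would use a proper font renderer.
GLYPHS = {
    "C": ["111", "100", "100", "100", "111"],
    "A": ["010", "101", "111", "101", "101"],
    "T": ["111", "010", "010", "010", "010"],
}


def make_mask(height, width, box):
    """Binary mask: 1 inside box (top, left, bottom, right), 0 elsewhere.

    The bottom and right bounds are exclusive.
    """
    top, left, bottom, right = box
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]


def render_text_on_mask(mask, text, box):
    """Return a copy of the mask with the text stamped inside the box.

    Glyph pixels are marked with the value 2 (an assumed coding), so one
    array carries both the region ("where") and the instruction ("what").
    """
    top, left, bottom, right = box
    out = [row[:] for row in mask]
    col = left + 1  # one-pixel margin inside the region
    for ch in text:
        glyph = GLYPHS[ch]
        if col + 3 > right or top + 1 + 5 > bottom:
            raise ValueError("text does not fit inside the region")
        for dr, bits in enumerate(glyph):
            for dc, bit in enumerate(bits):
                if bit == "1":
                    out[top + 1 + dr][col + dc] = 2
        col += 4  # glyph width plus one pixel of spacing
    return out


box = (2, 3, 10, 17)
mask = make_mask(12, 20, box)
condition = render_text_on_mask(mask, "CAT", box)

# Print the unified condition: '.' outside, '#' region, '@' rendered text.
for row in condition:
    print("".join(".#@"[v] for v in row))
```

In this toy version, the resulting array is the only conditioning input: an encoder that can read text optically would recover both the region and the instruction from it, which is the property the abstract attributes to the UniVL encoder.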
