Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
Yabo Zhang, Kunchang Li, Dewei Zhou, Xinyu Huang, Xun Wang

TL;DR
INSET is a unified model embedding images as native tokens within text, improving multi-image instruction understanding and generation by leveraging transformer locality and a large synthetic dataset.
Contribution
The paper introduces two things: INSET, a model that embeds images directly within textual instructions, and a scalable data engine that supports training on 15 million synthetic interleaved samples.
Findings
INSET outperforms state-of-the-art methods in multi-image consistency and text alignment.
The performance gap over existing methods widens as input complexity increases.
The approach extends naturally to multimodal image editing.
Abstract
While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we…
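The locality argument in the abstract can be illustrated with a toy token layout. This is a minimal sketch and not the authors' implementation: the `<img:name>` slot syntax, the string stand-ins for visual tokens, and the function names `interleave`, `separated`, and `distance` are all hypothetical. The "images first, then text" baseline is one assumed reading of "structural separation".

```python
import re
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (kind, value), where kind is "text" or "image"

# Hypothetical slot syntax marking where an image belongs in the sentence.
PLACEHOLDER = re.compile(r"<img:(\w+)>")


def interleave(instruction: str, images: Dict[str, List[str]]) -> List[Token]:
    """Splice each image's tokens into the slot where the text refers to it."""
    sequence: List[Token] = []
    for piece in re.split(r"(<img:\w+>)", instruction):
        match = PLACEHOLDER.fullmatch(piece)
        if match:
            sequence.extend(("image", tok) for tok in images[match.group(1)])
        else:
            sequence.extend(("text", word) for word in piece.split())
    return sequence


def separated(instruction: str, images: Dict[str, List[str]]) -> List[Token]:
    """Assumed baseline: all image tokens first, then the text without slots."""
    sequence: List[Token] = []
    for name in PLACEHOLDER.findall(instruction):
        sequence.extend(("image", tok) for tok in images[name])
    text = PLACEHOLDER.sub("", instruction)
    sequence.extend(("text", word) for word in text.split())
    return sequence


def distance(sequence: List[Token], word: str, image_token: str) -> int:
    """Positions between a describing word and the first token of its image."""
    return abs(sequence.index(("text", word)) - sequence.index(("image", image_token)))


instruction = "the dog <img:dog> sits beside the cat <img:cat> on a sofa"
images = {"dog": ["dog_0", "dog_1"], "cat": ["cat_0", "cat_1"]}

inline = interleave(instruction, images)
apart = separated(instruction, images)
print(distance(inline, "cat", "cat_0"))  # 1
print(distance(apart, "cat", "cat_0"))   # 7
```

In the interleaved layout each describing word sits next to its image tokens, whereas in the separated layout the distance grows with the number of images and the length of the text, which is the long-range dependency the abstract refers to.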
