Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Yabo Zhang; Kunchang Li; Dewei Zhou; Xinyu Huang; Xun Wang

arXiv:2605.12305 · cs.CV · May 13, 2026

TL;DR

INSET is a unified model embedding images as native tokens within text, improving multi-image instruction understanding and generation by leveraging transformer locality and a large synthetic dataset.

Contribution

The paper introduces INSET, a model that embeds images directly into textual instructions, together with a scalable data engine that supports training on 15 million synthetic interleaved samples.

Findings

01

INSET outperforms state-of-the-art methods in multi-image consistency and text alignment.

02

The performance gap over existing methods widens as input complexity increases.

03

The approach extends naturally to multimodal image editing.

Abstract

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we…
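
To make the idea of placing visual features at their semantic slots concrete, the following toy sketch contrasts an interleaved token layout with a separated one (all image tokens first, then the text). It is an illustration under stated assumptions, not the paper's implementation: the `<img:name>` slot syntax, the function names, and the use of string stand-ins for image tokens are all hypothetical, and a real model would operate on embeddings rather than strings.

```python
import re
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (kind, value); kind is "text" or "image"

# Hypothetical slot syntax for this sketch; not taken from the paper.
PLACEHOLDER = re.compile(r"<img:(\w+)>")


def interleave(instruction: str, image_tokens: Dict[str, List[str]]) -> List[Token]:
    """Replace each <img:name> slot with that image's tokens, in place."""
    sequence: List[Token] = []
    pos = 0
    for match in PLACEHOLDER.finditer(instruction):
        # Text before the slot becomes whitespace-split text tokens.
        sequence += [("text", w) for w in instruction[pos:match.start()].split()]
        # The image's tokens sit exactly where the sentence refers to it.
        sequence += [("image", t) for t in image_tokens[match.group(1)]]
        pos = match.end()
    sequence += [("text", w) for w in instruction[pos:].split()]
    return sequence


def separated(instruction: str, image_tokens: Dict[str, List[str]]) -> List[Token]:
    """Contrast layout: all image tokens first, then the text with slots removed."""
    sequence: List[Token] = [("image", t) for toks in image_tokens.values() for t in toks]
    sequence += [("text", w) for w in PLACEHOLDER.sub("", instruction).split()]
    return sequence


def gap(sequence: List[Token], word: str, first_image_token: str) -> int:
    """Distance in positions between a referring word and its image's first token."""
    return abs(sequence.index(("text", word)) - sequence.index(("image", first_image_token)))


# Stand-in image tokens; a real system would use visual feature vectors.
image_tokens = {"hat": ["hat_0", "hat_1"], "dog": ["dog_0", "dog_1"]}
instruction = "Put the hat <img:hat> on the dog <img:dog> in a park"

inter = interleave(instruction, image_tokens)
sep = separated(instruction, image_tokens)

print("interleaved gap for 'dog':", gap(inter, "dog", "dog_0"))
print("separated gap for 'dog':  ", gap(sep, "dog", "dog_0"))
```

In this toy layout the word "dog" is adjacent to its image tokens when interleaved, while in the separated layout the same reference spans several positions. That is the kind of long-range dependency the abstract attributes to separated paradigms, shown here only at the level of sequence positions.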
