WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

Wei Chow; Jiachun Pan; Yongyuan Liang; Mingze Zhou; Xue Song; Liyu Jia; Saining Zhang; Siliang Tang; Juncheng Li; Fengda Zhang; Weijia Wu; Hanwang Zhang; Tat-Seng Chua

arXiv:2511.11434·cs.CV·November 17, 2025



Open Access · 1 Model · 1 Dataset

TL;DR

WEAVE introduces a comprehensive benchmark suite and large-scale dataset for multi-turn, context-dependent multimodal comprehension and generation, addressing a key gap in existing visual understanding research.

Contribution

The paper presents WEAVE, the first dataset and benchmark for in-context interleaved multimodal comprehension and generation tasks, enabling better evaluation of multi-turn, context-aware models.

Findings

01. Training on WEAVE-100k improves multimodal understanding and editing capabilities.

02. Models develop emergent visual-memory abilities with WEAVE training.

03. Current models still face significant challenges in multi-turn, context-aware image generation.

Abstract

Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the…
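To give a rough sense of scale, the figures quoted in the abstract can be turned into per-sample averages. The sketch below is only back-of-the-envelope arithmetic: it treats "over 370K" and "500K" as approximate round numbers, and the variable and function names are illustrative rather than anything defined by the paper.

```python
# Approximate figures quoted in the abstract ("over 370K" is treated as a lower bound).
WEAVE_100K = {"samples": 100_000, "turns": 370_000, "images": 500_000}
WEAVEBENCH = {"tasks": 100, "images": 480}


def per_unit(total: int, units: int) -> float:
    """Average count per unit, rounded to one decimal place."""
    return round(total / units, 1)


# Hypothetical helper names; these averages are derived here, not reported by the authors.
turns_per_sample = per_unit(WEAVE_100K["turns"], WEAVE_100K["samples"])
images_per_sample = per_unit(WEAVE_100K["images"], WEAVE_100K["samples"])
images_per_task = per_unit(WEAVEBENCH["images"], WEAVEBENCH["tasks"])

print(turns_per_sample, images_per_sample, images_per_task)
```

Under these assumptions, a WEAVE-100k sample averages at least 3.7 dialogue turns and about 5 images, and a WEAVEBench task averages 4.8 images, which is consistent with the multi-turn, multi-image framing of the suite.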


Code & Models

Models

WeiChow/Bagel-weave (model, 13 downloads, 2 likes)

Datasets

WeiChow/WEAVE (dataset, 5.5k downloads)


Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection