VINCIE: Unlocking In-context Image Editing from Video

Leigang Qu; Feng Cheng; Ziyan Yang; Qi Zhao; Shanchuan Lin; Yichun Shi; Yicong Li; Wenjie Wang; Tat-Seng Chua; Lu Jiang

arXiv:2506.10941·cs.CV·March 3, 2026

TL;DR

VINCIE introduces a video-trained diffusion transformer for in-context image editing, enabling high-quality, multi-turn editing and concept composition without task-specific pipelines.

Contribution

The paper presents a scalable video annotation method and a novel transformer model trained on proxy tasks, advancing in-context image editing from videos.

Findings

1. Achieves state-of-the-art results on multi-turn editing benchmarks.
2. Demonstrates strong multi-concept composition and story generation abilities.
3. Operates effectively without task-specific training pipelines.

Abstract

In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art…
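The abstract names a block-causal diffusion transformer, but this page does not spell out the attention pattern. As a rough illustration only, the sketch below assumes that "block-causal" means full attention inside each text or image block and causal attention across blocks, so a block sees itself and everything before it in the interleaved sequence. The function name `block_causal_mask` and the block sizes are hypothetical and are not taken from the paper or its code release.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption (not confirmed by the source): tokens attend bidirectionally
    within their own block and causally to all earlier blocks.
    """
    # Map every token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # A token sees its own block fully and every earlier block, never a later one.
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical interleaved sequence: text (2 tokens), image (3 tokens), text (1 token).
mask = block_causal_mask([2, 3, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query token; an `x` marks a key position it may attend to. Under this assumption an image block can condition on all earlier text and images in the session, which is the property multi-turn editing relies on.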

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The central idea of using native video data to learn in-context editing makes sense. It cleverly reframes the problem, identifying videos as a natural, abundant, and scalable source of "edit" data (i.e., state transitions) that intrinsically contains the visual consistency and temporal context missing from static image-pair datasets.
2. The data construction pipeline is sound, combining a VLM for high-level semantic transition annotation with grounding models (GroundingDINO + SAM2) for prec

Weaknesses

1. The paper defines in-context editing as modifying an image based on a "contextual sequence comprising text and previously generated images." This framing is functionally equivalent to multi-turn editing, where the primary role of the context is to serve as the input for the next step. This definition is quite narrow and overlooks a more common interpretation of "in-context" for image models: the ability to use one or more reference images in the context to provide new subjects, styles, or con

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Well-structured with clear figures (Figures 1-3 effectively convey main ideas)
2. Comprehensive appendix with implementation details
3. Three proxy tasks (NIP, CSP, NSP) provide complementary learning signals
4. Comprehensive ablation studies validate design decisions

Weaknesses

1. The claim of learning from "native videos" is misleading: the method requires extensive preprocessing with VLMs, GroundingDINO, and SAM2. This is annotation-based learning, not purely native video learning
2. Evaluation relies heavily on GPT-4o as judge, which may introduce bias despite correlation analysis (Table 7)
3. Core technical novelty is limited: DiT architecture, segmentation prediction, and in-context learning are established techniques. The main contribution is the data construction

Reviewer 03 · Rating 8 · Confidence 5

Strengths

S1) This paper proposed a new perspective in constructing session-wise data with long, interleaved image-text context from native videos while prior works that used video for editing were mainly for constructing pair-wise data. This highlights the originality.
S2) At first glance the designed approach is best suited for videos with a static viewpoint. When the camera moves, the region proposal and segmentation process can break down, leading to inaccurate edits or mask predictions. Large backgr

Weaknesses

W1) I find it surprising that the paper identifies MagicBrush (NeurIPS 2023) as the last major benchmark for multi-turn image editing, with no subsequent progress. Based on my recollection, ImgEdit-Bench (May 2025) also introduced evaluation protocols on multi-turn image editing. Including results on the ImgEdit-Bench benchmark would make the empirical comparison much more convincing.

Code & Models

Models

🤗
ByteDance-Seed/VINCIE-3B
model· 39 dl· ♡ 42

Datasets

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Diffusion