VINCIE: Unlocking In-context Image Editing from Video
Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang

TL;DR
VINCIE introduces a video-trained diffusion transformer for in-context image editing, enabling high-quality, multi-turn editing and concept composition without task-specific pipelines.
Contribution
The paper presents a scalable method for annotating videos as interleaved multimodal sequences and a block-causal diffusion transformer trained on three proxy tasks, so that in-context image editing can be learned directly from videos.
Findings
Achieves state-of-the-art results on multi-turn editing benchmarks.
Demonstrates strong multi-concept composition and story generation abilities.
Operates effectively without task-specific training pipelines.
Abstract
In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art…
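The "block-causal" attention pattern named in the abstract can be illustrated with a small mask-building sketch. This is a minimal illustration and not the authors' implementation: it assumes each text or image segment of the interleaved sequence forms one block, that tokens attend bidirectionally within their own block, and that they attend causally to all earlier blocks. The function name `block_causal_mask` and the token counts are hypothetical.

```python
import numpy as np

def block_causal_mask(block_lengths):
    """Return a boolean (T, T) mask; mask[i, j] is True when token i may attend to token j."""
    # Assign each token the index of the block (segment) it belongs to.
    block_ids = np.repeat(np.arange(len(block_lengths)), block_lengths)
    # Token i may see token j if j's block is the same as, or earlier than, i's block.
    return block_ids[:, None] >= block_ids[None, :]

# Hypothetical interleaved sequence: text (2 tokens), image (3 tokens), text (1 token).
mask = block_causal_mask([2, 3, 1])
print(mask.astype(int))
```

Under these assumptions, the tokens of an image see each other and every earlier segment, but never a later one, which is what lets a single sequence supervise each turn of a multi-turn edit.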
Peer Reviews
Decision: ICLR 2026 Poster
1. The central idea of using native video data to learn in-context editing makes sense. It cleverly reframes the problem, identifying videos as a natural, abundant, and scalable source of "edit" data (i.e., state transitions) that intrinsically contains the visual consistency and temporal context missing from static image-pair datasets.
2. The data construction pipeline is sound, combining a VLM for high-level semantic transition annotation with grounding models (GroundingDINO + SAM2) for prec
1. The paper defines in-context editing as modifying an image based on a "contextual sequence comprising text and previously generated images." This framing is functionally equivalent to multi-turn editing, where the primary role of the context is to serve as the input for the next step. This definition is quite narrow and overlooks a more common interpretation of "in-context" for image models: the ability to use one or more reference images in the context to provide new subjects, styles, or con
1. Well-structured with clear figures (Figures 1-3 effectively convey main ideas)
2. Comprehensive appendix with implementation details
3. Three proxy tasks (NIP, CSP, NSP) provide complementary learning signals
4. Comprehensive ablation studies validate design decisions
1. The claim of learning from "native videos" is misleading: the method requires extensive preprocessing with VLMs, GroundingDINO, and SAM2. This is annotation-based learning, not purely native video learning
2. Evaluation relies heavily on GPT-4o as judge, which may introduce bias despite correlation analysis (Table 7)
3. Core technical novelty is limited: DiT architecture, segmentation prediction, and in-context learning are established techniques. The main contribution is the data construction
S1) This paper proposed a new perspective in constructing session-wise data with long, interleaved image-text context from native videos while prior works that used video for editing were mainly for constructing pair-wise data. This highlights the originality.
S2) At first glance the designed approach is best suited for videos with a static viewpoint. When the camera moves, the region proposal and segmentation process can break down, leading to inaccurate edits or mask predictions. Large backgr
W1) I find it surprising that the paper identifies MagicBrush (NeurIPS 2023) as the last major benchmark for multi-turn image editing, with no subsequent progress. Based on my recollection, ImgEdit-Bench (May 2025) also introduced evaluation protocols on multi-turn image editing. Including results on the ImgEdit-Bench benchmark would make the empirical comparison much more convincing.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: Diffusion
