VIBE: Visual Instruction Based Editor

Grigorii Alekseenko; Aleksandr Gordeev; Irina Tolstykh; Bulat Suleimanov; Vladimir Dokholyan; Georgii Fedorov; Sergey Yakubson; Aleksandra Tsybina; Mikhail Chernyshov; Maksim Kuprashevich

arXiv:2601.02242·cs.CV·January 6, 2026

TL;DR

VIBE is a compact, efficient instruction-based image editing pipeline in which a 2B-parameter model guides a 1.6B-parameter diffusion model, achieving high-quality edits with strict source preservation at low computational cost.

Contribution

The paper presents a novel, lightweight image editing system that matches or surpasses larger models in quality while significantly reducing inference costs and resource requirements.

Findings

01  Matches or exceeds performance of larger models on benchmarks

02  Operates within 24 GB GPU memory and 4 seconds per image

03  Maintains high quality and source consistency at 2K resolution
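
To put the 24 GB figure in context, here is a rough back-of-the-envelope sketch of how much memory the model weights alone might occupy. It uses the 2B and 1.6B parameter counts from the summary and assumes 2 bytes per parameter (a 16-bit format such as BF16 or FP16), which this page does not state. The function name and the precision are illustrative assumptions, and the estimate ignores activations, latents, and any auxiliary components, so it is not a measurement of the actual pipeline.

```python
# Rough weight-memory estimate for the two models named in the summary.
# ASSUMPTION: 2 bytes per parameter (16-bit weights); the page does not state the precision.
BYTES_PER_PARAM = 2


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


guide_params = 2.0e9      # 2B-parameter guiding model (from the summary)
diffusion_params = 1.6e9  # 1.6B-parameter diffusion model (from the summary)
budget_gb = 24.0          # GPU memory budget quoted in the findings

weights_gb = weight_memory_gb(guide_params + diffusion_params)
headroom_gb = budget_gb - weights_gb  # left for activations and everything else

print(f"weights: {weights_gb:.1f} GB, headroom within budget: {headroom_gb:.1f} GB")
```

Under these assumptions the weights take roughly 7.2 GB, leaving most of the 24 GB budget for activations and other components at high resolution.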

Abstract

Instruction-based image editing is among the fastest developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target…

Taxonomy

Topics: Cell Image Analysis Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques