VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

Zhiwen Li; Zhongjie Duan; Jinyan Ye; Cen Chen; Daoyuan Chen; Yaliang Li; Yingda Chen

arXiv:2602.03210·cs.CV·February 4, 2026

Open Access

TL;DR

VIRAL introduces a unified framework for visual in-context learning by leveraging visual analogy and a diffusion transformer, enabling versatile visual reasoning and editing across diverse tasks.

Contribution

The paper presents VIRAL, a novel approach that adapts a diffusion transformer with role-aware conditioning and Mixture-of-Experts LoRA for effective visual in-context learning.

Findings

01. VIRAL outperforms existing methods on multiple visual tasks.

02. The framework effectively handles open-domain editing.

03. A large-scale dataset was curated to support diverse visual reasoning tasks.

Abstract

Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose VIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_{s} : x_{t} :: x_{q} : y_{q}$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
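
To make the Mixture-of-Experts LoRA idea in the abstract more concrete, the following is a minimal sketch of one common way such a layer can be built: a frozen base weight plus several low-rank adapters whose updates are blended by a learned soft router. It is not taken from the VIRAL code. The class name `MoELoRALinear`, the soft (dense) gating, the zero initialization of the up-projections, and all sizes are assumptions chosen for illustration, and the paper's actual routing scheme may differ.

```python
import math
import random


def matvec(x, M):
    """Multiply a row vector x (length d) by a matrix M (d x k)."""
    return [sum(xi * row[j] for xi, row in zip(x, M)) for j in range(len(M[0]))]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rng, rows, cols, std=0.02):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


class MoELoRALinear:
    """Hypothetical layer: frozen linear map plus softly gated LoRA experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        # Frozen base weight (would come from the pre-trained model).
        self.W = rand_matrix(rng, d_in, d_out)
        # One low-rank pair (A_e, B_e) per expert; only these would be trained.
        self.A = [rand_matrix(rng, d_in, rank) for _ in range(num_experts)]
        # Zero-initialized up-projections, so the layer starts equal to the base.
        self.B = [[[0.0] * d_out for _ in range(rank)] for _ in range(num_experts)]
        # Router that maps a token to one gate logit per expert (assumed design).
        self.router = rand_matrix(rng, d_in, num_experts)
        self.scale = alpha / rank

    def gates(self, x):
        return softmax(matvec(x, self.router))

    def forward(self, x):
        # Base output plus the gate-weighted sum of each expert's low-rank update.
        out = matvec(x, self.W)
        for g, A, B in zip(self.gates(x), self.A, self.B):
            delta = matvec(matvec(x, A), B)
            out = [o + self.scale * g * d for o, d in zip(out, delta)]
        return out


layer = MoELoRALinear(d_in=4, d_out=3)
token = [1.0, -0.5, 0.25, 2.0]
output = layer.forward(token)
```

Because each expert owns its own low-rank pair, gradients from dissimilar tasks can be routed toward different experts instead of all being forced through one shared adapter, which is the intuition behind using a mixture to reduce interference.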

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning