VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers
Zhiwen Li, Zhongjie Duan, Jinyan Ye, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

TL;DR
VIRAL introduces a unified framework for visual in-context learning by leveraging visual analogy and a diffusion transformer, enabling versatile visual reasoning and editing across diverse tasks.
Contribution
The paper presents VIRAL, a novel approach that adapts a diffusion transformer with role-aware conditioning and Mixture-of-Experts LoRA for effective visual in-context learning.
Findings
VIRAL outperforms existing methods on multiple visual tasks.
The framework effectively handles open-domain editing.
A large-scale dataset was curated to support diverse visual reasoning tasks.
Abstract
Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose VIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy. We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
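The Mixture-of-Experts LoRA idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `MoELoRALinear`, the dense softmax router, the zero-initialized up-projections, and all sizes are assumptions chosen for illustration, and the abstract does not say how VIRAL routes between experts. The sketch only shows the general pattern of a frozen weight plus several gated low-rank updates, where separate experts give different tasks separate parameters to update.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    exps = [math.exp(t - top) for t in z]
    total = sum(exps)
    return [e / total for e in exps]


class MoELoRALinear:
    """Frozen linear layer plus a softly gated mixture of LoRA experts.

    Hypothetical illustration: names, routing scheme, and sizes are assumptions.
    """

    def __init__(self, d_in, d_out, num_experts=4, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.scale = alpha / rank
        self.weight = rand(d_out, d_in)        # stands in for the frozen pre-trained weight
        self.router = rand(num_experts, d_in)  # gate that scores each expert per input
        self.down = [rand(rank, d_in) for _ in range(num_experts)]  # A_e: d_in -> rank
        # B_e: rank -> d_out, zero-initialized so the layer starts equal to the base layer
        self.up = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]

    def gate(self, x):
        """Return one mixing weight per expert; the weights sum to 1."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        """Compute W x + sum_e g_e(x) * scale * B_e A_e x."""
        y = matvec(self.weight, x)
        for g, a, b in zip(self.gate(x), self.down, self.up):
            delta = matvec(b, matvec(a, x))
            y = [yi + g * self.scale * di for yi, di in zip(y, delta)]
        return y


layer = MoELoRALinear(d_in=3, d_out=2)
x = [1.0, -2.0, 0.5]
print(layer.gate(x))     # per-expert mixing weights for this input
print(layer.forward(x))  # equals the frozen layer's output until the experts are trained
```

In a real training setup only the router and the expert matrices would receive gradients, while the base weight stays fixed, which matches the abstract's description of adapting a frozen DiT.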
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
