Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach
Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie

TL;DR
DiTFuse is a unified, instruction-driven diffusion-transformer framework that enables semantics-aware, controllable image fusion across multiple modalities and tasks, with improved robustness and generalization.
Contribution
The paper introduces DiTFuse, a novel end-to-end model that jointly encodes images and natural language instructions for flexible, high-level semantic image fusion.
Findings
Outperforms existing methods on infrared-visible image fusion (IVIF), multi-focus fusion (MFF), and multi-exposure fusion (MEF) benchmarks.
Supports multi-level user control and zero-shot generalization.
Achieves sharper textures and better semantic retention.
Abstract
Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse…
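To make the idea of jointly encoding two images and an instruction more concrete, here is a minimal NumPy sketch of one common way a transformer can consume such inputs: each image latent is cut into patch tokens, the instruction is mapped to token embeddings, and everything is concatenated into a single sequence. This is an illustration under stated assumptions, not the authors' architecture. The function names (`patchify`, `build_joint_sequence`), the patch size, the embedding width, the vocabulary size, the token order, and the random weights standing in for learned parameters are all hypothetical.

```python
import numpy as np


def patchify(latent, patch):
    """Split a (C, H, W) latent map into (N, C*patch*patch) patch tokens."""
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # -> (H/p, W/p, C, p, p)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)


def build_joint_sequence(latent_a, latent_b, instruction_ids, rng,
                         dim=64, patch=2, vocab=1000):
    """Build one token sequence from two image latents and instruction tokens.

    Hypothetical layout: [instruction tokens | image A tokens | image B tokens].
    Random matrices stand in for learned projection and embedding weights.
    """
    c = latent_a.shape[0]
    w_img = rng.standard_normal((c * patch * patch, dim)) * 0.02  # patch projection
    w_txt = rng.standard_normal((vocab, dim)) * 0.02              # text embedding table
    type_emb = rng.standard_normal((3, dim)) * 0.02               # marks A / B / text

    tok_a = patchify(latent_a, patch) @ w_img + type_emb[0]
    tok_b = patchify(latent_b, patch) @ w_img + type_emb[1]
    tok_t = w_txt[np.asarray(instruction_ids)] + type_emb[2]

    # A single sequence lets self-attention relate every token to every other.
    return np.concatenate([tok_t, tok_a, tok_b], axis=0)


rng = np.random.default_rng(0)
latent_a = rng.standard_normal((4, 8, 8))  # stand-in latent for the first image
latent_b = rng.standard_normal((4, 8, 8))  # stand-in latent for the second image
sequence = build_joint_sequence(latent_a, latent_b, [1, 2, 3, 4, 5], rng)
print(sequence.shape)
```

With 8x8 latents and a patch size of 2, each image contributes 16 tokens, so the five instruction tokens plus both images give a sequence of 37 tokens. Placing all tokens in one sequence is what would allow attention layers to mix image and instruction information; how DiTFuse actually arranges and conditions its tokens is not stated in the text above.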
Taxonomy
Topics: Advanced Image Fusion Techniques · Image Enhancement Techniques · Remote-Sensing Image Classification
