Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li

TL;DR
This paper introduces a multi-grained text-guided image fusion method that leverages hierarchical textual descriptions and saliency-driven modules to improve fusion quality in challenging multi-exposure and multi-focus scenarios.
Contribution
The paper proposes a multi-grained textual guidance framework that combines hierarchical cross-modal modulation with saliency enrichment to improve image fusion performance.
Findings
Outperforms previous methods on multi-exposure fusion tasks.
Effectively aligns visual and textual features at multiple granularities.
Enhances fusion quality with dense semantic content augmentation.
Abstract
Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical…
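To make the fusion task itself concrete, here is a minimal sketch of a classical, text-free baseline for the two-input exposure case: each pixel is averaged with weights that favor intensities near mid-gray. This is not the MTIF method described in the abstract; the function names, the `sigma` value, and the toy arrays are assumptions chosen only for illustration.

```python
import numpy as np

def well_exposedness(img, sigma=0.2):
    """Gaussian weight that peaks where intensity is near mid-gray (0.5)."""
    return np.exp(-((img - 0.5) ** 2) / (2.0 * sigma ** 2))

def fuse_exposures(under, over, eps=1e-8):
    """Per-pixel weighted average of two grayscale images with values in [0, 1].

    Illustrative baseline only (hypothetical helper, not from the paper).
    """
    w_under = well_exposedness(under)
    w_over = well_exposedness(over)
    total = w_under + w_over + eps  # eps guards against division by zero
    return (w_under * under + w_over * over) / total

# Toy 2x2 inputs (made-up values): one under-exposed, one over-exposed.
under = np.array([[0.05, 0.10], [0.45, 0.02]])
over = np.array([[0.55, 0.60], [0.98, 0.50]])
fused = fuse_exposures(under, over)
```

Each fused pixel is pulled toward whichever input is better exposed at that location. A purely pixel-level rule like this has no notion of structure or semantics, which is the kind of gap that textual guidance is meant to address.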
Taxonomy
Topics: Advanced Image Fusion Techniques · Image Enhancement Techniques · Visual Attention and Saliency Detection
