MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement

Yingying Wang; Xuanhua He; Chen Wu; Jialing Huang; Suiyun Zhang; Rui Liu; Xinghao Ding; Haoxuan Che

arXiv:2512.15261·cs.CV·December 18, 2025

TL;DR

MMMamba is a cross-modal fusion framework for pan-sharpening that also supports zero-shot image super-resolution. It relies on in-context conditioning and a novel interleaved scanning mechanism to fuse the two modalities efficiently.

Contribution

The paper introduces MMMamba, an in-context fusion framework with linear computational complexity and a multimodal interleaved scanning mechanism, targeting pan-sharpening and zero-shot image enhancement.
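This summary does not describe how the interleaved mechanism orders its tokens. As a rough illustration only, the sketch below assumes the simplest reading: PAN and MS token sequences of equal length are alternated into one sequence so that a linear-time sequence model sees both modalities in a single pass. The function names `interleave_tokens` and `deinterleave_tokens` are hypothetical and do not come from the paper.

```python
import numpy as np

def interleave_tokens(pan_tokens, ms_tokens):
    """Alternate two (N, D) token arrays into one (2N, D) sequence.

    Assumed ordering (not taken from the paper): pan_0, ms_0, pan_1, ms_1, ...
    """
    if pan_tokens.shape != ms_tokens.shape:
        raise ValueError("both modalities must have the same token shape")
    n, d = pan_tokens.shape
    out = np.empty((2 * n, d), dtype=np.result_type(pan_tokens, ms_tokens))
    out[0::2] = pan_tokens  # even positions hold PAN tokens
    out[1::2] = ms_tokens   # odd positions hold MS tokens
    return out

def deinterleave_tokens(sequence):
    """Split an interleaved (2N, D) sequence back into its two modalities."""
    return sequence[0::2], sequence[1::2]

# Toy example: 3 tokens per modality, 2 features per token.
pan_tokens = np.arange(6).reshape(3, 2)
ms_tokens = pan_tokens + 100
sequence = interleave_tokens(pan_tokens, ms_tokens)
pan_back, ms_back = deinterleave_tokens(sequence)
```

Because the merged sequence has length 2N, any scan whose cost is linear in sequence length stays linear in the number of input tokens, which is consistent with the linear-complexity claim above.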

Findings

01 Outperforms state-of-the-art methods on multiple benchmarks.

02 Supports zero-shot image super-resolution.

03 Maintains linear computational complexity.

Abstract

Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs…
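For reference, the channel-wise concatenation baseline that the abstract attributes to traditional CNN-based methods can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the nearest-neighbour upsampling, the scale factor of 4, and the 4-band MS image are all assumptions chosen for the example.

```python
import numpy as np

def upsample_nearest(ms, scale):
    """Nearest-neighbour upsampling of an (H, W, C) array by an integer factor."""
    return np.repeat(np.repeat(ms, scale, axis=0), scale, axis=1)

def concat_pan_ms(pan, ms, scale=4):
    """Build a CNN input by stacking PAN onto the upsampled MS bands.

    pan: (H, W) high-resolution panchromatic image
    ms:  (H / scale, W / scale, C) low-resolution multispectral image
    Returns an (H, W, C + 1) array; a fixed convolution stack would follow.
    """
    ms_up = upsample_nearest(ms, scale)
    if ms_up.shape[:2] != pan.shape[:2]:
        raise ValueError("PAN and upsampled MS must share spatial size")
    return np.concatenate([ms_up, pan[..., None]], axis=-1)

# Toy inputs: a 64x64 PAN image and a 16x16 MS image with 4 bands.
pan = np.random.rand(64, 64)
ms = np.random.rand(16, 16, 4)
fused_input = concat_pan_ms(pan, ms)
```

Here the two modalities interact only through whatever convolution weights are applied to the stacked channels, which is the limited adaptability the abstract points to.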

Taxonomy

Topics: Advanced Image Fusion Techniques · Image Enhancement Techniques · Advanced Image Processing Techniques