Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

Jiale Liu; Haoming Zhou; Yishu Liu; Bingzhi Chen; Yuncheng Jiang

arXiv:2511.07710 · cs.CV · December 2, 2025

Open Access

TL;DR

This paper introduces a fine-grained image-text alignment method that combines significance-aware modeling with region-level uncertainty modeling to improve robustness and interpretability in multimodal tasks.

Contribution

It proposes a unified framework with significance-aware and granularity-aware modeling, plus region-level uncertainty, to address limitations of existing cross-modal alignment methods.

Findings

01. Achieves state-of-the-art results on the Flickr30K and MS-COCO datasets.

02. Enhances robustness and interpretability of fine-grained alignment.

03. Improves generalization in complex scenes.

Abstract

Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware…
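
To make the region-word correspondence mentioned in the abstract concrete, here is a minimal sketch of a generic fine-grained matching score: cosine similarity between every region vector and every word vector, followed by a max over regions for each word. This is an illustrative baseline, not the method proposed in the paper; the function names (`cosine`, `region_word_similarity`, `image_text_score`) and the toy 2-D features are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def region_word_similarity(regions, words):
    """Return a matrix sim[i][j] = cosine(region i, word j)."""
    return [[cosine(r, w) for w in words] for r in regions]

def image_text_score(regions, words):
    """Hypothetical aggregation: each word picks its best-matching region,
    and the image-text score is the mean of those best matches."""
    sim = region_word_similarity(regions, words)
    best_per_word = [
        max(sim[i][j] for i in range(len(regions)))
        for j in range(len(words))
    ]
    return sum(best_per_word) / len(best_per_word)

# Toy features (assumed for illustration): 3 regions and 2 words in a 2-D space.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]

sim = region_word_similarity(regions, words)
score = image_text_score(regions, words)
```

A hard max like this assigns each word to exactly one region, which is the kind of simplification the abstract criticizes: it cannot express one-to-many or many-to-one region-word correspondences, nor how confident a given match is.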

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling