Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling
Jiale Liu, Haoming Zhou, Yishu Liu, Bingzhi Chen, Yuncheng Jiang

TL;DR
This paper introduces a novel fine-grained image-text alignment method that uses significance-aware and uncertainty modeling to improve robustness and interpretability in multimodal tasks.
Contribution
It proposes a unified framework with significance-aware and granularity-aware modeling, plus region-level uncertainty, to address limitations of existing cross-modal alignment methods.
Findings
Achieves state-of-the-art results on Flickr30K and MS-COCO datasets.
Enhances robustness and interpretability of fine-grained alignment.
Improves generalization in complex scenes.
Abstract
Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
