Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability

Zhiyu Zhu; Zhibo Jin; Jiayu Zhang; Nan Yang; Jiahao Huang; Jianlong Zhou; Fang Chen

arXiv:2502.14889 · cs.CV · February 24, 2025

TL;DR

This paper introduces the Narrowing Information Bottleneck Theory, a framework for interpreting multimodal image-text models such as CLIP. The method satisfies attribution axioms and reports gains in image interpretability, text interpretability, and processing speed.

Contribution

The paper proposes a theoretical framework that redefines traditional bottleneck methods to improve interpretability in multimodal models, addressing the strong assumptions and intrinsic randomness that limit previous bottleneck approaches.

Findings

01. Image interpretability improved by 9%

02. Text interpretability improved by 58.83%

03. Processing speed increased by 63.95%

Abstract

The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP's interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the…
