Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability
Zhiyu Zhu, Zhibo Jin, Jiayu Zhang, Nan Yang, Jiahao Huang, Jianlong Zhou, Fang Chen

TL;DR
This paper introduces the Narrowing Information Bottleneck Theory, a new framework that improves the interpretability of multimodal image-text models like CLIP by satisfying attribution axioms and demonstrating significant performance gains.
Contribution
It proposes a novel theoretical framework that redefines traditional bottleneck methods to enhance interpretability in multimodal models, addressing limitations of previous approaches.
Findings
Image interpretability improved by 9%
Text interpretability improved by 58.83%
Processing speed increased by 63.95%
Abstract
The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP's interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the…
