ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map

Yilin Ye; Shishi Xiao; Xingchen Zeng; Wei Zeng

arXiv:2407.12315 · cs.CV · October 29, 2024 · 1 citation

TL;DR

ModalChorus is an interactive system that visualizes and aligns multi-modal embeddings, improving understanding and correction of cross-modal feature misalignments in vision-language models like CLIP.

Contribution

The paper introduces Modal Fusion Map (MFM), a novel parametric dimensionality reduction technique for visualizing multi-modal embeddings, together with an interactive alignment process for correcting them.

Findings

01

MFM outperforms t-SNE and MDS in visualizing cross-modal features.

02

ModalChorus enables intuitive discovery of embedding misalignments.

03

The system improves downstream tasks such as zero-shot classification and cross-modal retrieval.
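
For readers unfamiliar with the downstream task named above, the following is a minimal sketch of zero-shot classification over a shared text-image embedding space: each image is assigned the class whose text embedding has the highest cosine similarity. It is not taken from the paper or its repository. The function names (`l2_normalize`, `zero_shot_classify`) and the toy 3-dimensional vectors are assumptions made for illustration; real CLIP embeddings would come from a trained model.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, text_emb):
    # Hypothetical helper: rows of image_emb are image embeddings,
    # rows of text_emb are embeddings of one text prompt per class.
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1), sims

# Toy data (made up): two class prompts and two images in a 3-D space.
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
image_emb = np.array([[0.9, 0.1, 0.0],
                      [0.2, 0.8, 0.1]])

labels, sims = zero_shot_classify(image_emb, text_emb)
```

Because the prediction depends only on relative similarities, an image embedding that drifts toward the wrong text embedding changes the predicted label, which is the kind of cross-modal misalignment the system is meant to expose.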

Abstract

Multi-modal embeddings form the foundation for vision-language models, such as CLIP embeddings, the most widely used text-image embeddings. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g.,…
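
The abstract describes MFM as combining metric and nonmetric objectives. The sketch below illustrates that general idea only and is not the authors' MFM loss: it pairs an MDS-style stress term (metric, sensitive to distance values) with a hinge penalty on violated distance orderings (nonmetric, sensitive only to ranks). All function names, the `alpha` weight, and the `margin` parameter are assumptions introduced for illustration.

```python
import numpy as np

def pairwise_dist(x):
    # Euclidean distance matrix between all rows of x.
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def metric_stress(high, low):
    # Metric term: mean squared gap between high-D and low-D distances.
    return float(np.mean((pairwise_dist(high) - pairwise_dist(low)) ** 2))

def nonmetric_rank_loss(high, low, margin=0.0):
    # Nonmetric term: for each anchor i and pair (j, k) where j is closer
    # than k in the high-D space, penalize low-D layouts that reverse it.
    dh, dl = pairwise_dist(high), pairwise_dist(low)
    closer = dh[:, :, None] < dh[:, None, :]          # [i, j, k]
    violation = np.maximum(0.0, dl[:, :, None] - dl[:, None, :] + margin)
    return float((violation * closer).sum() / max(closer.sum(), 1))

def combined_loss(high, low, alpha=0.5):
    # Hypothetical weighting of the two terms; alpha is an assumed knob.
    return (alpha * metric_stress(high, low)
            + (1.0 - alpha) * nonmetric_rank_loss(high, low))
```

With `margin=0.0`, uniformly rescaling a layout leaves the rank term at zero while the stress term grows, which shows why the two objectives carry different information about a projection.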

Code & Models

Repositories

yilinye/modal-fusion-map
Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection

Methods: Contrastive Language-Image Pre-training