SCoCCA: Multi-modal Sparse Concept Decomposition via Canonical Correlation Analysis
Ehud Gordon, Meir Yossef Levi, Guy Gilboa

TL;DR
SCoCCA introduces a novel multi-modal concept decomposition framework using Canonical Correlation Analysis to improve interpretability and disentanglement of vision-language models, achieving state-of-the-art results in concept discovery.
Contribution
The paper proposes Sparse Concept CCA (SCoCCA), a method that aligns cross-modal embeddings and enforces sparsity to improve interpretability and concept disentanglement.
Findings
Achieves state-of-the-art results on concept discovery tasks
Enhances interpretability through sparse, discriminative concepts
Improves concept ablation and semantic manipulation results
Abstract
Interpreting the internal reasoning of vision-language models is essential for deploying AI in safety-critical domains. Concept-based explainability provides a human-aligned lens by representing a model's behavior through semantically meaningful components. However, existing methods are largely restricted to images and overlook the cross-modal interactions. Text-image embeddings, such as those produced by CLIP, suffer from a modality gap, where visual and textual features follow distinct distributions, limiting interpretability. Canonical Correlation Analysis (CCA) offers a principled way to align features from different distributions, but has not been leveraged for multi-modal concept-level analysis. We show that the objectives of CCA and InfoNCE are closely related, such that optimizing CCA implicitly optimizes InfoNCE, providing a simple, training-free mechanism to enhance…
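To make the alignment step mentioned in the abstract concrete, the following is a minimal sketch of classical linear CCA applied to paired embeddings. It is not the authors' SCoCCA implementation and includes no sparsity term. The synthetic data stands in for image and text embeddings, and the function names (`fit_cca`, `mean_paired_cosine`), the ridge term `reg`, and all sizes are assumptions chosen for illustration.

```python
import numpy as np


def fit_cca(X, Y, k, reg=1e-3):
    """Classical linear CCA via whitening and SVD (illustrative, not SCoCCA).

    X: (n, dx) and Y: (n, dy) hold paired samples row by row.
    Returns the two means, the two projection matrices, and the top-k
    canonical correlations.
    """
    n = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my

    # Covariances, with a small ridge term (an assumption) for stability.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt[:k].T
    return mx, my, A, B, s[:k]


def mean_paired_cosine(P, Q):
    """Average cosine similarity between matching rows of P and Q."""
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    return float(np.mean(np.sum(P * Q, axis=1)))


# Synthetic stand-in for paired embeddings: both "modalities" share a
# 3-dimensional latent signal but use different linear maps, and one side
# has a constant offset to mimic a modality gap.
rng = np.random.default_rng(0)
n, d, latent = 500, 16, 3
Z = rng.normal(size=(n, latent))
X = Z @ rng.normal(size=(latent, d)) + 0.1 * rng.normal(size=(n, d)) + 3.0
Y = Z @ rng.normal(size=(latent, d)) + 0.1 * rng.normal(size=(n, d))

mx, my, A, B, corrs = fit_cca(X, Y, k=latent)

cos_before = mean_paired_cosine(X, Y)
cos_after = mean_paired_cosine((X - mx) @ A, (Y - my) @ B)

print("canonical correlations:", corrs)
print("mean paired cosine before:", cos_before)
print("mean paired cosine after: ", cos_after)
```

Because the fit is closed-form (two eigendecompositions and one SVD), no gradient-based training is involved, which is the sense in which plain CCA can be called training-free. How SCoCCA adds sparsity on top of this projection is not covered by the sketch.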
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
