CFM: Language-aligned Concept Foundation Model for Vision
Kai Wittenmayer, Sukrut Rao, Amin Parchami-Araghi, Bernt Schiele, Jonas Fischer

TL;DR
This paper introduces CFM, a vision foundation model that offers human-interpretable, spatially grounded concepts for better explainability across various vision tasks, while maintaining competitive performance.
Contribution
The work presents a novel model that provides fine-grained, spatially grounded concepts for vision tasks, enhancing interpretability without sacrificing accuracy.
Findings
CFM achieves competitive performance on classification, segmentation, and captioning.
Provides high-quality, fine-grained concept explanations.
Enables richer explanations through concept relationship analysis.
Abstract
Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet their learned representations remain opaque, which makes their decision-making difficult to interpret. Recent work decomposes these representations into human-interpretable concepts, but provides poor spatial grounding and is limited to image classification tasks. In this work, we propose CFM, a language-aligned concept foundation model for vision that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. When paired with a foundation model with strong semantic representations, we get explanations for any of its downstream tasks. Examining local co-occurrence dependencies of concepts allows us to define concept relationships through which we improve concept naming and obtain richer explanations. On benchmark data, we show that CFM provides…
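The abstract does not say how local co-occurrence between concepts is measured. As a rough illustration only, the sketch below assumes each concept comes with a spatial activation map over the image and scores a pair of concepts by the Jaccard overlap of the locations where both are active. The function name `local_cooccurrence`, the threshold, the Jaccard score, and the toy concept names are all assumptions made for this example and are not taken from the paper.

```python
from itertools import combinations


def local_cooccurrence(concept_maps, threshold=0.5):
    """Score concept pairs by how often they are active at the same location.

    concept_maps: dict mapping a concept name to a 2D list of activations,
    all with the same height and width (hypothetical input format).
    Returns a dict mapping (name_a, name_b) to a Jaccard overlap in [0, 1].
    """
    names = sorted(concept_maps)
    # Collect the (row, col) locations where each concept exceeds the threshold.
    active = {
        name: {
            (r, c)
            for r, row in enumerate(concept_maps[name])
            for c, value in enumerate(row)
            if value > threshold
        }
        for name in names
    }
    scores = {}
    for a, b in combinations(names, 2):
        union = active[a] | active[b]
        # Jaccard overlap is an assumed stand-in for the paper's dependency measure.
        scores[(a, b)] = len(active[a] & active[b]) / len(union) if union else 0.0
    return scores


# Toy 2x2 activation maps (made-up values, for illustration only).
maps = {
    "wheel": [[0.9, 0.8], [0.1, 0.0]],
    "car": [[0.9, 0.7], [0.6, 0.0]],
    "sky": [[0.0, 0.0], [0.0, 0.9]],
}
scores = local_cooccurrence(maps)
```

In this toy input, "wheel" and "car" share two of three active locations, while "sky" overlaps with neither. A relationship such as part-of could plausibly be read off asymmetric variants of such a score, but that step is speculation here and not something the abstract states.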
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
