DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification
Robert Zimmermann, Thomas Norrenbrock, Bodo Rosenhahn

TL;DR
DINO-QPM enhances the interpretability of visual foundation models like DINOv2 by converting complex features into human-understandable, spatially localized representations without retraining the backbone.
Contribution
This work introduces DINO-QPM, a lightweight adapter that makes frozen DINOv2 features globally interpretable while improving both explanation quality and classification accuracy.
Findings
DINO-QPM surpasses the DINOv2 linear probe in accuracy.
DINO-QPM provides spatial localization of features within images.
DINO-QPM outperforms other methods in interpretability metrics.
Abstract
Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model's features and therefore enable spatial localisation of DINO-QPM's globally interpretable…
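The link between average-pooling and spatial localisation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pooled_features`, the linear `projection`, the random stand-in patch embeddings, and all sizes are assumptions made for illustration. It only shows that when an image-level feature is the mean of per-patch activations, each feature comes with a patch-grid map whose average equals the feature value.

```python
import numpy as np

def pooled_features(patch_embeddings, projection):
    """Hypothetical helper: image-level features from patch embeddings.

    patch_embeddings: (num_patches, embed_dim), stand-in for frozen backbone output
    projection:       (embed_dim, num_features), stand-in for a linear feature layer
    """
    # One activation per patch and per feature.
    feature_maps = patch_embeddings @ projection      # (num_patches, num_features)
    # Average pooling over patches gives the image-level feature vector.
    features = feature_maps.mean(axis=0)              # (num_features,)
    return features, feature_maps

# Toy data (assumed sizes): a 4x4 patch grid, 8-dim embeddings, 3 features.
rng = np.random.default_rng(0)
grid = 4
patches = rng.normal(size=(grid * grid, 8))
proj = rng.normal(size=(8, 3))

features, maps = pooled_features(patches, proj)

# Each feature can be viewed as a map over the patch grid.
heatmap = maps[:, 0].reshape(grid, grid)
```

Because the mean is taken over patches, `heatmap.mean()` equals `features[0]`, so the map shows how each image region contributes to that feature. A CLS-token readout offers no such direct per-patch decomposition.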
