MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto

TL;DR
This paper introduces MedSAE, a sparse autoencoder approach applied to MedCLIP's latent space, enhancing interpretability of medical vision-language models for chest radiographs.
Contribution
It proposes an interpretability evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via MedGEMMA, and uses it to show that MedSAE neurons are more interpretable than raw MedCLIP features.
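To make the correlation and entropy components concrete, here is a minimal sketch of how such per-neuron scores could be computed. It is not the paper's definition: the function names, the top-k selection, and the simplification to one label per image (CheXpert itself is multi-label) are all assumptions made for illustration.

```python
import numpy as np

def label_entropy(activations, labels, top_k=50):
    """Entropy in bits of the label distribution among the top-k activating samples.

    Assumption: each sample carries a single integer label. A low value
    suggests the neuron fires mostly for one concept (more monosemantic).
    """
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels)
    top_k = min(top_k, len(activations))
    # Indices of the top_k largest activations.
    top_idx = np.argsort(activations)[::-1][:top_k]
    _, counts = np.unique(labels[top_idx], return_counts=True)
    p = counts / counts.sum()
    # Adding 0.0 normalizes a possible -0.0 result to 0.0.
    return float(-(p * np.log2(p)).sum() + 0.0)

def activation_label_correlation(activations, binary_label):
    """Pearson correlation between a neuron's activations and a binary label."""
    activations = np.asarray(activations, dtype=float)
    binary_label = np.asarray(binary_label, dtype=float)
    # The correlation is undefined when either input is constant.
    if activations.std() == 0 or binary_label.std() == 0:
        return 0.0
    return float(np.corrcoef(activations, binary_label)[0, 1])
```

Under these assumptions, a neuron whose top activations all share one label scores 0 bits of entropy, and a neuron that fires exactly when a finding is present has a correlation of 1 with that finding.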
Findings
MedSAE neurons show higher monosemanticity than raw MedCLIP features on CheXpert.
MedSAE neurons are also more interpretable than raw MedCLIP features.
Bridges high performance and transparency in medical AI.
Abstract
Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
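For readers unfamiliar with sparse autoencoders, the following is a minimal sketch of the general technique applied to a batch of embedding vectors. The abstract does not specify the MedSAE architecture, so the ReLU encoder, the L1 sparsity penalty, the dimensions, and every name below are assumptions, and random vectors stand in for real MedCLIP embeddings.

```python
import numpy as np

def init_sae(d_in, d_hidden, seed=0):
    """Randomly initialize a one-hidden-layer sparse autoencoder (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, 0.02, (d_hidden, d_in)),
        "b_dec": np.zeros(d_in),
    }

def sae_forward(params, x):
    """Encode embeddings into non-negative codes, then reconstruct them."""
    # ReLU keeps the hidden code non-negative; many units end up exactly zero.
    z = np.maximum(0.0, (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"])
    x_hat = z @ params["W_dec"] + params["b_dec"]
    return z, x_hat

def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    z, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return float(recon + l1_coeff * sparsity)

# Stand-in data: 8 random vectors of an assumed embedding size of 512.
x = np.random.default_rng(1).normal(size=(8, 512))
params = init_sae(d_in=512, d_hidden=2048)
z, x_hat = sae_forward(params, x)
```

The hidden layer is deliberately wider than the input (2048 versus 512 here, both illustrative) so that individual hidden units can specialize; the columns of `z` are the "neurons" whose interpretability an evaluation like the one described above would score.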
