MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

Riccardo Renzulli; Colas Lepoutre; Enrico Cassano; Marco Grangetto

arXiv:2510.26411·cs.AI·October 31, 2025

TL;DR

This paper introduces MedSAE, a sparse autoencoder approach applied to MedCLIP's latent space, enhancing interpretability of medical vision-language models for chest radiographs.

Contribution

The paper proposes an interpretability evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via MedGEMMA, and uses it to show that MedSAE neurons are more interpretable than raw MedCLIP features.

Findings

01. MedSAE neurons show higher monosemanticity.

02. Enhanced interpretability compared to raw MedCLIP features.

03. Bridges high performance and transparency in medical AI.

Abstract

Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
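
To make the abstract's two main ingredients concrete, here is a minimal NumPy sketch of a sparse autoencoder over embedding vectors and an entropy score per neuron. Everything in it is an assumption for illustration: the abstract does not specify the MedSAE architecture, loss, or exact entropy definition, so the ReLU encoder, L1 penalty, function names (`sae_forward`, `sae_loss`, `label_entropy`), and the choice to measure entropy over class labels are hypothetical stand-ins, not the authors' method.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Generic one-hidden-layer sparse autoencoder (assumed design, not the paper's).

    x: (n_samples, d) embeddings, e.g. image features from a frozen encoder.
    Returns non-negative codes z of shape (n_samples, m) and reconstructions x_hat.
    """
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU keeps codes non-negative
    x_hat = z @ W_dec + b_dec                         # linear decoder
    return z, x_hat


def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (l1_coeff is a hypothetical value)."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity


def label_entropy(activations, labels, n_labels):
    """Entropy in bits of each neuron's activation mass across class labels.

    Low entropy means a neuron fires mostly for one label, which is one
    possible proxy for monosemanticity. Neurons that never fire are
    assigned the uniform distribution, i.e. maximal entropy.
    """
    one_hot = np.eye(n_labels)[labels]                # (n_samples, n_labels)
    mass = activations.T @ one_hot                    # (n_neurons, n_labels)
    total = mass.sum(axis=1, keepdims=True)
    p = np.divide(mass, total,
                  out=np.full_like(mass, 1.0 / n_labels),
                  where=total > 0)
    log_p = np.log2(p, where=p > 0, out=np.zeros_like(p))
    return -(p * log_p).sum(axis=1)


# Usage on random data standing in for real embeddings and labels.
rng = np.random.default_rng(0)
d, m, n = 8, 16, 32
x = rng.normal(size=(n, d))
W_enc = rng.normal(scale=0.1, size=(d, m))
W_dec = rng.normal(scale=0.1, size=(m, d))
z, x_hat = sae_forward(x, W_enc, np.zeros(m), W_dec, np.zeros(d))
loss = sae_loss(x, z, x_hat)
entropies = label_entropy(z, rng.integers(0, 3, size=n), n_labels=3)
```

Under this definition, a neuron that activates for only one label scores 0 bits, and one whose activation mass splits evenly across two labels scores 1 bit, so lower values indicate more label-specific neurons. The same function can be applied to raw embedding dimensions (after making them non-negative) to compare against the autoencoder's codes.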
