A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models
Michail Mamalakis, Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio

TL;DR
This paper presents a unified interpretability framework for large language models in clinical neuroscience, enhancing stability and trustworthiness of explanations by extracting monosemantic features and reducing variability across attribution methods.
Contribution
The paper introduces a monosemantic attribution framework that combines attributional and mechanistic interpretability, tying explanations directly to model inputs and outputs to improve the stability and reliability of importance scores in clinical applications.
Findings
Produces stable importance scores for clinical LLMs
Reduces inter-method variability in explanations
Supports more trustworthy model interpretation in neurodegenerative disease diagnosis
Abstract
Interpretability remains a key challenge for deploying large language models (LLMs) in clinical settings such as Alzheimer's disease progression diagnosis, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of an LLM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features…
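To make "inter-method variability" concrete, the following is a minimal sketch of one way such disagreement could be quantified. It is not the paper's metric or objective: the normalization, the `inter_method_variability` function, the method names, and the scores are all hypothetical and chosen only for illustration. Each attribution method is assumed to assign one score per input feature; scores are normalized so methods are comparable, and variability is taken as the mean per-feature standard deviation across methods.

```python
from statistics import pstdev


def normalize(scores):
    """Scale absolute attribution scores so they sum to 1 (assumed convention)."""
    total = sum(abs(s) for s in scores)
    if total == 0:
        return [0.0 for _ in scores]
    return [abs(s) / total for s in scores]


def inter_method_variability(attributions):
    """Mean per-feature standard deviation across attribution methods.

    `attributions` maps a method name to a list of per-feature scores.
    This is an illustrative measure, not the one defined in the paper.
    """
    normalized = [normalize(scores) for scores in attributions.values()]
    n_features = len(normalized[0])
    if any(len(vec) != n_features for vec in normalized):
        raise ValueError("all methods must score the same number of features")
    per_feature = [
        pstdev([vec[i] for vec in normalized]) for i in range(n_features)
    ]
    return sum(per_feature) / n_features


# Hypothetical scores for three input features from two methods.
agreeing = {"method_a": [0.5, 0.3, 0.2], "method_b": [0.5, 0.3, 0.2]}
disagreeing = {"method_a": [1.0, 0.0, 0.0], "method_b": [0.0, 0.0, 1.0]}

low = inter_method_variability(agreeing)
high = inter_method_variability(disagreeing)
print(low, high)
```

Under this toy measure, methods that rank features identically score zero, and methods that disagree on which feature matters score higher. A framework that reduces inter-method variability would push a quantity of this kind toward zero.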
Taxonomy
Topics: Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
