A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models

Michail Mamalakis; Tiago Azevedo; Cristian Cosentino; Chiara D'Ercoli; Subati Abulikemu; Zhongtian Sun; Richard Bethlehem; Pietro Lio

arXiv:2601.17952·cs.CL·January 27, 2026

Open Access

TL;DR

This paper presents a unified interpretability framework for large language models in clinical neuroscience, enhancing stability and trustworthiness of explanations by extracting monosemantic features and reducing variability across attribution methods.

Contribution

It introduces a novel monosemantic attribution framework that aligns interpretability with model inputs and outputs, improving stability and reliability in clinical applications.

Findings

01 Produces stable importance scores for clinical LLMs
02 Reduces inter-method variability in explanations
03 Enhances trustworthy model interpretation in neurodegenerative diagnosis

Abstract

Interpretability remains a key challenge for deploying large language models (LLMs) in clinical settings such as Alzheimer's disease progression diagnosis, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of an LLM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features…
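
The abstract does not say how inter-method variability is measured, so the following is only an illustrative sketch of one plausible way to quantify disagreement between attribution methods, not the authors' metric. It assumes each method returns one importance score per input feature, normalizes the absolute scores so that methods are on a common scale, and reports the mean per-feature standard deviation across methods. The function names and the toy scores are hypothetical.

```python
import numpy as np


def normalize(scores):
    """Scale absolute attribution scores to sum to 1 so methods are comparable."""
    a = np.abs(np.asarray(scores, dtype=float))
    total = a.sum()
    return a / total if total > 0 else a


def inter_method_variability(attributions):
    """Mean per-feature standard deviation across methods.

    `attributions` is a sequence with one score vector per attribution
    method; all vectors must have the same length (one score per feature).
    This is an assumed, illustrative definition of variability.
    """
    normed = np.stack([normalize(a) for a in attributions])
    return float(normed.std(axis=0).mean())


# Toy scores for four input features from three hypothetical methods.
disagreeing = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.1, 0.8, 0.1, 0.0],
]
agreeing = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.7, 0.2, 0.05, 0.05],
]

print(inter_method_variability(disagreeing))
print(inter_method_variability(agreeing))
```

Under this definition, a lower value means the methods rank input features more consistently; a framework that reduces inter-method variability would be expected to move scores from the first regime toward the second.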

Taxonomy

Topics: Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education