Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

Shun Wang; Tyler Loakman; Youbo Lei; Yi Liu; Bohao Yang; Yuting Zhao; Dong Yang; Chenghua Lin

arXiv:2507.06427·cs.CL·July 10, 2025

TL;DR

This paper introduces a method using sparse autoencoders to interpret large language models, extracting meaningful features, identifying misunderstandings, and improving task performance through prompt reformulation.

Contribution

It applies a dictionary-learning approach based on sparse autoencoders to decompose LLM internals into interpretable features, and uses those features to improve downstream task accuracy.

Findings

01. Extracts monosemantic features from polysemantic neurons

02. Identifies internal misunderstandings in models

03. Improves downstream task performance with prompt reformulation

Abstract

Large Language Models (LLMs) are traditionally viewed as black-box algorithms, which reduces their trustworthiness and obscures potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of prompts with additional annotations to improve their interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.
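
The abstract does not spell out the autoencoder itself, so the following is only a minimal sketch of a generic sparse autoencoder of the kind commonly used for dictionary learning on LLM activations. It assumes a single ReLU encoder layer, a linear decoder whose rows act as dictionary features, and an L1 sparsity penalty. The names `init_sae`, `encode`, `decode`, `sae_loss`, and `l1_coeff` are hypothetical and are not taken from the paper. Plain Python lists are used to keep the example dependency-free; a practical implementation would use an array library and train the weights by gradient descent.

```python
import random


def init_sae(d_model, d_dict, seed=0):
    """Randomly initialise an untrained sparse autoencoder (illustrative only)."""
    rng = random.Random(seed)
    # One encoder row and one decoder row per dictionary feature.
    w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
    w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
    return {
        "W_enc": w_enc,
        "b_enc": [0.0] * d_dict,
        "W_dec": w_dec,
        "b_dec": [0.0] * d_model,
    }


def encode(params, x):
    """Map an activation vector x to non-negative feature activations."""
    centered = [xi - bi for xi, bi in zip(x, params["b_dec"])]
    return [
        max(0.0, sum(w * c for w, c in zip(row, centered)) + b)  # ReLU
        for row, b in zip(params["W_enc"], params["b_enc"])
    ]


def decode(params, f):
    """Reconstruct the activation as a weighted sum of dictionary rows."""
    x_hat = list(params["b_dec"])
    for fj, row in zip(f, params["W_dec"]):
        if fj != 0.0:  # inactive features contribute nothing
            for i, w in enumerate(row):
                x_hat[i] += fj * w
    return x_hat


def sae_loss(params, x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    f = encode(params, x)
    x_hat = decode(params, f)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity


# Usage: a toy 4-dimensional "activation" expanded into 16 candidate features.
params = init_sae(d_model=4, d_dict=16)
activation = [0.5, -1.0, 0.25, 2.0]
features = encode(params, activation)
active = [j for j, v in enumerate(features) if v > 0.0]
```

In this sketch the dictionary is wider than the input (16 features for 4 dimensions), and the L1 term pushes most feature activations toward zero during training. That overcomplete, sparse setup is what is generally meant by extracting monosemantic features from polysemantic neurons: each input is explained by a small set of active features such as `active` above.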

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning