ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs
Landon Butler, Abhineet Agarwal, Justin Singh Kang, Yigit Efe Erginbas, Bin Yu, Kannan Ramchandran

TL;DR
ProxySPEX is an inference-efficient method for interpreting large language models. It exploits the hierarchical structure of sparse feature interactions to use about 10x fewer model inferences than SPEX while reconstructing model outputs more faithfully than marginal attribution.
Contribution
It leverages the hierarchical nature of feature interactions in LLMs to enable scalable, accurate interaction attribution with fewer inferences than previous methods.
Findings
Reconstructs LLM outputs 20% more faithfully than marginal attribution
Uses 10x fewer inferences than SPEX
Effectively identifies influential feature interactions in high-dimensional data
Abstract
Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to larger numbers of features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical -- higher-order interactions are accompanied by their lower-order subsets -- which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first…
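Two ideas in the abstract can be made concrete with a short Python sketch. It is illustrative only and is not the ProxySPEX algorithm. The first function counts how many feature subsets an exhaustive approach would have to consider up to a given order. The second checks one strict reading of the hierarchy property, namely that every non-empty proper subset of an interaction is itself an interaction. The function names, the strict subset rule, and the example feature count of 100 are assumptions made for this illustration; the paper may define hierarchy more loosely.

```python
from itertools import combinations
from math import comb


def num_candidate_interactions(n_features, max_order):
    """Count feature subsets of size 1..max_order among n_features inputs."""
    return sum(comb(n_features, k) for k in range(1, max_order + 1))


def is_hierarchical(interactions):
    """Return True if every non-empty proper subset of each interaction
    is also present (a strict reading of hierarchy, assumed here)."""
    present = {frozenset(s) for s in interactions}
    for s in present:
        for k in range(1, len(s)):
            for sub in combinations(sorted(s), k):
                if frozenset(sub) not in present:
                    return False
    return True


# Hypothetical example: enumeration cost grows quickly with the order.
for order in (1, 2, 3, 4):
    print(order, num_candidate_interactions(100, order))

# {0, 1} is accompanied by {0} and {1}, so this set is hierarchical.
print(is_hierarchical([{0}, {1}, {0, 1}]))
# {0, 1} appears without its subsets, so this set is not.
print(is_hierarchical([{0, 1}]))
```

The rapid growth of the count illustrates why exhaustive enumeration scales poorly, while the hierarchy check shows the kind of structure that lets a search focus on supersets of interactions already found.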
