Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Matthew Kowal; Goncalo Paulo; Louis Jaburi; Tom Tseng; Lev E McKinney; Stefan Heimersheim; Aaron David Tucker; Adam Gleave; Kellin Pelrine

arXiv:2602.14869 · cs.AI · February 17, 2026

Open Access

TL;DR

This paper introduces Concept Influence, a scalable and interpretable method for training data attribution that leverages semantic directions within models to better understand and control model behavior.

Contribution

The paper proposes Concept Influence, a novel approach that uses semantic directions for data attribution, improving scalability and interpretability over traditional influence functions.

Findings

01. Concept Influence matches classical influence functions in performance.

02. Probe-based attribution methods are significantly faster.

03. The method improves understanding and control of model behavior.

Abstract

As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches such as influence functions are computationally expensive, and they attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address both problems, scalability and attributing influence to abstract behaviors, we leverage interpretable structures within the model during attribution. First, we introduce Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than to individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of…
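
As a rough illustration of what attributing to a semantic direction could look like, the sketch below scores toy training examples by projecting their activations onto a probe direction and then ranks them. It is not the paper's formulation: the scoring rule (a plain dot product with a unit-norm direction), the function names `probe_scores` and `top_k_examples`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def probe_scores(activations, probe_direction):
    """Project each example's activation onto a unit-norm probe direction.

    Assumption: one activation vector per training example, shape (n, d).
    """
    direction = np.asarray(probe_direction, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("probe_direction must be non-zero")
    return np.asarray(activations, dtype=float) @ (direction / norm)

def top_k_examples(scores, k):
    """Return the indices of the k highest-scoring examples, highest first."""
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Hypothetical toy data: four training examples with 2-d activations.
train_activations = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
    [-1.0, 0.5],
])
# Hypothetical concept direction, e.g. the weight vector of a linear probe.
concept_direction = np.array([3.0, 0.0])

scores = probe_scores(train_activations, concept_direction)
ranked = top_k_examples(scores, k=2)
print(ranked)
```

Here the direction is normalized before projection so that scores are comparable across probes of different scale; whether the paper does the same is not stated in the abstract.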

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Domain Adaptation and Few-Shot Learning