Attributing Learned Concepts in Neural Networks to Training Data

Nicholas Konz; Charles Godfrey; Madelyn Shapiro; Jonathan Tu; Henry Kvinge; Davis Brown

arXiv:2310.03149·cs.LG·December 29, 2023

TL;DR

This paper investigates which training examples most influence the concepts a neural network learns, combining data attribution with concept probing. It finds that concepts form from robust features spread diffusely across the training data rather than from a few specific examples.

Contribution

The paper combines data attribution with concept probing to identify the training data most influential for the concepts a neural network learns.

Findings

01 Removing the top-attributing images for a concept and retraining does not change where the concept is located in the network.

02 Concept features are spread diffusely across the training data.

03 Concept formation shows some evidence of robustness and convergence.

Abstract

By now there is substantial evidence that deep learning models learn certain human-interpretable features as part of their internal representations of data. As having the right (or wrong) concepts is critical to trustworthy machine learning systems, it is natural to ask which inputs from the model's original training set were most important for learning a concept at a given layer. To answer this, we combine data attribution methods with methods for probing the concepts learned by a model. Training network and probe ensembles for two concept datasets on a range of network layers, we use the recently developed TRAK method for large-scale data attribution. We find some evidence for convergence, where removing the 10,000 top attributing images for a concept and retraining the model does not change the location of the concept in the network nor the probing sparsity of the concept. This…
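The removal experiment described in the abstract (drop the top-attributing training images for a concept, then retrain) can be illustrated with a minimal sketch. It assumes per-example attribution scores have already been computed by some attribution method; it does not implement TRAK, and the helper name `drop_top_attributed` and the toy scores are hypothetical, introduced only for illustration.

```python
from typing import List, Sequence


def drop_top_attributed(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of training examples to keep after removing the k
    examples with the highest attribution scores (hypothetical helper)."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    # Rank training indices by attribution score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    removed = set(ranked[:k])
    # Preserve the original dataset order for the retained examples.
    return [i for i in range(len(scores)) if i not in removed]


# Toy scores standing in for per-example attributions to one concept probe.
toy_scores = [0.1, 0.9, -0.3, 0.7, 0.2]
kept = drop_top_attributed(toy_scores, k=2)
print(kept)
```

In the setting the abstract describes, the retained indices would define the reduced training set used for retraining, after which the concept probes would be refit and compared against the originals.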

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning