TL;DR
This paper introduces a scalable method for computing influence functions in large Transformer models, enabling analysis of training data impact on predictions for models with hundreds of millions of parameters.
Contribution
It proposes and analyzes an approach based on Arnoldi iteration for speeding up the inverse Hessian calculation, which allows influence functions to scale to full-size language and vision Transformer models.
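For context, the inverse Hessian enters through the commonly used definition of an influence function. The form below is the standard textbook formulation, not necessarily the paper's exact notation or sign convention; the symbols (training example, test example, loss, fitted parameters) are introduced here for illustration.

```latex
\mathcal{I}(z_{\text{train}}, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}\,
      H_{\hat\theta}^{-1}\,
      \nabla_\theta L(z_{\text{train}}, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

For a model with p parameters, the Hessian is a p × p matrix, so forming or inverting it explicitly is infeasible at hundreds of millions of parameters. This is why an approximation of the inverse Hessian is the bottleneck the paper targets.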
Findings
Scaled influence functions to full-size Transformer models with several hundred million parameters.
Evaluated the approach on image classification and sequence-to-sequence tasks with tens of millions to a hundred million training examples.
Code is to be released at https://github.com/google-research/jax-influence.
Abstract
We address efficient calculation of influence functions for tracking predictions back to the training data. We propose and analyze a new approach to speeding up the inverse Hessian calculation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) Transformer models with several hundred million parameters. We evaluate our approach on image classification and sequence-to-sequence tasks with tens of millions to a hundred million training examples. Our code will be available at https://github.com/google-research/jax-influence.
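The abstract states only that the speed-up is based on Arnoldi iteration. The sketch below shows one plausible way such a method could work, and should not be read as the authors' implementation or as the API of the linked repository. It assumes access to a Hessian-vector product, builds a small Krylov basis with Arnoldi iteration, keeps the dominant eigenpairs, and evaluates the inverse-Hessian inner product in that low-dimensional subspace. All function names, the toy 6 × 6 matrix, and the choice to rank eigenvalues by magnitude are assumptions made for illustration.

```python
import numpy as np


def arnoldi(hvp, dim, n_iters, seed=0):
    """Arnoldi iteration using only Hessian-vector products.

    Returns Q (rows form an orthonormal Krylov basis) and the small
    matrix T that represents the Hessian restricted to that basis.
    """
    rng = np.random.default_rng(seed)
    q0 = rng.standard_normal(dim)
    Q = np.zeros((n_iters + 1, dim))
    T = np.zeros((n_iters + 1, n_iters))
    Q[0] = q0 / np.linalg.norm(q0)
    for k in range(n_iters):
        w = hvp(Q[k])
        # Modified Gram-Schmidt against all previous basis vectors.
        for j in range(k + 1):
            T[j, k] = Q[j] @ w
            w = w - T[j, k] * Q[j]
        T[k + 1, k] = np.linalg.norm(w)
        if T[k + 1, k] < 1e-10:  # Krylov space exhausted, stop early
            return Q[: k + 1], T[: k + 1, : k + 1]
        Q[k + 1] = w / T[k + 1, k]
    return Q[:n_iters], T[:n_iters, :n_iters]


def top_eigen_projector(Q, T, k):
    """Keep the k largest-magnitude eigenpairs of the small matrix T.

    Returns the eigenvalues and a (k, dim) matrix G that projects a
    full gradient onto the corresponding approximate eigenvectors.
    """
    T_sym = 0.5 * (T + T.T)  # the Hessian is symmetric, so symmetrize T
    evals, evecs = np.linalg.eigh(T_sym)
    order = np.argsort(-np.abs(evals))[:k]
    return evals[order], evecs[:, order].T @ Q


def influence(g_test, g_train, evals, G):
    """Approximate g_test^T H^{-1} g_train inside the retained subspace."""
    return float((G @ g_test) @ ((G @ g_train) / evals))


# Toy check on a small symmetric matrix standing in for a Hessian.
rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.standard_normal((6, 6)))
H = A @ np.diag([10.0, 5.0, 2.0, 1.0, 0.5, 0.2]) @ A.T
g_test = rng.standard_normal(6)
g_train = rng.standard_normal(6)

Q, T = arnoldi(lambda v: H @ v, dim=6, n_iters=6)
evals, G = top_eigen_projector(Q, T, k=6)
approx = influence(g_test, g_train, evals, G)
exact = float(g_test @ np.linalg.solve(H, g_train))
```

With the full Krylov space retained, as in this toy setup, the projected value agrees with the exact solve up to rounding. In a large model one would keep far fewer eigenpairs than parameters and store only the projected gradients, trading accuracy for memory and speed.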
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Softmax · Residual Connection · Adam · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections
