Intriguing Properties of Data Attribution on Diffusion Models
Xiaosen Zheng, Tianyu Pang, Chao Du, Jing Jiang, Min Lin

TL;DR
This paper investigates data attribution methods for diffusion models, revealing surprising empirical results that challenge theoretical assumptions and proposing a more efficient attribution approach based on extensive experiments.
Contribution
It introduces a novel, efficient data attribution method for diffusion models and uncovers counter-intuitive empirical findings that question existing theoretical guidance.
Findings
Counter-intuitive design choices outperform baselines
The new method is more computationally efficient
Theoretical assumptions may hinder attribution performance
Abstract
Data attribution seeks to trace model outputs back to training data. With the recent development of diffusion models, data attribution has become a desired module to properly assign valuations for high-quality or copyrighted training samples, ensuring that data contributors are fairly compensated or credited. Several theoretically motivated methods have been proposed to implement data attribution, in an effort to improve the trade-off between computational scalability and effectiveness. In this work, we conduct extensive experiments and ablation studies on attributing diffusion models, specifically focusing on DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive observations that theoretically unjustified design choices for attribution empirically outperform previous baselines by a large margin,…
Peer Reviews
Decision: ICLR 2024 poster
The paper is simple and easy to follow. The extensive experiments on different settings provide solid evidence of D-TRAK performing better than TRAK in terms of LDS. Readers can be easily convinced that there is an issue with either TRAK as a data attribution method or LDS as a data attribution metric.
Despite the solid experimental results, the desiderata of a data attribution paper are different from those of an adversarial attack paper. For adversarial attacks, the success of an attack is a sufficient contribution. The same cannot be said for data attribution. Successfully finding techniques to optimize for a data attribution metric is only meaningful **if the technique reveals insight**, because in practice attackers have no control over the data attribution method. Therefore, unlike writing adversaria
- The authors thoroughly test their proposed method on a number of datasets
- The authors present strong empirical results across a variety of settings
- Out of the listed baselines, to the best of my knowledge only Journey TRAK [1] has been explicitly used for diffusion models in previous work. As the authors note, Journey TRAK is not meant to be used to attribute the *final* image $x$ (i.e., the entire sampling trajectory). Rather, it is meant to attribute noisy images $x_t$ (i.e., specific denoising steps along the sampling trajectory). Thus, the direct comparison with Journey TRAK in the evaluation section is not on equal grounds.
- For th
Data attribution is an interesting problem that's increasingly important with widely used generative models trained on web-scraped datasets. It's computationally and theoretically challenging, so developing new methods for this task is a valuable contribution. This work builds off of one of the better-performing methods in the literature, TRAK, and observes performance for D-TRAK that makes it, to my knowledge, the most effective method available. And it preserves the advantages of TRAK, particular
The main weakness is that, in terms of data attribution methodology, the contribution here is shallow. The paper essentially finds that a couple of heuristics improve performance, and offers no explanation for why. The paper acknowledges this with statements like "the mechanism of data attribution requires a deeper understanding," which are true, but this is not ideal for a publication. A paper proposing a new and improved method should offer some understanding of why it works, and the paper barely
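The reviews above judge D-TRAK and TRAK by LDS. As a rough illustration only, and assuming LDS here means the linear datamodeling score as it is commonly defined in the TRAK line of work (the Spearman rank correlation between outputs predicted from attribution scores and outputs of models actually retrained on random training subsets), the metric could be sketched as follows. All function names and the toy numbers are hypothetical and are not taken from the paper or its code.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def linear_datamodeling_score(scores, subsets, retrained_outputs):
    """Assumed LDS definition for a single generated sample.

    scores[i]            attribution score of training example i
    subsets[m]           indices of the m-th random training subset
    retrained_outputs[m] measured output of a model retrained on subsets[m]
    """
    # Additive prediction: sum the scores of the examples kept in each subset.
    predicted = [sum(scores[i] for i in subset) for subset in subsets]
    return spearman(predicted, retrained_outputs)


# Toy inputs (made up): 4 training examples, 4 random subsets.
toy_scores = [0.5, -1.0, 2.0, 0.1]
toy_subsets = [[0, 1], [2, 3], [0, 2], [1, 3]]
toy_outputs = [1.0, 3.0, 4.0, 0.5]
toy_lds = linear_datamodeling_score(toy_scores, toy_subsets, toy_outputs)
```

Because the score depends only on rank order, an attribution method can do well under this metric without its scores being calibrated in absolute terms, which is relevant to the reviewers' question of whether a higher LDS by itself reveals insight.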
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
Methods: Diffusion
