Intriguing Properties of Data Attribution on Diffusion Models

Xiaosen Zheng; Tianyu Pang; Chao Du; Jing Jiang; Min Lin

arXiv:2311.00500 · cs.LG · March 18, 2024 · 1 citation

TL;DR

This paper investigates data attribution methods for diffusion models, revealing surprising empirical results that challenge theoretical assumptions and proposing a more efficient attribution approach based on extensive experiments.

Contribution

It introduces a novel, efficient data attribution method for diffusion models and uncovers counter-intuitive empirical findings that question existing theoretical guidance.

Findings

01

Counter-intuitive design choices outperform baselines

02

The new method is more computationally efficient

03

Theoretical assumptions may hinder attribution performance

Abstract

Data attribution seeks to trace model outputs back to training data. With the recent development of diffusion models, data attribution has become a desired module to properly assign valuations for high-quality or copyrighted training samples, ensuring that data contributors are fairly compensated or credited. Several theoretically motivated methods have been proposed to implement data attribution, in an effort to improve the trade-off between computational scalability and effectiveness. In this work, we conduct extensive experiments and ablation studies on attributing diffusion models, specifically focusing on DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive observations that theoretically unjustified design choices for attribution empirically outperform previous baselines by a large margin,…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

The paper is simple and easy to follow. The extensive experiments on different settings provide solid evidence of D-TRAK performing better than TRAK in terms of LDS. Readers can be easily convinced that there is an issue with either TRAK as a data attribution method or LDS as a data attribution metric.
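For readers unfamiliar with the LDS metric named in this review, the following is a minimal sketch of how a linear datamodeling score is commonly computed: attribution scores are summed over each random training subset to predict the output of a model retrained on that subset, and the predictions are compared with the actual outputs by Spearman rank correlation. All names here (`linear_datamodeling_score`, `subset_masks`, `subset_outputs`) and the synthetic data are illustrative assumptions, not the authors' code, and the rank helper assumes no tied values.

```python
import numpy as np

def _ranks(values):
    # Rank positions 0..n-1; assumes no ties (reasonable for continuous values).
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    # Spearman correlation = Pearson correlation of the ranks.
    ra, rb = _ranks(np.asarray(a)), _ranks(np.asarray(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def linear_datamodeling_score(scores, subset_masks, subset_outputs):
    # scores: (n_train,) attribution scores for one test sample.
    # subset_masks: (n_subsets, n_train) boolean membership of each subset.
    # subset_outputs: (n_subsets,) measured output of the model retrained
    #                 on each subset (hypothetical inputs in this sketch).
    predicted = subset_masks.astype(float) @ scores
    return spearman(predicted, subset_outputs)

# Synthetic check: if outputs really are additive in per-sample effects,
# the true effects give a score of 1 and their negation gives -1.
rng = np.random.default_rng(0)
n_train, n_subsets = 50, 64
true_effects = rng.normal(size=n_train)
masks = rng.random((n_subsets, n_train)) < 0.5
outputs = masks.astype(float) @ true_effects

lds_good = linear_datamodeling_score(true_effects, masks, outputs)
lds_bad = linear_datamodeling_score(-true_effects, masks, outputs)
```

In practice the subset outputs come from retraining the model many times, which is why the metric is expensive to evaluate; the synthetic data above only stands in for those retrained models.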

Weaknesses

Despite the solid experimental results, the desiderata of a data attribution paper differ from those of an adversarial attack paper. For adversarial attacks, the success of an attack is a sufficient contribution. This could not be said for data attribution. Successfully finding techniques to optimize for a data attribution metric is only meaningful **if the technique reveals insight**, because in practice attackers have no control over the data attribution method. Therefore, unlike writing adversaria

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The authors thoroughly test their proposed method on a number of datasets
- The authors present strong empirical results across a variety of settings

Weaknesses

- Out of the listed baselines, to the best of my knowledge only Journey TRAK [1] has been explicitly used for diffusion models in previous work. As the authors note, Journey TRAK is not meant to be used to attribute the *final* image $x$ (i.e., the entire sampling trajectory). Rather, it is meant to attribute noisy images $x_t$ (i.e., specific denoising steps along the sampling trajectory). Thus, the direct comparison with Journey TRAK in the evaluation section is not on equal grounds.
- For th

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

Data attribution is an interesting problem that's increasingly important with widely used generative models trained on web-scraped datasets. It's computationally and theoretically challenging, so developing new methods for this task is a valuable contribution. This work builds off one of the better-performing methods in the literature, TRAK, and observes performance for D-TRAK that makes it, to my knowledge, the most effective method available. And it preserves the advantages of TRAK, particular

Weaknesses

The main weakness is that in terms of data attribution methodology, the contribution here is shallow. The paper essentially finds that a couple of heuristics improve performance, and offers no explanation for why. The paper acknowledges this with statements like "the mechanism of data attribution requires a deeper understanding," which are true, but this is not ideal for a publication. A paper proposing a new and improved method should offer some understanding of why it works, and the paper barely

Code & Models

Repositories

sail-sg/d-trak
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques

Methods: Diffusion