DeCLIP: Decoding CLIP representations for deepfake localization
Stefan Smeu, Elisabeta Oneata, Dan Oneata

TL;DR
DeCLIP combines features from large pretrained self-supervised models such as CLIP with a convolutional decoder to detect and localize partial deepfake manipulations, including the challenging case of edits produced by latent diffusion models.
Contribution
The authors present this work as a first attempt to use pretrained CLIP features for detecting and localizing local manipulations, with the aim of improving generalization across different generative models.
Findings
Pretrained features enable effective localization of manipulated regions.
The approach generalizes well to latent diffusion models.
Combining CLIP with a convolutional decoder enhances detection robustness.
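To make the data flow behind these findings concrete, here is a minimal NumPy sketch of decoding a grid of patch features into a pixel-level manipulation mask. It is a toy under stated assumptions, not the authors' implementation: the patch tokens are random stand-ins rather than real CLIP outputs, the weights are untrained, and the two-layer decoder, nearest-neighbor upsampling, and all sizes and names (`decode_mask`, `patch_size = 14`, and so on) are illustrative choices that do not come from the paper.

```python
import numpy as np


def tokens_to_grid(tokens, grid_h, grid_w):
    """Reshape (N, D) patch tokens into a (D, grid_h, grid_w) feature map."""
    n, d = tokens.shape
    assert n == grid_h * grid_w, "token count must match the patch grid"
    return tokens.reshape(grid_h, grid_w, d).transpose(2, 0, 1)


def conv3x3(x, weight, bias):
    """Same-padded 3x3 convolution.

    x: (C_in, H, W), weight: (C_out, C_in, 3, 3), bias: (C_out,).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], window)
    return out + bias[:, None, None]


def upsample_nearest(x, factor):
    """Repeat each spatial cell `factor` times along height and width."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def decode_mask(tokens, grid_h, grid_w, params, patch_size):
    """Map patch tokens to a per-pixel manipulation probability map."""
    x = tokens_to_grid(tokens, grid_h, grid_w)
    x = np.maximum(conv3x3(x, *params["conv1"]), 0.0)  # ReLU
    logits = conv3x3(x, *params["conv2"])              # (1, grid_h, grid_w)
    logits = upsample_nearest(logits, patch_size)      # back to pixel resolution
    return 1.0 / (1.0 + np.exp(-logits[0]))            # sigmoid


# Hypothetical sizes: a 4x4 patch grid, 8-dim features, 14-pixel patches.
rng = np.random.default_rng(0)
grid_h = grid_w = 4
dim, hidden, patch_size = 8, 4, 14

# Stand-in for frozen backbone features (random, NOT real CLIP outputs).
tokens = rng.normal(size=(grid_h * grid_w, dim))
params = {
    "conv1": (rng.normal(scale=0.1, size=(hidden, dim, 3, 3)), np.zeros(hidden)),
    "conv2": (rng.normal(scale=0.1, size=(1, hidden, 3, 3)), np.zeros(1)),
}

mask = decode_mask(tokens, grid_h, grid_w, params, patch_size)
print(mask.shape)  # (56, 56)
```

In a real setup, the tokens would come from a frozen pretrained image encoder and only the decoder weights would be trained against ground-truth manipulation masks; thresholding the resulting map (for example at 0.5) would yield a binary localization.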
Abstract
Generative models can create entirely new images, but they can also partially modify real images in ways that are undetectable to the human eye. In this paper, we address the challenge of automatically detecting such local manipulations. One of the most pressing problems in deepfake detection remains the ability of models to generalize to different classes of generators. In the case of fully manipulated images, representations extracted from large self-supervised models (such as CLIP) provide a promising direction towards more robust detectors. Here, we introduce DeCLIP, a first attempt to leverage such large pretrained features for detecting local manipulations. We show that, when combined with a reasonably large convolutional decoder, pretrained self-supervised representations are able to perform localization and improve generalization capabilities over existing methods. Unlike…
Taxonomy
Topics: Digital Media Forensic Detection
Methods: Diffusion
