LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection
Ana Vasilcoiu, Ivona Najdenkoska, Zeno Geradts, Marcel Worring

TL;DR
LATTE introduces a novel method that models the evolution of latent embeddings across multiple denoising steps to effectively detect diffusion-generated images, outperforming existing approaches in various challenging scenarios.
Contribution
LATTE is the first to model latent trajectory evolution across denoising steps for improved diffusion-generated image detection.
Findings
Achieves superior detection accuracy on multiple benchmarks.
Excels in cross-generator and cross-dataset scenarios.
Demonstrates the effectiveness of latent trajectory modeling.
Abstract
The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This erodes trust in digital media, making it critical to develop generated image detectors that remain reliable across different generators. While recent approaches leverage diffusion denoising cues, they typically rely on single-step reconstruction errors and overlook the sequential nature of the denoising process. In this work, we propose LATTE - LATent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across multiple denoising steps. Instead of treating each denoising step in isolation, LATTE captures the trajectory of these representations, revealing subtle and discriminative patterns that distinguish real from generated images. Experiments on several benchmarks, such as GenImage, Chameleon, and Diffusion…
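The abstract contrasts single-step reconstruction errors with a trajectory of latent embeddings collected over several denoising steps. The sketch below only illustrates that general idea and is not the authors' implementation. The linear beta schedule, the `toy_denoiser` stand-in (used in place of a pretrained diffusion model), the chosen timesteps, and every function name are assumptions made for illustration.

```python
import numpy as np

def alpha_bar(t, num_steps=1000):
    """Cumulative signal level at step t under an ASSUMED linear beta schedule."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    return float(np.prod(1.0 - betas[: t + 1]))

def add_noise(z0, t, rng):
    """Forward-diffuse a clean latent z0 to timestep t with freshly sampled noise."""
    a = alpha_bar(t)
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def toy_denoiser(z_t, t):
    """HYPOTHETICAL noise predictor; a real setup would call a pretrained model."""
    return np.tanh(z_t) * np.sqrt(1.0 - alpha_bar(t))

def latent_trajectory(z0, timesteps, seed=0):
    """Collect one denoised latent estimate per timestep, in order."""
    rng = np.random.default_rng(seed)
    steps = []
    for t in timesteps:
        a = alpha_bar(t)
        z_t = add_noise(z0, t, rng)
        eps_hat = toy_denoiser(z_t, t)
        # Invert the forward process using the predicted noise.
        z0_hat = (z_t - np.sqrt(1.0 - a) * eps_hat) / np.sqrt(a)
        steps.append(z0_hat.ravel())
    return np.stack(steps)  # shape: (len(timesteps), latent_dim)

def trajectory_features(traj):
    """Concatenate per-step latents with step-to-step differences."""
    deltas = np.diff(traj, axis=0)
    return np.concatenate([traj.ravel(), deltas.ravel()])

# Example: a 4x8x8 latent observed at three noise levels.
example_traj = latent_trajectory(np.ones((4, 8, 8)), [100, 300, 500])
example_feats = trajectory_features(example_traj)
```

A single-step detector would use only one row of `example_traj`. Keeping all rows, plus their differences, is one simple way to expose how the representation changes from step to step. A downstream classifier would consume `example_feats`; it is left out here because the abstract does not specify it.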
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. Achieves good results on several benchmarks, outperforming several prominent baselines.
Weaknesses: 1. Equations 5/6 rely on randomly sampled noise (epsilon). At higher noise levels, this would cause large differences in the outputs produced by the network, so it could affect the features the network extracts. To resolve these doubts, the authors should report all results as a mean and standard deviation over 10-15 evaluations. This casts doubt on one result (in Fig. 3, training on SDv1.4 generalizes to BigGAN, but training on SDv1.5 doe
The paper focuses on a timely and important topic. The paper is generally well-written and easy to follow. The images are clear.
The paper raises the following concerns: 1. Given the nature of the proposed method, inference is likely to be very slow, which limits real-world deployment, and there are no experiments on efficiency. 2. The paper focuses only on diffusion-generated images. However, SOTA generation models such as GPT-Image-1 and Nano Banana may be autoregression-based. 3. No recent model such as FLUX is discussed. 4. The novelty of the proposed method is below the bar of ICLR. For myself, Latent
- The authors’ key idea is that because of how diffusion models work, we do not need to limit ourselves to learning a classifier solely using the image pixels of real and fake images. Instead, we can obtain their intermediate latents, which lead up to the final image, and see how all those latents differ for real and fake images. This is a very reasonable idea, and, to the best of my knowledge, has not been explored before. - The authors have evaluated their method on multiple datasets and sho
- While the idea of using latents of real and fake images makes sense, there are many other areas where the paper lacks strong motivation and analysis. One of these is the idea of mixing the latents with the visual image features. I will list a couple of sections of the paper which are somewhat vague and do not provide a concrete explanation about this process. - Line 194 - 195: Why is it important to ground the latent features to the visual context through the attention mechanism? In other
1. The paper is easy to read and the overall methodology is clear. 2. The evaluation is thorough and includes several baselines, cross-domain comparison and a robustness analysis to common corruptions. 3. The components of the proposed method are ablated such that the contribution of different features is understood.
1. The idea of using information in trajectories is not new. Consider including a discussion on the connection of this work with prior findings. For example, the departure from single-step errors to exploit the multi-step nature of diffusion has been previously established in [1]. 2. The authors' justification for the inclusion of LATTE in the overall pipeline is unclear. The ablations in Table 4 show that the LATTE features alone are not effective for the task. Instead, it is the visual backbone fe
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · AI in cancer detection
