Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
Yiqing Shi, Yiren Song, Mike Zheng Shou

TL;DR
Edit2Perceive leverages image editing diffusion models to achieve state-of-the-art dense perception across depth, normal, and matting tasks, emphasizing structure preservation and efficiency.
Contribution
The paper introduces a unified diffusion framework, Edit2Perceive, that adapts editing models for dense perception tasks with structure-preserving refinement and faster inference.
Findings
State-of-the-art results across depth, normal, and matting tasks
Effective structure-preserving refinement during denoising
Faster inference with single-step deterministic approach
Abstract
Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive…
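The abstract names a pixel-space consistency loss applied to intermediate denoising states but does not give its form. The sketch below is one plausible reading, not the authors' implementation. It assumes a rectified-flow parameterization (x_t = (1 - t) * x0 + t * noise, velocity v = noise - x0), recovers a one-step estimate of the clean latent from an intermediate state, decodes it, and compares it with the target in pixel space. The function names, the toy `decode` stand-in for a VAE decoder, and the use of mean squared error are all assumptions made for illustration.

```python
# Toy illustration only: lists of floats stand in for latents, and `decode`
# is a hypothetical stand-in for a VAE decoder. None of this is taken from
# the paper's code.

def interpolate(x0, noise, t):
    """Assumed rectified-flow intermediate state: x_t = (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]

def estimate_clean(x_t, velocity, t):
    """One-step estimate of the clean latent: x0_hat = x_t - t * v."""
    return [a - t * v for a, v in zip(x_t, velocity)]

def decode(latent):
    """Hypothetical decoder from latent to pixel space (here: scale and shift)."""
    return [0.5 * z + 0.5 for z in latent]

def pixel_consistency_loss(x_t, velocity, t, target_pixels):
    """Mean squared error between the decoded one-step estimate and the target."""
    pred = decode(estimate_clean(x_t, velocity, t))
    return sum((p - q) ** 2 for p, q in zip(pred, target_pixels)) / len(pred)

x0 = [0.2, -0.4, 0.8]        # clean latent (e.g. an encoded depth map)
noise = [1.0, 0.5, -1.0]     # noise sample
t = 0.6                      # an intermediate denoising time
x_t = interpolate(x0, noise, t)

true_velocity = [b - a for a, b in zip(x0, noise)]   # v = noise - x0
loss_exact = pixel_consistency_loss(x_t, true_velocity, t, decode(x0))
loss_zero = pixel_consistency_loss(x_t, [0.0, 0.0, 0.0], t, decode(x0))
```

With the exact velocity, the one-step estimate recovers x0 and the loss is zero up to floating-point error. A wrong velocity leaves a positive pixel-space penalty at that intermediate time, which is the sense in which such a loss could constrain every denoising state rather than only the final one.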
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Cell Image Analysis Techniques
