TL;DR
MERGE is a unified model that leverages pre-trained text-to-image diffusion models for both high-quality image generation and accurate depth estimation without degrading the original image generation capabilities.
Contribution
The paper introduces MERGE, a novel framework that enables seamless switching between image generation and depth estimation using a fixed pre-trained diffusion model.
Findings
MERGE achieves state-of-the-art results on multiple depth estimation benchmarks.
The model preserves the original image generation ability of the pre-trained diffusion model.
The play-and-plug framework simplifies switching between tasks with pluggable converters.
Abstract
Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation: it can also be extended to depth estimation effortlessly. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable…
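The switching idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the layers are plain Python functions standing in for frozen pre-trained blocks and small learnable converters, and every name here (`SwitchableModel`, `make_scale_layer`, `make_shift_converter`, the `mode` strings) is hypothetical. Placing a converter before a block and sharing one converter across several blocks are assumptions made for illustration only; the abstract does not spell out either detail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_scale_layer(scale: float) -> Layer:
    """Toy stand-in for one frozen pre-trained block (hypothetical)."""
    def layer(x: Vector) -> Vector:
        return [scale * v for v in x]
    return layer


def make_shift_converter(shift: float) -> Layer:
    """Toy stand-in for a small learnable converter (hypothetical)."""
    def converter(x: Vector) -> Vector:
        return [v + shift for v in x]
    return converter


@dataclass
class SwitchableModel:
    # The frozen blocks are never modified by this class.
    frozen_blocks: List[Layer]
    # Maps a block index to the converter applied before that block
    # (the placement is an assumption of this sketch).
    converters: Dict[int, Layer] = field(default_factory=dict)
    mode: str = "image"

    def forward(self, x: Vector) -> Vector:
        for i, block in enumerate(self.frozen_blocks):
            # Converters are only active in depth mode; in image mode the
            # computation is exactly the frozen blocks in sequence.
            if self.mode == "depth" and i in self.converters:
                x = self.converters[i](x)
            x = block(x)
        return x


blocks = [make_scale_layer(2.0), make_scale_layer(3.0)]
base_output = blocks[1](blocks[0]([1.0, 2.0]))  # frozen blocks alone

# One converter object reused by both blocks: a loose illustration of
# parameter reuse, not the paper's Group Reuse Mechanism as defined there.
shared = make_shift_converter(1.0)
model = SwitchableModel(blocks, converters={0: shared, 1: shared})

image_output = model.forward([1.0, 2.0])
model.mode = "depth"
depth_output = model.forward([1.0, 2.0])
```

In image mode the converters are skipped, so `image_output` equals `base_output`. Only the depth path sees the extra layers, which is the property the abstract attributes to the fixed pre-trained model.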
