More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Hongkai Lin; Dingkang Liang; Mingyang Du; Xin Zhou; Xiang Bai

arXiv:2510.23574 · cs.CV · October 28, 2025

TL;DR

MERGE is a unified model that leverages pre-trained text-to-image diffusion models for both high-quality image generation and accurate depth estimation without degrading the original image generation capabilities.

Contribution

The paper introduces MERGE, a novel framework that enables seamless switching between image generation and depth estimation using a fixed pre-trained diffusion model.

Findings

01. MERGE achieves state-of-the-art results on multiple depth estimation benchmarks.

02. The model preserves the original image generation ability of the pre-trained diffusion model.

03. The play-and-plug framework simplifies switching between tasks with pluggable converters.
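The idea of switching tasks by plugging converters into a fixed model can be illustrated with a toy sketch. This is not the paper's architecture or code: `FrozenBlock`, `Converter`, and `SwitchableModel` are hypothetical names, the arithmetic stands in for real network layers, and placing one converter before each block is an assumption made only for illustration. The sketch shows just the property the findings describe: with converters unplugged, the output matches the untouched pre-trained model exactly.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass(frozen=True)
class FrozenBlock:
    """Toy stand-in for one pre-trained block; frozen, so it cannot be updated."""
    scale: float
    shift: float

    def __call__(self, x: Vector) -> Vector:
        return [self.scale * v + self.shift for v in x]


@dataclass
class Converter:
    """Hypothetical pluggable module holding the only learnable parameter."""
    gain: float = 0.5

    def __call__(self, x: Vector) -> Vector:
        # Residual-style adjustment: input plus a learned correction.
        return [v + self.gain * v for v in x]


@dataclass
class SwitchableModel:
    """Fixed blocks plus optional converters, selected by mode."""
    blocks: List[FrozenBlock]
    converters: List[Converter]
    mode: str = "generation"

    def set_mode(self, mode: str) -> None:
        if mode not in ("generation", "depth"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    def forward(self, x: Vector) -> Vector:
        for block, converter in zip(self.blocks, self.converters):
            if self.mode == "depth":
                x = converter(x)  # converters are plugged in only for depth
            x = block(x)          # the frozen block runs in both modes
        return x


def build_toy_model() -> SwitchableModel:
    blocks = [FrozenBlock(2.0, 1.0), FrozenBlock(0.5, 0.0)]
    converters = [Converter(), Converter()]
    return SwitchableModel(blocks, converters)
```

Calling `set_mode("depth")` routes activations through the converters, and `set_mode("generation")` bypasses them again. Because the blocks are never modified, switching back reproduces the original outputs, which is the toy analogue of preserving the generation capability.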

Abstract

Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than generate images: it can also be extended to depth estimation effortlessly. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable…
