# HADIS: Hybrid Adaptive Diffusion Model Serving for Efficient Text-to-Image Generation

**Authors:** Qizheng Yang, Tung-I Chen, Siyu Zhao, Ramesh K. Sitaraman, Hui Guan

arXiv: 2509.00642 · 2026-01-07

## TL;DR

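HADIS is a hybrid diffusion model serving system that routes each text-to-image query directly to a suitable model variant and reroutes it to a cascaded heavyweight model only when necessary. On real-world traces it improves response quality by up to 35% and reduces latency violation rates by 2.7-45$\times$ compared to state-of-the-art model serving systems.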

## Contribution

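The paper proposes a hybrid architecture for query-aware diffusion model serving, analyzes theoretically the conditions under which it outperforms non-hybrid alternatives in latency and response quality, and builds HADIS, a serving system that jointly optimizes cascade model selection, query routing, and resource allocation.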

## Key findings

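- Response quality improves by up to 35% over state-of-the-art model serving systems on real-world traces.
- Latency violation rates fall by 2.7-45$\times$ against the same baselines.
- An offline profiling phase produces a Pareto-optimal cascade configuration table, which reduces the complexity of runtime resource management.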
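As a rough illustration of the offline profiling idea, the sketch below filters profiled cascade configurations down to a Pareto-optimal table and then picks one under a latency budget. It is a minimal sketch, not the paper's implementation: the `CascadeConfig` fields, the function names, and the two-objective (latency, quality) view are all assumptions made for this example, and the real system also accounts for workload and GPU allocation.

```python
from typing import List, NamedTuple, Optional


class CascadeConfig(NamedTuple):
    # All fields are hypothetical; the paper's actual table schema may differ.
    name: str          # label for a cascade configuration
    latency_s: float   # profiled latency in seconds (lower is better)
    quality: float     # profiled quality score (higher is better)


def pareto_table(configs: List[CascadeConfig]) -> List[CascadeConfig]:
    """Keep only configurations that no other configuration dominates."""
    table: List[CascadeConfig] = []
    best_quality = float("-inf")
    # Scan from fastest to slowest; on latency ties, visit higher quality first.
    for cfg in sorted(configs, key=lambda c: (c.latency_s, -c.quality)):
        # A slower config earns a slot only if it strictly improves quality.
        if cfg.quality > best_quality:
            table.append(cfg)
            best_quality = cfg.quality
    return table


def select_config(
    table: List[CascadeConfig], latency_budget_s: float
) -> Optional[CascadeConfig]:
    """Return the highest-quality configuration that fits the latency budget."""
    feasible = [c for c in table if c.latency_s <= latency_budget_s]
    return max(feasible, key=lambda c: c.quality) if feasible else None
```

Building the table once offline means the runtime step reduces to a lookup over a short list rather than a search over every possible configuration.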

## Abstract

Text-to-image diffusion models have achieved remarkable visual quality but incur high computational costs, making latency-aware, scalable deployment challenging. To address this, we advocate a hybrid architecture that achieves query awareness when serving diffusion models. Unlike existing query-aware serving systems that cascade lightweight and heavyweight models with a fixed configuration, our hybrid architecture first routes each query directly to a suitable model variant, then reroutes it to a cascaded heavyweight model only if necessary. We theoretically analyze conditions for the hybrid architecture to outperform non-hybrid alternatives in latency and response quality. Building on this architecture, we design HADIS, a hybrid serving system for latency-aware diffusion models that jointly optimizes cascade model selection, query routing, and resource allocation. To reduce the complexity of resource management, HADIS uses an offline profiling phase to produce a Pareto-optimal cascade configuration table. At runtime, HADIS selects the best cascade configuration and GPU allocation given latency and workload constraints. Empirical evaluations on real-world traces demonstrate that HADIS improves response quality by up to 35% while reducing latency violation rates by 2.7-45$\times$ compared to state-of-the-art model serving systems.
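The routing flow described in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: `route`, `light`, `heavy`, `score`, and `threshold` are hypothetical stand-ins, and using a quality score with a fixed threshold to decide when rerouting is "necessary" is an assumption made for this example.

```python
from typing import Callable, Tuple


def serve_hybrid(
    prompt: str,
    route: Callable[[str], str],         # hypothetical router: returns "light" or "heavy"
    light: Callable[[str], str],         # hypothetical lightweight model variant
    heavy: Callable[[str], str],         # hypothetical heavyweight model variant
    score: Callable[[str, str], float],  # hypothetical quality estimate for (prompt, image)
    threshold: float,                    # assumed minimum acceptable quality score
) -> Tuple[str, str]:
    """Return (image, path), where path records which route served the query."""
    # Step 1: send the query straight to the variant the router picks.
    if route(prompt) == "heavy":
        return heavy(prompt), "heavy-direct"

    # Step 2: try the lightweight variant first.
    image = light(prompt)
    if score(prompt, image) >= threshold:
        return image, "light"

    # Step 3: reroute to the heavyweight model only when the result falls short.
    return heavy(prompt), "heavy-rerouted"
```

Compared with a fixed cascade, queries the router sends directly to the heavyweight model skip the lightweight pass entirely, which is where the latency saving in this sketch comes from.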

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00642/full.md

## Figures

20 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00642/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/2509.00642/full.md

---
Source: https://tomesphere.com/paper/2509.00642