Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Jiajun Wang; Morteza Ghahremani; Yitong Li; Björn Ommer; Christian Wachinger

arXiv:2406.02485·cs.CV·November 6, 2024

TL;DR

Stable-Pose introduces a transformer-based adapter with coarse-to-fine attention masking to improve pose-guided text-to-image generation, significantly enhancing accuracy in complex human pose scenarios.

Contribution

The paper proposes an adapter model that applies hierarchical (coarse-to-fine) attention masking to improve pose guidance in T2I models, using the self-attention of a vision Transformer (ViT) to build a detailed pose representation.

Findings

01 Achieves an AP score of 57.1 on LAION-Human, 13% higher than ControlNet.

02 Effectively handles complex pose conditions such as side and rear views.

03 Demonstrates superior performance across five public datasets.

Abstract

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose…
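
The coarse-to-fine attention masking mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`dilate`, `coarse_to_fine_masks`, `masked_self_attention`), the dilation radii, the use of a square dilation window, and the choice to mask keys are all assumptions made for illustration, and the tokens are used directly as queries, keys, and values without learned projections. The idea shown is only that a binary pose mask over image patches is widened for early layers and tightened for later ones, and that each mask restricts which patches self-attention may attend to.

```python
import math


def dilate(mask, radius):
    """Grow a binary 2D patch mask by `radius` patches (square window)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                for ni in range(max(0, i - radius), min(h, i + radius + 1)):
                    for nj in range(max(0, j - radius), min(w, j + radius + 1)):
                        out[ni][nj] = 1
    return out


def coarse_to_fine_masks(pose_mask, radii=(2, 1, 0)):
    """One mask per layer, from widest (coarse) to tightest (fine).

    The radii are hypothetical values chosen for this example.
    """
    return [dilate(pose_mask, r) for r in radii]


def flatten(mask):
    """Flatten a 2D patch mask into one 0/1 entry per token."""
    return [cell for row in mask for cell in row]


def masked_self_attention(tokens, keep):
    """Single-head self-attention in which only kept tokens serve as keys.

    tokens: list of equal-length float vectors, used as Q, K and V
            (no learned projections; an assumption to keep the sketch short).
    keep:   flat 0/1 list; keep[j] == 0 removes token j as a key.
    """
    if not any(keep):
        raise ValueError("mask must keep at least one token")
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        # Masked keys get a score of -inf, so their softmax weight is 0.
        scores = [
            scale * sum(a * b for a, b in zip(q, k)) if keep[j] else -math.inf
            for j, k in enumerate(tokens)
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(wt * v[c] for wt, v in zip(weights, tokens)) / total
            for c in range(dim)
        ])
    return out
```

In use, a pose mask with a single marked patch in a 5x5 grid yields masks covering 25, 9, and 1 patches for radii 2, 1, and 0. Passing `flatten(mask)` as `keep` for successive layers narrows attention from the region around the skeleton to the skeleton itself.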

Code & Models

Repositories

ai-med/stablepose (PyTorch, official)

Videos

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation · SlidesLive

Taxonomy

Topics: Human Motion and Animation · Video Analysis and Summarization · Advanced Vision and Imaging

Methods: Attention Is All You Need · Softmax · Adapter · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Diffusion · Adam · Residual Connection