Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation
Jiajun Wang, Morteza Ghahremani, Yitong Li, Björn Ommer, Christian Wachinger

TL;DR
Stable-Pose introduces a transformer-based adapter with coarse-to-fine attention masking to improve pose-guided text-to-image generation, significantly enhancing accuracy in complex human pose scenarios.
Contribution
The paper proposes an adapter model with hierarchical (coarse-to-fine) attention masking that improves pose guidance in T2I models, using the self-attention of a vision Transformer (ViT) to capture how the parts of a human pose relate to one another.
Findings
Achieves an AP score of 57.1 on LAION-Human, 13% higher than ControlNet.
Handles complex pose conditions such as side and rear views effectively.
Demonstrates superior performance across five public datasets.
Abstract
Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose…
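The coarse-to-fine attention masking described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that the masks come from dilating a binary skeleton map on the patch grid with shrinking radii, and that each stage is a single-head self-attention with a residual connection. The function names (`dilate`, `masked_self_attention`, `coarse_to_fine`), the radii, and the toy skeleton are all hypothetical.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a square (2r+1) x (2r+1) window (illustrative choice)."""
    if radius == 0:
        return mask.copy()
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def masked_self_attention(tokens, keep):
    """Single-head self-attention in which keys outside `keep` get zero weight."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = np.where(keep[None, :], scores, -np.inf)  # hide non-pose keys
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens, weights

def coarse_to_fine(tokens, skeleton, radii=(2, 1, 0)):
    """Apply masked attention with progressively tighter pose masks."""
    masks, weights = [], None
    for r in radii:
        keep = dilate(skeleton, r).reshape(-1)
        attended, weights = masked_self_attention(tokens, keep)
        tokens = tokens + attended                     # residual connection
        masks.append(keep)
    return tokens, masks, weights

# Toy example: an 8 x 8 patch grid with one vertical "limb" (assumed data).
rng = np.random.default_rng(0)
grid = 8
skeleton = np.zeros((grid, grid), dtype=bool)
skeleton[1:7, 4] = True
tokens = rng.normal(size=(grid * grid, 16))

out, masks, last_weights = coarse_to_fine(tokens, skeleton)
print([int(m.sum()) for m in masks])  # number of patches kept at each stage
```

In this sketch the first stage attends to a wide band around the skeleton and the last stage attends only to the skeleton patches themselves, which is one simple way to read "coarse-to-fine". The paper may construct and schedule its masks differently.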
Taxonomy
Topics: Human Motion and Animation · Video Analysis and Summarization · Advanced Vision and Imaging
Methods: Attention Is All You Need · Softmax · Adapter · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Diffusion · Adam · Residual Connection
