Rethink Sparse Signals for Pose-guided Text-to-image Generation
Wenjie Xuan, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao

TL;DR
This paper introduces SP-Ctrl, a method that strengthens sparse pose signals for text-to-image generation, improving pose control and alignment while avoiding the editing difficulties and text-prompt inconsistencies associated with dense representations.
Contribution
We propose a learnable spatial representation and keypoint concept learning to improve sparse pose guidance in text-to-image generation, outperforming recent methods.
Findings
Outperforms recent spatially controllable T2I methods with sparse pose guidance
Matches the performance of dense signal-based methods in pose-guided generation
Demonstrates effective cross-species and diverse generation capabilities
Abstract
Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remain underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint,…
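To make the idea of a "learnable spatial representation" concrete, here is a minimal sketch, not the authors' implementation. It assumes each keypoint type owns one embedding vector and that the pose is rendered by stamping that vector into a small disc around the keypoint's pixel location, instead of drawing a fixed OpenPose color. The function name `render_keypoint_embedding_map`, the `radius` parameter, and the use of a plain NumPy array in place of trainable parameters are all illustrative assumptions.

```python
import numpy as np


def render_keypoint_embedding_map(keypoints, embeddings, height, width, radius=2):
    """Scatter per-keypoint embedding vectors onto an (H, W, D) spatial map.

    keypoints:  (K, 2) array of (x, y) pixel coordinates; NaN marks a missing keypoint.
    embeddings: (K, D) table with one vector per keypoint type. In a real model
                this table would be trainable; here it is a fixed array (assumption).
    """
    num_keypoints, dim = embeddings.shape
    pose_map = np.zeros((height, width, dim), dtype=embeddings.dtype)
    ys, xs = np.mgrid[0:height, 0:width]  # pixel coordinate grids

    for k in range(num_keypoints):
        x, y = keypoints[k]
        if np.isnan(x) or np.isnan(y):
            continue  # skip keypoints that were not detected
        # Disc of the given radius centered on the keypoint.
        mask = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
        # Where discs overlap, later keypoints overwrite earlier ones.
        pose_map[mask] = embeddings[k]
    return pose_map


# Hypothetical toy pose: three keypoint types, the third one missing.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 8)).astype(np.float32)
keypoints = np.array([[4.0, 4.0], [12.0, 6.0], [np.nan, np.nan]])
pose_map = render_keypoint_embedding_map(keypoints, embeddings, 16, 16, radius=1)
print(pose_map.shape)
```

In this sketch the resulting map would play the role of the conditioning image passed to a ControlNet-style branch. How SP-Ctrl actually builds and trains its representation is not specified in this excerpt.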
Taxonomy
Topics: Human Motion and Animation · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications
Methods: OpenPose
