ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
Tianze Xia, Zijian Ning, Zonglin Zhao, Mingjia Wang

TL;DR
ASTRA is a novel framework that improves multi-subject image generation by disentangling appearance and pose using retrieval augmentation and specialized positional encoding within a diffusion transformer.
Contribution
It introduces a dual-strategy approach combining retrieval-augmented pose priors and a new asymmetric position embedding to better preserve identity and pose in generated images.
Findings
Achieves state-of-the-art pose adherence on a COCO-based benchmark.
Maintains high identity fidelity and text alignment in experiments.
Outperforms existing methods in complex multi-subject pose generation.
Abstract
Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion and pose distortion, as appearance and structure signals become entangled within the model's architecture. To resolve this conflict, we introduce ASTRA (Adaptive Synthesis through Targeted Retrieval Augmentation), a novel framework that architecturally disentangles subject appearance from pose structure within a unified Diffusion Transformer. ASTRA achieves this through a dual-pronged strategy. It first employs a Retrieval-Augmented Pose (RAG-Pose) pipeline to provide a clean, explicit structural…
