TL;DR
Ar2Can is a two-stage framework for multi-human image generation: an Architect plans where each person appears, and an Artist renders their identities. It improves count accuracy and identity preservation while training on synthetic data only.
Contribution
The paper disentangles spatial planning from identity rendering and introduces a spatially-grounded face matching reward, which combines Hungarian spatial alignment with identity similarity, to improve multi-human scene quality and identity preservation.
Findings
Significant improvements in count accuracy and identity preservation.
High perceptual quality of generated images.
Effective use of synthetic data without real multi-human images.
Abstract
Despite recent advances in personalized image generation, existing models consistently fail to produce reliable multi-human scenes, often merging or losing facial identity. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect predicts structured layouts, specifying where each person should appear. The Artist then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model. This is optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and…
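The spatially-grounded face matching reward and the group-relative optimization described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the use of face centers and Euclidean distance as the spatial cost, cosine similarity as the identity score, and the zero reward for missing faces are all assumptions made for illustration. The assignment step uses exhaustive search over permutations in place of the Hungarian algorithm; for a handful of faces it returns the same minimum-cost matching. The advantage function shows the standard group-relative normalization used in GRPO (reward minus group mean, divided by group standard deviation).

```python
from itertools import permutations
from math import hypot, sqrt
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_faces(planned_centers, detected_centers):
    """Minimum-cost one-to-one assignment of planned slots to detected faces.

    Brute-force stand-in for the Hungarian algorithm (assumption: small
    face counts). Returns a tuple where entry i is the index of the
    detected face matched to planned slot i, or None if too few faces.
    """
    n = len(planned_centers)
    if len(detected_centers) < n:
        return None
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(detected_centers)), n):
        cost = sum(
            hypot(px - detected_centers[j][0], py - detected_centers[j][1])
            for (px, py), j in zip(planned_centers, perm)
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return best


def face_matching_reward(planned_centers, ref_embeddings,
                         detected_centers, detected_embeddings):
    """Mean identity similarity between each reference and the face
    rendered at its planned location (hypothetical reward shape)."""
    if not planned_centers:
        return 0.0
    assignment = match_faces(planned_centers, detected_centers)
    if assignment is None:  # assumption: missing faces earn no reward
        return 0.0
    sims = [
        cosine(ref_embeddings[i], detected_embeddings[j])
        for i, j in enumerate(assignment)
    ]
    return mean(sims)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two planned slots; the detector returns the faces in the opposite order.
planned = [(0.25, 0.5), (0.75, 0.5)]
refs = [[1.0, 0.0], [0.0, 1.0]]
centers = [(0.8, 0.5), (0.2, 0.5)]
good = face_matching_reward(planned, refs, centers, [[0.0, 1.0], [1.0, 0.0]])
swapped = face_matching_reward(planned, refs, centers, [[1.0, 0.0], [0.0, 1.0]])
advantages = group_relative_advantages([good, swapped])
```

Because the matching is driven by location rather than by identity, the sample whose identities are rendered in swapped positions scores lower than the one that places each identity where the layout requested it, which is the behavior the spatial grounding is meant to enforce.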
