Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation
Yijia Xu, Zihao Wang, Jinshi Cui

TL;DR
This paper introduces a hierarchical guidance framework for multi-subject image generation that enhances identity consistency and compositional control by explicitly linking high-level concepts to detailed appearances using structured supervision.
Contribution
The paper proposes Hierarchical Concept-to-Appearance Guidance (CAG), a framework that combines a VAE dropout training strategy with a correspondence-aware attention module to improve multi-subject image synthesis.
Findings
Achieves state-of-the-art results in multi-subject image generation
Improves prompt-following accuracy and subject consistency
Enhances attribute binding through structured supervision
Abstract
Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we…
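The VAE dropout idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `vae_dropout`, the per-subject dropout decision, the `drop_prob` parameter, and the use of `None` to mark omitted features are all assumptions made for illustration. The abstract states only that reference VAE features are randomly omitted during training while VLM semantic signals remain available.

```python
import random
from typing import List, Optional, Sequence, Tuple

Feature = List[float]  # stand-in for a feature tensor (assumption)


def vae_dropout(
    vae_feats: Sequence[Feature],
    vlm_feats: Sequence[Feature],
    drop_prob: float,
    rng: random.Random,
) -> List[Tuple[Optional[Feature], Feature]]:
    """Pair each subject's VLM features with its VAE features,
    or with None when the VAE features are omitted for this step.

    Hypothetical helper: dropping per subject and marking the omission
    with None are illustrative choices, not details from the paper.
    """
    if len(vae_feats) != len(vlm_feats):
        raise ValueError("need one VAE and one VLM feature per reference subject")
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")

    conditioned = []
    for vae, vlm in zip(vae_feats, vlm_feats):
        # Randomly omit the appearance (VAE) features; the semantic
        # (VLM) features are always kept.
        dropped = rng.random() < drop_prob
        conditioned.append((None if dropped else list(vae), list(vlm)))
    return conditioned


if __name__ == "__main__":
    rng = random.Random(0)
    vae = [[0.1, 0.2], [0.3, 0.4]]  # toy appearance features, two subjects
    vlm = [[1.0], [2.0]]            # toy semantic features, two subjects
    for vae_f, vlm_f in vae_dropout(vae, vlm, drop_prob=0.5, rng=rng):
        print("VAE:", vae_f, "| VLM:", vlm_f)
```

In this sketch a training loop would call `vae_dropout` once per step, so the model sometimes has to generate from semantic features alone, which is the behavior the abstract attributes to the strategy.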
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
