Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

Yijia Xu; Zihao Wang; Jinshi Cui

arXiv:2602.03448 · cs.CV · February 4, 2026

Open Access

TL;DR

This paper introduces a hierarchical guidance framework for multi-subject image generation that enhances identity consistency and compositional control by explicitly linking high-level concepts to detailed appearances using structured supervision.

Contribution

The paper proposes Hierarchical Concept-to-Appearance Guidance (CAG), a framework that combines a VAE dropout training strategy with a correspondence-aware attention module to improve multi-subject image synthesis.

Findings

01 Achieves state-of-the-art results in multi-subject image generation

02 Improves prompt-following accuracy and subject consistency

03 Enhances attribute binding through structured supervision

Abstract

Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we…
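
The VAE dropout strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that reference VAE features are randomly omitted during training so the model leans on VLM semantics; everything else here is an assumption made for illustration: dropping each reference independently, the `drop_prob` parameter, using `None` as the placeholder for an omitted feature, and the function names `drop_reference_vae_features` and `build_conditioning`. It is not the authors' implementation.

```python
import random
from typing import Dict, List, Optional, Sequence


def drop_reference_vae_features(
    vae_feats: Sequence[Sequence[float]],
    drop_prob: float,
    rng: random.Random,
) -> List[Optional[List[float]]]:
    """Randomly omit per-reference VAE features (hypothetical helper).

    Assumption: each reference is dropped independently with probability
    `drop_prob`; an omitted feature is represented by None.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    return [None if rng.random() < drop_prob else list(f) for f in vae_feats]


def build_conditioning(
    vlm_feats: Sequence[Sequence[float]],
    vae_feats: Sequence[Sequence[float]],
    drop_prob: float,
    rng: random.Random,
) -> List[Dict[str, Optional[List[float]]]]:
    """Pair each subject's VLM feature with its (possibly omitted) VAE feature.

    The VLM feature is always kept, so the semantic signal stays available
    even when the appearance feature is missing for that training step.
    """
    kept = drop_reference_vae_features(vae_feats, drop_prob, rng)
    return [{"vlm": list(v), "vae": a} for v, a in zip(vlm_feats, kept)]


if __name__ == "__main__":
    rng = random.Random(0)
    vlm = [[1.0, 2.0], [3.0, 4.0]]   # toy semantic features, one per subject
    vae = [[0.1, 0.2], [0.3, 0.4]]   # toy appearance features, one per subject
    for cond in build_conditioning(vlm, vae, drop_prob=0.5, rng=rng):
        print(cond)
```

In a real training loop the omitted feature would more likely be masked or zeroed inside the conditioning tensor rather than set to `None`; that choice, like the drop probability, is not specified in the abstract.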

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis