CoDi: Subject-Consistent and Pose-Diverse Text-to-Image Generation

Zhanxin Gao; Beier Zhu; Liang Yao; Jian Yang; Ying Tai

arXiv:2507.08396·cs.CV·February 3, 2026

TL;DR

CoDi introduces a two-stage diffusion-based framework that achieves subject consistency and pose diversity in text-to-image generation, enhancing visual storytelling without sacrificing layout variety.

Contribution

It presents a novel two-stage method using optimal transport and feature refinement to improve subject consistency and pose diversity in T2I models.

Findings

01. Outperforms existing methods in subject consistency and pose diversity.

02. Maintains high prompt fidelity and visual quality.

03. Demonstrates effectiveness through extensive qualitative and quantitative evaluations.

Abstract

Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse poses and layouts. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose…
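The abstract says only that IT uses optimal transport to move identity features onto each target image. As a rough illustration of what such a step can look like, the sketch below runs entropic optimal transport (Sinkhorn iterations) between a few toy "target" and "reference" feature vectors and then replaces each target feature with the plan-weighted average of the reference features. The cost (1 minus cosine similarity), the uniform marginals, the regularization `eps`, and every function name here are assumptions made for illustration; none of them is taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropic OT plan with uniform marginals (assumed, not from the paper).

    Rows index target tokens, columns index reference tokens.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass held by each target token
    b = [1.0 / m] * m  # mass held by each reference token
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_identity(target_feats, ref_feats, eps=0.1):
    """Hypothetical helper: map reference features onto target positions."""
    cost = [[1.0 - cosine(t, r) for r in ref_feats] for t in target_feats]
    plan = sinkhorn_plan(cost, eps)
    dim = len(ref_feats[0])
    moved = []
    for row in plan:
        mass = sum(row)
        # Barycentric projection: plan-weighted average of reference features.
        moved.append([
            sum(row[j] * ref_feats[j][d] for j in range(len(ref_feats))) / mass
            for d in range(dim)
        ])
    return plan, moved

# Toy features: three target tokens and three reference tokens in 2-D.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference = [[1.0, 0.1], [0.1, 1.0], [0.9, 1.1]]
plan, moved = transport_identity(target, reference)
```

In this toy setup each target token keeps its own position in the output list (standing in for layout or pose), while its feature content is drawn from the reference tokens it is most strongly coupled with.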

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The key innovation is the explicit decoupling of identity alignment into coarse-grained transport in early steps and fine-grained refinement in later steps, which is a well-motivated approach based on the progressive nature of diffusion models.
2. A significant strength is the superior balance it achieves. As claimed, the paper provides strong evidence that the method outperforms existing training-free baselines in subject consistency while preserving significantly greater pose diversity and te

Weaknesses

1. For long-story generation scenarios, it is crucial to maintain consistency in both character identity and their apparel. However, the results presented in the paper demonstrate that the method primarily ensures identity consistency, while the consistency of clothing remains inadequate. In my opinion, this limitation would significantly restrict the method's practicality in long-story applications.
2. The method has a core reliance on binary masks derived from cross-attention maps to extract id

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This work proposes an effective method to improve pose diversity.
2. This work is clearly expressed and easy to understand.
3. This work introduces optimal transport into subject-consistent generation.

Weaknesses

1. This model was tested on SDXL, but its effectiveness was not verified on the DiT architecture.
2. The qualitative and quantitative experimental results of this work did not show significant improvement.
3. The long description in lines L126-L131 seems informal in the main text.

Reviewer 03 · Rating 2 · Confidence 5

Strengths

- The paper is well written and easy to follow.
- The method proposed in the paper is training-free and can be directly applied during inference.
- The paper includes comprehensive comparative experiments and ablation studies. The design of evaluation metrics also demonstrates certain insights, especially regarding "pose diversity".

Weaknesses

- Optimal Transport (OT) essentially addresses the optimization problem of transforming one probability distribution into another with minimal cost. However, the problem in **IT** is to find the feature matching relationship between the reference image and the target image. Clearly, a straightforward ranking based on cosine similarity would be simpler and more efficient. In contrast, the "globally optimal transport" property of OT not only complicates the problem but may also introduce redundanc
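The contrast drawn in this review, per-token ranking by cosine similarity versus a globally constrained transport plan, can be made concrete on a toy similarity matrix. The sketch below is an illustration only: the matrix values, the uniform marginals, the regularization `eps`, and the function names are all assumptions, and the example does not settle which approach suits IT better.

```python
import math

def hard_match(scores):
    """For each target token (row), pick the highest-scoring reference token (column)."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in scores]

def ot_plan(sim, eps=0.1, iters=200):
    """Entropic OT plan with uniform marginals; cost = 1 - similarity (assumed)."""
    n, m = len(sim), len(sim[0])
    K = [[math.exp(-(1.0 - s) / eps) for s in row] for row in sim]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Sinkhorn updates: rescale rows, then columns, toward uniform mass.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical cosine similarities: rows are target tokens, columns are reference tokens.
sim = [[0.90, 0.80],
       [0.95, 0.50]]

greedy = hard_match(sim)    # each row decides independently
plan = ot_plan(sim)         # rows and columns must both carry uniform mass
ot_match = hard_match(plan) # strongest coupling per target token
```

On this matrix both target tokens select reference token 0 under independent ranking, so reference token 1 goes unused, whereas the transport plan places most of the first target token's mass on reference token 1 because every reference token must give away its full share. The ranking rule is cheaper, as the reviewer notes; the plan trades that simplicity for the marginal constraint.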

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis