Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

Hanzhong Guo; Yizhou Yu

arXiv:2605.20807·cs.CV·May 21, 2026

TL;DR

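This paper introduces a two-stage framework for subject-driven image generation that improves detail preservation by decoupling structure from appearance: it first predicts a Canny map, then renders the final image conditioned on that map and the source appearance. A new 100k-pair text-aware dataset supports text handling.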

Contribution

The authors propose intermediate structural prediction (a predicted Canny map) as a conditioning signal, plus a fully automatic pipeline that builds a 100k-pair text-aware dataset, to improve high-fidelity subject-driven image synthesis.

Findings

01. Experiments show clear gains over selected baselines.

02. GPT-4.1-based evaluation supports the effectiveness of intermediate structural prediction.

03. A knowledge distillation study indicates better detail preservation.

Abstract

Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will…
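The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the names `TwoStagePipeline`, `toy_edge_map`, `placeholder_structure`, and `placeholder_render` are hypothetical, the edge detector is a crude forward-difference threshold standing in for a real Canny detector, and both stages are trivial placeholders where the paper would use learned generative models.

```python
from dataclasses import dataclass
from typing import Callable, List

Gray = List[List[float]]   # H x W grayscale image, values in [0, 1]
Edges = List[List[int]]    # H x W binary edge map


def toy_edge_map(image: Gray, threshold: float = 0.5) -> Edges:
    """Binary edge map from forward differences.

    A crude stand-in for Canny: no smoothing, no non-maximum
    suppression, no hysteresis thresholding.
    """
    h, w = len(image), len(image[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Differences are clamped to zero at the right and bottom borders.
            dx = image[y][min(x + 1, w - 1)] - image[y][x]
            dy = image[min(y + 1, h - 1)][x] - image[y][x]
            if (dx * dx + dy * dy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


@dataclass
class TwoStagePipeline:
    # Stage 1 (assumed interface): source image + prompt -> structure map.
    predict_structure: Callable[[Gray, str], Edges]
    # Stage 2 (assumed interface): source + structure + prompt -> final image.
    render: Callable[[Gray, Edges, str], Gray]

    def __call__(self, source: Gray, prompt: str) -> Gray:
        structure = self.predict_structure(source, prompt)
        return self.render(source, structure, prompt)


def placeholder_structure(source: Gray, prompt: str) -> Edges:
    # Placeholder: reuses the source's own edges and ignores the prompt.
    # A real stage 1 would predict the structure of the edited target image.
    return toy_edge_map(source)


def placeholder_render(source: Gray, structure: Edges, prompt: str) -> Gray:
    # Placeholder: copies source appearance and darkens edge pixels.
    return [
        [0.0 if e else v for v, e in zip(row_v, row_e)]
        for row_v, row_e in zip(source, structure)
    ]


if __name__ == "__main__":
    src = [[0.0, 0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0, 1.0]]
    pipeline = TwoStagePipeline(placeholder_structure, placeholder_render)
    print(pipeline(src, "the same mug on a wooden table"))
```

The point of the sketch is the interface: the renderer receives both the source appearance and the predicted structure, so fine details such as logos and text have an explicit structural target rather than being regenerated directly in RGB space.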
