Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

Yunlong Wang; Jinjin Shi; Wenbin Gao; Xuran Xu; Runyu Shi; Ying Huang

arXiv:2605.20640·cs.CV·May 21, 2026

TL;DR

This paper introduces a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT) that improves text-image alignment, realism, and aesthetics in portrait generation simultaneously, with no additional inference cost.

Contribution

The paper proposes a lightweight cross-modal alignment mechanism and aesthetic signal mining to overcome the trade-offs among alignment, realism, and aesthetics in text-to-image portrait generation.

Findings

01. Improves text-image alignment, realism, and aesthetics simultaneously.

02. Maintains original model generalization without overfitting.

03. Achieves state-of-the-art results on MM-DiT benchmarks.

Abstract

Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the…
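The abstract describes supervision applied to the image branch only during training, which is why inference cost is unchanged. A minimal sketch of that general idea follows; it is not the paper's implementation. The cosine-based loss, the mean pooling, the linear projection `weight`, the weighting factor `lam`, and all function names are illustrative assumptions, and the text feature stands in for a representation that would come from a frozen encoder such as SigLIP 2.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def project(vec, weight):
    """Apply a linear map; weight is a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def alignment_loss(image_tokens, text_feature, weight):
    """Hypothetical auxiliary loss: 1 - cos(projected image feature, text feature).

    image_tokens : hidden states from the image branch (assumed shape: tokens x dim)
    text_feature : vision-aligned text representation (assumed precomputed and frozen)
    weight       : training-only projection head, discarded at inference
    """
    pooled = mean_pool(image_tokens)
    projected = project(pooled, weight)
    return 1.0 - cosine(projected, text_feature)

def training_loss(diffusion_loss, image_tokens, text_feature, weight, lam=0.1):
    """Assumed combination: base diffusion loss plus a weighted alignment term."""
    return diffusion_loss + lam * alignment_loss(image_tokens, text_feature, weight)
```

Because the projection head and the text feature are used only to compute the training loss, a model trained this way would run the same forward pass at inference as the unmodified model, which is consistent with the zero-overhead claim in the abstract.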
