Text-driven 3D Human Generation via Contrastive Preference Optimization
Pengfei Zhou, Xukun Shen, Yong Hu

TL;DR
This paper introduces a contrastive preference optimization framework that enhances 3D human generation from text by improving alignment and realism, especially for complex descriptions, through preference-guided score distillation sampling.
Contribution
The paper proposes a preference optimization module, complemented by a negation preference module, to better align 3D models with complex textual inputs, addressing limitations of existing SDS methods.
Findings
Achieves state-of-the-art alignment accuracy.
Improves texture realism and visual fidelity.
Effectively handles long and complex textual descriptions.
Abstract
Recent advances in Score Distillation Sampling (SDS) have improved 3D human generation from textual descriptions. However, existing methods still face challenges in accurately aligning 3D models with long and complex textual inputs. To address this challenge, we propose a novel framework that introduces contrastive preferences, where human-level preference models, guided by both positive and negative prompts, assist SDS for improved alignment. Specifically, we design a preference optimization module that integrates multiple models to comprehensively capture the full range of textual features. Furthermore, we introduce a negation preference module to mitigate over-optimization of irrelevant details by leveraging static-dynamic negation prompts, effectively preventing "reward hacking". Extensive experiments demonstrate that our method achieves state-of-the-art results, significantly…
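The abstract does not give the exact objective, so the following is only a minimal sketch of the general idea it describes: a preference score that rewards agreement with the positive prompt, penalizes agreement with negation prompts, and is then combined with an SDS loss. Cosine similarity between embeddings stands in for a learned preference model, and the names `contrastive_preference`, `total_loss`, `neg_weight`, and `pref_weight` are hypothetical and not taken from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_preference(image_emb, pos_emb, neg_embs, neg_weight=0.5):
    """Assumed contrastive score: similarity to the positive prompt minus
    a weighted mean similarity to the negation prompts."""
    pos = cosine(image_emb, pos_emb)
    neg = float(np.mean([cosine(image_emb, n) for n in neg_embs])) if len(neg_embs) else 0.0
    return pos - neg_weight * neg

def total_loss(sds_loss, preference, pref_weight=1.0):
    """Assumed combination: a higher preference score lowers the loss
    that would be minimized alongside the SDS term."""
    return sds_loss - pref_weight * preference

# Toy 2-D embeddings: the render matches the positive prompt exactly.
render = [1.0, 0.0]
score_clean = contrastive_preference(render, [1.0, 0.0], [[0.0, 1.0]])
score_hacked = contrastive_preference(render, [1.0, 0.0], [[1.0, 0.0]])
```

In this toy setup, a render that also matches a negation prompt (`score_hacked`) receives a lower score than one that does not (`score_clean`), which is the behavior a negation term is meant to encourage. An ensemble of several preference models, as the abstract mentions, could be approximated by averaging several such scores.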
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Video Analysis and Summarization
