Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts
Kiran Chhatre, Christopher Peters, Srikrishna Karanam

TL;DR
Spectrum leverages a fine-tuned 3D texture diffusion model to produce detailed, open-vocabulary human parsing, distinguishing diverse clothing and body parts with high accuracy across multiple datasets.
Contribution
The paper introduces Spectrum, a novel network that repurposes a 3D texture diffusion model for detailed, part-level human parsing with open-vocabulary capabilities.
Findings
Outperforms baseline methods in prompt-based segmentation tasks.
Effectively distinguishes diverse clothing categories and body parts.
Maintains faithful correspondence to input images for accurate parsing.
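To make "prompt-based segmentation" concrete, here is a minimal sketch of the general idea: each pixel is labeled with whichever text prompt its feature vector most resembles. It assumes that per-pixel features and prompt embeddings already live in a shared space. This is a generic open-vocabulary labeling step, not Spectrum's actual architecture, and every name in it (`cosine`, `label_pixels`, the toy prompts and vectors) is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def label_pixels(pixel_feats, prompt_embeds):
    """Label every pixel with its best-matching prompt.

    pixel_feats:   H x W grid (list of rows) of D-dimensional feature lists.
    prompt_embeds: dict mapping a prompt string to a D-dimensional embedding.
    Both inputs are stand-ins for what a real feature extractor and text
    encoder would produce; that is an assumption of this sketch.
    """
    prompts = list(prompt_embeds)
    return [
        [max(prompts, key=lambda name: cosine(feat, prompt_embeds[name])) for feat in row]
        for row in pixel_feats
    ]

# Toy data: three prompts with hand-made, orthogonal embeddings.
prompt_embeds = {
    "hair": [1.0, 0.0, 0.0],
    "denim jacket": [0.0, 1.0, 0.0],
    "left hand": [0.0, 0.0, 1.0],
}
# A 2 x 2 "image" of 3-dimensional pixel features.
pixel_feats = [
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]],
    [[0.1, 0.7, 0.2], [0.0, 0.2, 0.9]],
]
labels = label_pixels(pixel_feats, prompt_embeds)
```

Because the label set is just the keys of `prompt_embeds`, adding a new clothing type or body part means adding a new prompt rather than retraining a fixed-category classifier, which is the sense in which such parsing is open-vocabulary.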
Abstract
Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger…
Taxonomy
Topics: 3D Shape Modeling and Analysis · Image Processing and 3D Reconstruction · Generative Adversarial Networks and Image Synthesis
