An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild
Aviv Gabbay, Niv Cohen, Yedid Hoshen

TL;DR
This paper introduces a method for disentangling factors of variation in images using limited supervision and off-the-shelf image descriptors, enabling effective manipulation of real-world images with minimal manual annotation.
Contribution
It proposes a method that disentangles a set of partially labeled factors while separating the complementary residual factors that are never explicitly specified, and it uses CLIP embeddings to annotate a subset of attributes in real images zero-shot.
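The zero-shot annotation step can be pictured as matching image embeddings against text-prompt embeddings. The sketch below is illustrative only: it assumes CLIP image and prompt embeddings are already computed as arrays, and it uses a hypothetical helper `zero_shot_annotate` with a hypothetical top-1 vs. top-2 similarity `margin` rule that leaves ambiguous images unlabeled. That rule is our assumption and not necessarily the criterion the paper uses.

```python
import numpy as np

def zero_shot_annotate(image_emb, prompt_emb, margin=0.05):
    """Hypothetical helper: label each image with its most similar prompt.

    image_emb:  (N, D) image embeddings (assumed precomputed, e.g. by a CLIP image encoder)
    prompt_emb: (K, D) text embeddings, one per attribute value, K >= 2 (assumed precomputed)
    Returns an (N,) integer array of attribute values, with -1 meaning "unlabeled"
    when the gap between the best and second-best similarity is below `margin`.
    """
    # L2-normalize so that dot products equal cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    sims = img @ txt.T  # (N, K) cosine similarities

    # Rank prompts per image from most to least similar
    order = np.argsort(-sims, axis=1)
    rows = np.arange(sims.shape[0])
    top1 = sims[rows, order[:, 0]]
    top2 = sims[rows, order[:, 1]]

    labels = order[:, 0].copy()
    labels[(top1 - top2) < margin] = -1  # too ambiguous: keep the attribute unlabeled
    return labels

# Toy usage with 2-D "embeddings": the third image is equally close to both prompts.
images = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
prompts = np.array([[1.0, 0.0], [0.0, 1.0]])
print(zero_shot_annotate(images, prompts))
```

Leaving low-confidence images unlabeled is what makes the resulting annotation partial, which matches the setting the method is designed for.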
Findings
Achieves state-of-the-art disentangled image manipulation results.
Demonstrates effectiveness on synthetic benchmarks and real face images.
Uses minimal manual effort for attribute annotation.
Abstract
Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative approach, recent methods rely on limited supervision to disentangle the factors of variation and allow their identifiability. While annotating the true generative factors is only required for a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To this end, we propose a method for disentangling a set of factors which are only partially labeled, as well as separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, gives rise to leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains…
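A common way to train with factors that are only partially labeled is to apply supervision only where a label exists. The sketch below shows a generic masked cross-entropy for one attribute, with `-1` marking unlabeled samples. The function name `masked_attribute_loss` and the loss itself are illustrative assumptions, not the paper's objective, and the sketch does not cover how the residual factors are separated.

```python
import numpy as np

def masked_attribute_loss(logits, labels):
    """Hypothetical helper: mean cross-entropy over labeled samples only.

    logits: (N, K) unnormalized scores for one attribute with K values
    labels: (N,) integers in [0, K), or -1 where the attribute is unlabeled
    """
    mask = labels >= 0
    if not mask.any():
        return 0.0  # no supervision for this attribute in the batch

    z = logits[mask]
    y = labels[mask]
    # Numerically stable log-softmax
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the given labels, averaged over labeled samples
    return float(-log_probs[np.arange(len(y)), y].mean())

# Toy usage: the unlabeled middle sample contributes nothing to the loss.
logits = np.zeros((3, 2))
labels = np.array([0, -1, 1])
print(masked_attribute_loss(logits, labels))  # uniform predictions give log(2)
```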
Taxonomy
Topics: Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques
