TL;DR
SCFlow introduces a flow-based model that learns to merge style and content representations invertibly, enabling natural disentanglement without explicit supervision, and demonstrates strong zero-shot generalization on large image datasets.
Contribution
It proposes a novel flow-matching framework that bypasses explicit disentanglement by learning invertible style-content merging, supported by a large synthetic dataset.
Findings
Disentanglement emerges naturally from invertible merging.
SCFlow generalizes zero-shot to ImageNet-1k and WikiArt.
It achieves competitive performance without explicit disentanglement supervision.
Abstract
Explicitly disentangling style and content in vision models remains challenging due to their semantic overlap and the subjectivity of human perception. Existing methods propose separation through generative or discriminative objectives, but they still face the inherent ambiguity of disentangling intertwined concepts. Instead, we ask: Can we bypass explicit disentanglement by learning to merge style and content invertibly, allowing separation to emerge naturally? We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. Our approach is built upon three key insights: 1) Training solely to merge style and content, a well-defined task, enables invertible disentanglement without explicit supervision; 2) flow matching bridges on arbitrary distributions, avoiding the restrictive Gaussian priors of diffusion models and…
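To make the flow-matching idea in the abstract concrete, here is a minimal NumPy sketch, not SCFlow's actual implementation. It assumes a straight-line interpolation path between paired samples, a toy "merge" that is just a constant shift, and an oracle velocity field in place of a trained network. All function and variable names (`interpolate`, `flow_matching_loss`, `integrate`, `oracle`) are hypothetical and chosen only for illustration. The point it shows is that a velocity field trained in the merging direction can be integrated backwards in time to recover the separated representation.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target along the straight-line path: a constant velocity."""
    return x1 - x0


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    return float(np.mean((velocity_fn(xt, t) - target_velocity(x0, x1)) ** 2))


def integrate(velocity_fn, x, t_start, t_end, steps=100):
    """Euler integration of dx/dt = v(x, t); t_end < t_start runs the flow in reverse."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = x + dt * velocity_fn(x, t)
        t += dt
    return x


# Toy data (assumption): "disentangled" samples x0, "merged" samples x1 = x0 + shift.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 4))
shift = np.array([1.0, -2.0, 0.5, 3.0])
x1 = x0 + shift
t = rng.uniform(size=(8, 1))  # one time value per sample


def oracle(x, t):
    """Stand-in for a trained network: the exact velocity field of the toy problem."""
    return np.broadcast_to(shift, x.shape)


loss = flow_matching_loss(oracle, x0, x1, t)
merged = integrate(oracle, x0, 0.0, 1.0)         # forward: merge
recovered = integrate(oracle, merged, 1.0, 0.0)  # backward: separate
```

In this toy setting the oracle drives the loss to zero, and integrating from t=1 back to t=0 returns the original samples. In a real system the oracle would be replaced by a learned network, and the endpoints would be encoder features rather than Gaussian noise.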
