Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss
James Baker

TL;DR
This paper introduces a style ambiguity training method for text-to-image models that needs neither a classifier nor a labeled dataset. According to automated metrics for human judgment, it improves on the traditional classifier-based approach while maintaining creativity and novelty.
Contribution
The work proposes a new style ambiguity training objective that does not rely on classifiers or labeled data, improving creativity in diffusion models.
Findings
Enhanced creativity in diffusion models demonstrated
Outperforms traditional style ambiguity methods
Maintains creativity and novelty
Abstract
Teaching text-to-image models to be creative involves using style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity to imbue the diffusion model with creativity and find our new methods improve upon the traditional method, based on automated metrics for human judgment, while still maintaining creativity and novelty.
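The abstract does not spell out the loss itself, so the following is only a minimal sketch under stated assumptions. It takes style ambiguity to be the cross-entropy between a uniform distribution over K styles and a predicted style distribution, which is the commonly used formulation for classifier-based style ambiguity; minimizing it pushes an image toward being equally likely under every style. The classifier-free variant shown here (a softmax over negative squared distances to cluster centroids in some embedding space) is a hypothetical illustration suggested by the title, not the author's confirmed method. The names `style_ambiguity_loss` and `centroid_style_probs` are assumptions introduced for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_ambiguity_loss(probs, eps=1e-12):
    """Cross-entropy between the uniform distribution over K styles and `probs`.

    Assumed formulation: the loss is smallest (close to log K) when `probs`
    is uniform, i.e. when the image is maximally ambiguous in style.
    """
    k = len(probs)
    return -sum(math.log(p + eps) for p in probs) / k


def centroid_style_probs(embedding, centroids):
    """Hypothetical classifier-free style distribution.

    Treats each cluster centroid as a "style" and converts negative squared
    Euclidean distances from the embedding to each centroid into probabilities.
    In practice the embedding would come from a pretrained multimodal model
    and the centroids from clustering unlabeled images (both assumptions).
    """
    scores = [
        -sum((e - c) ** 2 for e, c in zip(embedding, centroid))
        for centroid in centroids
    ]
    return softmax(scores)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real image embeddings.
    centroids = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
    near_one_style = [0.9, 0.0]   # close to the first centroid
    between_styles = [0.0, 0.3]   # roughly between all three centroids

    for name, emb in [("near one style", near_one_style),
                      ("between styles", between_styles)]:
        probs = centroid_style_probs(emb, centroids)
        print(name, [round(p, 3) for p in probs],
              round(style_ambiguity_loss(probs), 3))
```

In this sketch the embedding that sits between the centroids receives the lower loss, which is the behavior a generator trained on this objective would be rewarded for. No labels or trained classifier are involved; only embeddings and cluster centers.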
Taxonomy
Topics: Speech Recognition and Synthesis
