Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss

James Baker

arXiv:2407.12009 · cs.CV · July 18, 2024

TL;DR

This paper introduces a style ambiguity training method for text-to-image models that requires neither a trained classifier nor a labeled dataset. On automated metrics that approximate human judgment, it outperforms the traditional classifier-based approach while maintaining creativity and novelty.

Contribution

The work proposes a new style ambiguity training objective that does not rely on classifiers or labeled data, improving creativity in diffusion models.

Findings

01. Enhanced creativity in diffusion models demonstrated

02. Outperforms traditional style ambiguity methods

03. Maintains creativity and novelty

Abstract

Teaching text-to-image models to be creative involves using style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity to imbue the diffusion model with creativity and find our new methods improve upon the traditional method, based on automated metrics for human judgment, while still maintaining creativity and novelty.
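
To make the idea of a classifier-free style ambiguity objective concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes one plausible reading of the title: image embeddings come from a multimodal foundation model, cluster centroids stand in for style classes, and a softmax over negative distances to those centroids plays the role of the classifier output. The function names, the `temperature` parameter, and the toy vectors are all hypothetical. The loss is the cross-entropy between the uniform distribution and the pseudo-style distribution, so minimizing it corresponds to maximizing style ambiguity.

```python
import numpy as np


def style_probabilities(embedding, centroids, temperature=1.0):
    """Pseudo-style distribution from distances to cluster centroids.

    Assumption: `embedding` (shape [d]) and `centroids` (shape [k, d]) live in
    the same embedding space; closer centroids receive higher probability.
    """
    dists = np.linalg.norm(centroids - embedding, axis=1)
    logits = -dists / temperature
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def style_ambiguity_loss(probs, eps=1e-12):
    """Cross-entropy between the uniform distribution and `probs`.

    The minimum, log(k), is reached when `probs` is uniform, i.e. when the
    image cannot be assigned to any single style cluster.
    """
    return float(-np.mean(np.log(probs + eps)))


# Toy example: three hypothetical centroids in a 3-dimensional space.
centroids = np.eye(3)
ambiguous = np.zeros(3)   # equidistant from every centroid
typical = centroids[0]    # sits exactly on one centroid

loss_ambiguous = style_ambiguity_loss(style_probabilities(ambiguous, centroids))
loss_typical = style_ambiguity_loss(style_probabilities(typical, centroids))
```

In this toy setup the equidistant embedding yields a uniform distribution and the lowest possible loss, while the embedding that coincides with one centroid yields a higher loss. In a real training loop the negative of such a loss (or an equivalent reward) would be used to fine-tune the diffusion model; that wiring is outside the scope of this sketch.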

Taxonomy

Topics: Speech Recognition and Synthesis