Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images

Ali Naseh; Katherine Thai; Mohit Iyyer; Amir Houmansadr

arXiv:2404.13784·cs.CR·April 23, 2024

TL;DR

This paper shows that multimodal models can be used to craft prompts that reproduce images offered on AI-generated image marketplaces and stock image sites at a much lower cost. It highlights the economic and security implications and provides a large dataset of prompt-image pairs.

Contribution

Introduces an attack strategy using fine-tuned CLIP, multi-label classifiers, and GPT-4V to replicate AI-generated images at lower costs, and releases a large dataset of prompt-image pairs.
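
The iterative idea in the title can be illustrated with a toy loop: score a candidate prompt against the target, and revise the prompt until the score passes a threshold. The sketch below is only an illustration under strong assumptions, not the authors' pipeline. `embed`, `refine_prompt`, and `iterate_prompt` are hypothetical names; `embed` uses bag-of-words counts as a stand-in for a CLIP-style encoder, and `refine_prompt` stands in for asking a multimodal LLM such as GPT-4V to revise the prompt. No image is generated here; the prompt's own embedding plays the role of the generated image.

```python
import math
from collections import Counter


def embed(description: str) -> Counter:
    """Toy stand-in for a CLIP-style encoder: bag-of-words term counts."""
    return Counter(description.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def refine_prompt(prompt: str, target: Counter) -> str:
    """Hypothetical stand-in for an LLM revision step.

    Appends one term that the target has and the prompt lacks
    (alphabetically first, so the behaviour is deterministic).
    """
    present = set(prompt.lower().split())
    missing = sorted(term for term in target if term not in present)
    return f"{prompt} {missing[0]}" if missing else prompt


def iterate_prompt(target_description, initial_prompt,
                   threshold=0.9, max_rounds=10):
    """Refine the prompt until its similarity to the target reaches threshold."""
    target = embed(target_description)
    prompt = initial_prompt
    history = []  # (prompt, score) for every round that was scored
    for _ in range(max_rounds):
        score = cosine_similarity(embed(prompt), target)
        history.append((prompt, score))
        if score >= threshold:
            break
        prompt = refine_prompt(prompt, target)
    return prompt, history


if __name__ == "__main__":
    final, history = iterate_prompt("a red fox in snowy forest at dawn", "a fox")
    for candidate, score in history:
        print(f"{score:.3f}  {candidate}")
```

In a real setting the scoring step would compare a generated image with the target image, and each round would incur generation and API costs, which is why a round limit such as `max_rounds` matters.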

Findings

01. Comparable images can be produced for $0.23-$0.27 each.

02. Automated metrics and human assessments confirm image similarity.

03. Releases a dataset of 19 million prompt-image pairs from Midjourney.

Abstract

With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual understanding to mimic the outputs of these platforms, introducing an original attack strategy. Our method leverages fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those available in marketplaces and from premium stock image providers, yet at a markedly lower expense. In presenting this strategy, we aim to spotlight a new class of economic and security considerations within the realm of digital…

Taxonomy

Topics: Image Processing and 3D Reconstruction · Image and Signal Denoising Methods · Generative Adversarial Networks and Image Synthesis

Methods: Contrastive Language-Image Pre-training