Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images
Ali Naseh, Katherine Thai, Mohit Iyyer, Amir Houmansadr

TL;DR
This paper shows that multimodal models can be prompted iteratively to produce images closely resembling those sold on AI-image marketplaces and by stock-image providers, at a much lower cost. It highlights the economic and security implications and releases a large dataset of prompt-image pairs.
Contribution
Introduces an attack strategy that combines fine-tuned CLIP models, a multi-label classifier, and GPT-4V to write prompts that reproduce marketplace and stock images at lower cost than buying them, and releases a large dataset of prompt-image pairs.
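The iterative prompting idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (refine_prompt, describe, generate, similarity, revise), the stopping threshold, and the round limit are all assumptions made for illustration. In the paper's setting, the description and revision steps would be handled by models such as GPT-4V with the CLIP models and multi-label classifier, and generation by an image API. Here each step is a plain callable, and the toy stand-ins treat an "image" as a set of tags so the loop runs without any model.

```python
from typing import Callable, FrozenSet, Tuple

Image = FrozenSet[str]  # toy stand-in: an "image" is just a set of tags


def refine_prompt(
    target: Image,
    describe: Callable[[Image], str],
    generate: Callable[[str], Image],
    similarity: Callable[[Image, Image], float],
    revise: Callable[[Image, Image, str], str],
    max_rounds: int = 5,      # hypothetical budget, not from the paper
    threshold: float = 0.9,   # hypothetical stopping score, not from the paper
) -> Tuple[str, float]:
    """Return the best prompt found and its similarity to the target."""
    prompt = describe(target)  # initial guess at the prompt
    best_prompt, best_score = prompt, float("-inf")
    for _ in range(max_rounds):
        candidate = generate(prompt)            # render the current prompt
        score = similarity(target, candidate)   # compare with the target
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= threshold:
            break
        # Ask the (assumed) vision-language step to fix what is missing.
        prompt = revise(target, candidate, prompt)
    return best_prompt, best_score


# Toy stand-ins so the sketch is runnable; none of these model real systems.
def toy_describe(target: Image) -> str:
    return sorted(target)[0]


def toy_generate(prompt: str) -> Image:
    return frozenset(prompt.split(", "))


def toy_similarity(a: Image, b: Image) -> float:
    return len(a & b) / len(a | b)  # Jaccard overlap of tags


def toy_revise(target: Image, candidate: Image, prompt: str) -> str:
    missing = sorted(target - candidate)
    return prompt + ", " + missing[0] if missing else prompt


if __name__ == "__main__":
    target = frozenset({"castle", "fog", "sunset"})
    print(refine_prompt(target, toy_describe, toy_generate,
                        toy_similarity, toy_revise))
```

Passing the four steps in as callables keeps the loop independent of any particular model or API, so the stand-ins can be swapped for real components without changing the control flow.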
Findings
Comparable images can be produced at $0.23-$0.27 each
Automated metrics and human assessments confirm image similarity
Releases a dataset of 19 million prompt-image pairs from Midjourney
Abstract
With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual understanding to mimic the outputs of these platforms, introducing an original attack strategy. Our method leverages fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those available in marketplaces and from premium stock image providers, yet at a markedly lower expense. In presenting this strategy, we aim to spotlight a new class of economic and security considerations within the realm of digital…
Taxonomy
Topics: Image Processing and 3D Reconstruction · Image and Signal Denoising Methods · Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Language-Image Pre-training
