LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators
Allen Roush, Emil Zakirov, Artemiy Shirokov, Polina Lunina, Jack Gane, Alexander Duffy, Charlie Basil, Aber Whitcomb, Jim Benedetto, Chris DeWolfe

TL;DR
This paper introduces LaDi, a system that uses Large Language Models as Art Directors to improve the quality, coherence, and artistic relevance of images and videos generated from text prompts.
Contribution
The paper presents a unified system called LaDi that integrates multiple techniques to enhance text-to-media generation using LLMs as Art Directors.
Findings
LaDi improves the artistic coherence of generated media.
Techniques like constrained decoding and intelligent prompting enhance output quality.
LaDi is actively used in commercial applications by Plai Labs.
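As an illustration of the constrained decoding mentioned in the findings, the following is a minimal sketch, not the authors' implementation. It assumes a greedy decoder that fills a fixed sequence of prompt slots and may only choose from an allowed list at each slot. The names `constrained_greedy_decode`, `toy_score`, `build_prompt`, the slot vocabularies, and the score values are all hypothetical; in a real system the scores would come from an LLM's logits.

```python
from typing import Callable, Dict, List


def constrained_greedy_decode(
    score_fn: Callable[[List[str], str], float],
    allowed_per_step: List[List[str]],
) -> List[str]:
    """At each step, pick the best-scoring token among the allowed ones only."""
    output: List[str] = []
    for allowed in allowed_per_step:
        if not allowed:
            raise ValueError("each step needs at least one allowed token")
        # Tokens outside `allowed` are never considered, however high they score.
        best = max(allowed, key=lambda tok: score_fn(output, tok))
        output.append(best)
    return output


# Hypothetical scores standing in for model logits (assumption, not from the paper).
TOY_SCORES: Dict[str, float] = {
    "photo": 2.0,
    "oil painting": 1.5,
    "blurry": 3.0,  # highest score, but excluded by the constraints below
    "golden hour": 1.2,
    "neon glow": 0.7,
}


def toy_score(prefix: List[str], token: str) -> float:
    """Toy scorer that ignores the prefix; a real LLM would condition on it."""
    return TOY_SCORES.get(token, 0.0)


def build_prompt(subject: str) -> str:
    """Append one style tag and one lighting tag chosen under constraints."""
    steps = [
        ["photo", "oil painting"],     # style slot
        ["golden hour", "neon glow"],  # lighting slot
    ]
    tags = constrained_greedy_decode(toy_score, steps)
    return ", ".join([subject] + tags)


if __name__ == "__main__":
    print(build_prompt("a lighthouse at sea"))
```

The design point is that the constraint is applied before selection rather than by filtering afterwards, so an undesirable but high-scoring token such as "blurry" can never reach the text-to-image prompt.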
Abstract
Recent advancements in text-to-image generation have revolutionized numerous fields, including art and cinema, by automating the generation of high-quality, context-aware images and video. However, the utility of these technologies is often limited by the inadequacy of text prompts in guiding the generator to produce artistically coherent and subject-relevant images. In this paper, we describe the techniques that can be used to make Large Language Models (LLMs) act as Art Directors that enhance image and video generation. We describe our unified system for this, called "LaDi". We explore how LaDi integrates multiple techniques for augmenting the capabilities of text-to-image generators (T2Is) and text-to-video generators (T2Vs), with a focus on constrained decoding, intelligent prompting, fine-tuning, and retrieval. LaDi and these techniques are being used today in apps and platforms…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
Methods: Focus
