LLMs can see and hear without any training

Kumar Ashutosh; Yossi Gandelsman; Xinlei Chen; Ishan Misra; Rohit Girdhar

arXiv:2501.18096·cs.CV·January 31, 2025

TL;DR

MILS is a simple, training-free method that gives large language models multimodal capabilities through iterative prompting and scoring, achieving state-of-the-art emergent zero-shot captioning for images, video, and audio and improving text-to-image generation via prompt rewrites.

Contribution

Introduces MILS, an iterative prompting approach that enables multimodal reasoning and applications in LLMs without any additional training.

Findings

01

Achieves state-of-the-art zero-shot captioning for images, videos, and audio.

02

Improves text-to-image generation through prompt rewrites.

03

Enables cross-modal embedding inversion for applications like style transfer.

Abstract

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging the LLM's innate ability to perform multi-step reasoning, MILS prompts it to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
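
The generate, score, and feed-back loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mils_style_loop`, `toy_score`, `make_toy_generator`), the word-overlap scorer, and the random word-swapping generator are all hypothetical stand-ins. In a real setting the generator would presumably be an LLM prompted with the scored candidates, and the scorer a multimodal similarity model; here both are replaced with toy functions so the loop runs on its own.

```python
import random
from typing import Callable, List, Tuple

Scored = List[Tuple[float, str]]


def mils_style_loop(
    generate: Callable[[Scored], List[str]],
    score: Callable[[str], float],
    init_candidates: List[str],
    steps: int = 10,
    keep: int = 5,
) -> Tuple[float, str]:
    """Gradient-free search: score candidates, feed the best back to the generator."""
    scored = sorted(((score(c), c) for c in init_candidates), reverse=True)[:keep]
    for _ in range(steps):
        proposals = generate(scored)  # generator sees the scored feedback
        pool = scored + [(score(c), c) for c in proposals]
        # Deduplicate, then keep the top candidates for the next round.
        scored = sorted(set(pool), reverse=True)[:keep]
    return scored[0]


# --- Toy stand-ins (assumptions, for illustration only) ---
VOCAB = ["a", "dog", "cat", "runs", "sleeps", "on", "grass", "sofa"]
TARGET = {"a", "dog", "runs", "on", "grass"}  # hidden "content" the scorer rewards


def toy_score(text: str) -> float:
    """Jaccard overlap with TARGET; stands in for a multimodal similarity score."""
    words = set(text.split())
    return len(words & TARGET) / len(words | TARGET)


def make_toy_generator(seed: int = 0) -> Callable[[Scored], List[str]]:
    """Returns a generator that swaps one word in each kept candidate; stands in for an LLM."""
    rng = random.Random(seed)

    def generate(scored: Scored) -> List[str]:
        out = []
        for _, text in scored:
            words = text.split()
            words[rng.randrange(len(words))] = rng.choice(VOCAB)
            out.append(" ".join(words))
        return out

    return generate


if __name__ == "__main__":
    best_score, best_text = mils_style_loop(
        make_toy_generator(0), toy_score, ["a cat sleeps on sofa"], steps=50, keep=3
    )
    print(best_score, best_text)
```

Because the previous best candidates always stay in the pool, the top score never decreases from one step to the next. No gradients flow through the scorer, which is what lets the same loop be pointed at any black-box scoring function.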

Code & Models

Repositories

facebookresearch/mils
PyTorch · Official
