TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models
Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, Tat-Seng Chua

TL;DR
This paper introduces TIGeR, a unified large multimodal model that combines text-to-image generation and retrieval, enabling more creative and knowledge-intensive visual content synthesis and retrieval within a single framework.
Contribution
It proposes a novel unified framework for text-to-image generation and retrieval using a single large multimodal model, including an efficient generative retrieval method and an autonomous decision mechanism.
Findings
Outperforms existing methods on TIGeR-Bench, Flickr30K, and MS-COCO.
Demonstrates effective unification of generation and retrieval tasks.
Achieves superior results in both creative and knowledge-intensive domains.
Abstract
How humans can effectively and efficiently acquire images has always been a perennial question. A classic solution is text-to-image retrieval from an existing database; however, the limited database typically lacks creativity. By contrast, recent breakthroughs in text-to-image generation have made it possible to produce attractive and counterfactual visual content, but it faces challenges in synthesizing knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval, proposing a unified framework for both tasks with one single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient generative retrieval method for text-to-image retrieval in a training-free manner. Subsequently, we unify generation and retrieval autoregressively and propose an autonomous…
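To make the idea of choosing between a retrieved and a generated image concrete, here is a minimal toy sketch. It is not the paper's method: the `Candidate` type, the `choose_image` function, and the word-overlap `toy_score` are all hypothetical stand-ins. In a real system the score would presumably come from the LMM's own judgment of how well each image matches the prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    """Hypothetical container for one candidate image."""
    source: str   # "retrieved" or "generated"
    caption: str  # stand-in for the image content, used only by the toy scorer


def toy_score(prompt: str, cand: Candidate) -> float:
    """Assumed scorer: fraction of prompt words found in the candidate caption.

    This is a placeholder for a model-based text-image matching score.
    """
    prompt_words = set(prompt.lower().split())
    caption_words = set(cand.caption.lower().split())
    return len(prompt_words & caption_words) / max(len(prompt_words), 1)


def choose_image(
    prompt: str,
    retrieved: Candidate,
    generated: Candidate,
    score: Callable[[str, Candidate], float] = toy_score,
) -> Candidate:
    """Return the candidate the scorer rates higher (ties go to the retrieved one)."""
    return max([retrieved, generated], key=lambda c: score(prompt, c))


# Knowledge-intensive prompt: the database image matches better.
landmark = choose_image(
    "eiffel tower at night",
    Candidate("retrieved", "photo of the eiffel tower at night"),
    Candidate("generated", "a tower"),
)

# Counterfactual prompt: no database image fits, so generation wins.
fantasy = choose_image(
    "a cat riding a dragon",
    Candidate("retrieved", "a cat on a sofa"),
    Candidate("generated", "a cat riding a dragon"),
)
print(landmark.source, fantasy.source)
```

The design point the sketch tries to capture is only that both candidates are ranked by one shared scoring function, so the choice between retrieval and generation is made per prompt rather than fixed in advance.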
Taxonomy
Topics: Image Retrieval and Classification Techniques
