Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation
Yongqi Li, Hongru Cai, Wenjie Wang, Leigang Qu, Yinwei Wei, Wenjie Li, Liqiang Nie, Tat-Seng Chua

TL;DR
This paper introduces AVG, a novel autoregressive method that tokenizes images into vokens and formulates text-to-image retrieval as a token-to-voken generation task, improving alignment with semantics and visual information.
Contribution
The paper proposes a new generative approach for text-to-image retrieval that uses tokenized visual representations and combines generative and discriminative training for better performance.
Findings
Achieves superior retrieval effectiveness.
Demonstrates improved efficiency over existing methods.
Effectively aligns visual tokens with high-level semantics.
Abstract
Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line, which assigns images with unique string identifiers and generates the target identifier as the retrieval target. Despite its great potential, existing generative approaches are limited due to the following issues: insufficient visual information in identifiers, misalignment with high-level semantics, and learning gap towards the retrieval target. To address the above issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images…
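To make the generative retrieval idea concrete, here is a minimal sketch of identifier generation constrained to valid images. It is not AVG's implementation: the prefix trie, the fixed-length voken sequences, and the names `build_trie`, `constrained_beam_search`, and `toy_score` are all assumptions made for illustration. The scoring function is a toy lookup table standing in for an autoregressive model conditioned on the text query.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Voken = int


def build_trie(identifiers: Dict[str, Sequence[Voken]]) -> dict:
    """Build a prefix trie over voken sequences.

    A node is a dict mapping the next voken to a child node. The key None
    marks the end of a sequence and stores the image id.
    """
    root: dict = {}
    for image_id, vokens in identifiers.items():
        node = root
        for v in vokens:
            node = node.setdefault(v, {})
        node[None] = image_id
    return root


def constrained_beam_search(
    trie: dict,
    score_fn: Callable[[Tuple[Voken, ...], Voken], float],
    length: int,
    beam_size: int = 2,
) -> List[Tuple[str, float]]:
    """Decode voken sequences step by step, expanding only trie children.

    score_fn(prefix, voken) returns a log-probability-like score for the next
    voken. Every returned sequence corresponds to a real image in the corpus.
    """
    beams = [((), 0.0, trie)]
    for _ in range(length):
        candidates = []
        for prefix, score, node in beams:
            for v, child in node.items():
                if v is None:  # end-of-sequence marker, not a voken
                    continue
                candidates.append((prefix + (v,), score + score_fn(prefix, v), child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return [(node[None], score) for _, score, node in beams if None in node]


# Hypothetical corpus: three images, each identified by two vokens.
identifiers = {"img_a": [3, 1], "img_b": [3, 2], "img_c": [5, 1]}

# Toy scores keyed by (position, voken); a real system would query a model.
TOY_TABLE = {(0, 3): -0.1, (0, 5): -2.0, (1, 1): -1.0, (1, 2): -0.2}


def toy_score(prefix: Tuple[Voken, ...], voken: Voken) -> float:
    return TOY_TABLE.get((len(prefix), voken), -10.0)


results = constrained_beam_search(build_trie(identifiers), toy_score, length=2)
ranked_ids = [image_id for image_id, _ in results]
```

With these toy scores, `ranked_ids` places `img_b` ahead of `img_a`. Constraining each step to trie children is one common way to guarantee that a generated identifier maps to an existing image; whether AVG uses this exact mechanism is not stated in the text above.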
