Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou

TL;DR
This paper introduces Nyx, a mixed-modal retriever for universal retrieval-augmented generation (URAG). It targets retrieval and reasoning over combined text and image data to improve vision-language generation.
Contribution
The paper makes three contributions: Nyx, a unified mixed-modal retriever; NyxQA, a new dataset; and a two-stage training framework for retrieval-augmented generation over mixed-modal information.
Findings
Nyx outperforms existing methods on vision-language tasks.
Nyx achieves competitive results on standard RAG benchmarks.
The dataset NyxQA reflects real-world mixed-modal information needs.
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse…
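The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the Nyx implementation: it assumes that some mixed-modal encoder (hypothetical, not shown) has already mapped each query and document, whether text, image, or interleaved text and image, into one shared vector space. The vectors, document IDs, and the `retrieve` helper below are invented for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Vector,
    corpus: List[Tuple[str, Vector]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumption): in a real system these would come from a
# single encoder that accepts text, images, or interleaved text and images.
corpus = [
    ("doc_text_only", [0.9, 0.1, 0.0]),
    ("doc_image_only", [0.0, 0.2, 0.9]),
    ("doc_text_and_image", [0.6, 0.1, 0.6]),
]

# A mixed-modal query (e.g. a question plus an image), also pre-encoded.
query = [0.5, 0.0, 0.5]

top = retrieve(query, corpus, k=2)
for doc_id, score in top:
    print(f"{doc_id}: {score:.3f}")
```

Because every modality lands in one vector space, a single similarity search ranks text-only, image-only, and mixed documents together; the retrieved documents would then be passed to a vision-language model as generation context.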
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
