UniCoRN: Unified Commented Retrieval Network with LMMs
Maximilian Jaritz, Matthieu Guillaumin, Sabine Sternig, Loris Bazzani

TL;DR
UniCoRN unifies composed multimodal retrieval and language generation in a single framework built on a frozen Large Multimodal Model (LMM), so that complex visual questions can be answered with both retrieved entities and generated comments. It reports gains over prior work on retrieval and commenting tasks.
Contribution
The paper introduces an entity adapter module and a new Commented Retrieval task with an accompanying dataset, and it combines retrieval and generation on top of a single frozen LMM.
Findings
+4.5% recall over state-of-the-art in composed retrieval
+14.9% METEOR / +18.4% BEM over RAG in commenting
Effective integration of retrieval and generation capabilities
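For readers unfamiliar with the recall metric quoted above, the following is a minimal sketch of Recall@k in its common hit-rate form, as often used in composed retrieval benchmarks. This summary does not state which value of k or which recall variant the paper reports, so the function names and the toy data here are illustrative assumptions, not the authors' evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant id appears among the top-k ranked ids, else 0.0.

    Assumption: hit-rate style Recall@k; the paper's exact variant is not
    given in this summary.
    """
    top_k = ranked_ids[:k]
    return 1.0 if any(item in relevant_ids for item in top_k) else 0.0


def mean_recall_at_k(all_ranked, all_relevant, k):
    """Average Recall@k over a list of queries."""
    scores = [recall_at_k(r, rel, k) for r, rel in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)


# Toy example (hypothetical ids): two queries, each with a ranked candidate list.
rankings = [["a", "b", "c"], ["d", "e", "f"]]
relevant = [{"c"}, {"x"}]
r_at_1 = mean_recall_at_k(rankings, relevant, 1)
r_at_3 = mean_recall_at_k(rankings, relevant, 3)
```

In this toy setup only the first query has its target within the top 3, and neither query has it at rank 1.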
Abstract
Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. On the other hand, Large Multimodal Models (LMMs) can answer with language to more complex visual questions, but without the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond Retrieval-Augmented Generation (RAG). We introduce an entity adapter module to inject the retrieved multimodal entities back into the LMM, so it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while being able to perform both…
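The retrieve-then-inject idea described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' architecture: the summary does not specify how the entity adapter is built, so the cosine-similarity retrieval, the single-query attention, and the scalar `gate` are all hypothetical choices made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def retrieve(query_emb, entity_embs, k):
    """Return indices of the k entities most similar to the query (assumed cosine scoring)."""
    order = sorted(
        range(len(entity_embs)),
        key=lambda i: cosine(query_emb, entity_embs[i]),
        reverse=True,
    )
    return order[:k]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adapt(hidden, entities, gate):
    """Hypothetical adapter step: attend from one hidden state to the retrieved
    entity embeddings and add the result as a gated residual.

    Assumption: with gate = 0.0 the hidden state is returned unchanged, which
    mimics leaving the frozen base model's behaviour untouched.
    """
    scale = math.sqrt(len(hidden))
    weights = softmax([dot(hidden, e) / scale for e in entities])
    context = [
        sum(w * e[j] for w, e in zip(weights, entities))
        for j in range(len(hidden))
    ]
    return [h + gate * c for h, c in zip(hidden, context)]


# Toy usage with 2-dimensional embeddings (made-up values).
entity_bank = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
top = retrieve([1.0, 0.0], entity_bank, 2)
retrieved = [entity_bank[i] for i in top]
unchanged = adapt([1.0, 0.0], retrieved, 0.0)
adapted = adapt([1.0, 0.0], retrieved, 1.0)
```

The gated residual is one simple way to add new inputs around a frozen model without altering its weights; whether UniCoRN uses gating of this kind is not stated in this summary.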
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
