TL;DR
This paper introduces F2RVLM, a novel generative retrieval model designed to improve fine-grained fragment retrieval in multimodal long-form dialogues, addressing coherence and relevance issues in existing vision-language models.
Contribution
The paper proposes F2RVLM, a generative retrieval model trained with a two-stage paradigm and curriculum sampling, and introduces new datasets for evaluating multimodal dialogue retrieval tasks.
Findings
F2RVLM outperforms existing vision-language models in retrieval accuracy.
The curriculum sampling strategy enhances reasoning in long, multi-turn dialogues.
New datasets enable realistic evaluation of multimodal dialogue retrieval.
Abstract
Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, such approaches often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based…
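To make the FFR task definition concrete, the sketch below shows one possible way to represent a multimodal dialogue and a retrieved fragment. It is only an illustration under stated assumptions: the names Turn, select_fragment, and fragment_f1 are hypothetical, the toy dialogue is invented, and the set-overlap F1 is a generic scoring choice, not necessarily the metric used in the paper.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Turn:
    """One dialogue turn (hypothetical schema, not the MLDR format)."""
    turn_id: int                          # position in the dialogue
    speaker: str
    kind: Literal["utterance", "image"]   # a fragment may mix both kinds
    content: str                          # text, or an image path/ID


def select_fragment(dialogue: list[Turn], predicted_ids: set[int]) -> list[Turn]:
    """Return the turns named by predicted_ids, kept in dialogue order."""
    return [t for t in dialogue if t.turn_id in predicted_ids]


def fragment_f1(predicted_ids: set[int], gold_ids: set[int]) -> float:
    """Set-overlap F1 between predicted and gold turn IDs.

    Assumption: an illustrative metric only; the paper's evaluation
    protocol is not described in the text above.
    """
    overlap = len(predicted_ids & gold_ids)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_ids)
    recall = overlap / len(gold_ids)
    return 2 * precision * recall / (precision + recall)


# Invented toy dialogue covering two topics (hiking, then dinner).
dialogue = [
    Turn(0, "A", "utterance", "How was the hike?"),
    Turn(1, "B", "utterance", "Great, the summit view was amazing."),
    Turn(2, "B", "image", "summit.jpg"),
    Turn(3, "A", "utterance", "What are you cooking tonight?"),
]

# A query such as "the summit view" might map to turns 1 and 2.
fragment = select_fragment(dialogue, {1, 2})
score = fragment_f1({1, 2}, {1, 2, 3})
```

Representing a fragment as a set of turn IDs keeps utterances and images in one index space, which matches the task description of fragments "comprising both utterances and images".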