F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

Hanbo Bi; Zhiqiang Yuan; Zexi Jia; Jiapei Zhang; Chongyang Li; Peixiang Luo; Ying Deng; Xiaoyue Duan; Jinchao Zhang

arXiv:2508.17714 · cs.CV · November 11, 2025

TL;DR

This paper introduces F2RVLM, a novel generative retrieval model designed to improve fine-grained fragment retrieval in multimodal long-form dialogues, addressing coherence and relevance issues in existing vision-language models.

Contribution

The paper proposes F2RVLM, a generative retrieval model trained with a two-stage paradigm and curriculum sampling, and introduces new datasets for evaluating multimodal dialogue retrieval.

Findings

01. F2RVLM outperforms existing vision-language models in retrieval accuracy.

02. The curriculum sampling strategy enhances reasoning in long, multi-turn dialogues.

03. New datasets enable realistic evaluation of multimodal dialogue retrieval.

Abstract

Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, such methods often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, which requires models to locate query-relevant fragments, comprising both utterances and images, in multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each dialogue naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based…
