TL;DR
This paper introduces ViReD, an interactive framework for video retrieval in which an AI agent asks the user questions over multiple rounds of dialog. Its question generator is trained with information-guided supervision so that the answers improve retrieval accuracy.
Contribution
It presents a multimodal question generator with information-guided supervision that enhances video retrieval through interactive dialog, outperforming traditional static systems.
Findings
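Multiple rounds of dialog improve retrieval accuracy over static, single-query retrieval.
The question generator draws on both visual cues (the video candidates retrieved in the last round) and linguistic cues (the text-based dialog history).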
The approach generalizes to real-world human interactions.
Abstract
The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be sub-optimal if the initial query has ambiguities, which would lead to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, where the user refines retrieved results by answering questions generated by an AI agent. Our novel multimodal question generator learns to ask questions that maximize the subsequent video retrieval performance using (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions, to generate…
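The interaction loop described in the abstract (retrieve candidates, ask a question conditioned on those candidates and the dialog history, fold the answer back into retrieval) can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: the captions stand in for videos, word overlap stands in for a learned text-video similarity, `ask_question` is a hand-written heuristic rather than the paper's learned multimodal question generator, and `simulated_user` is a hypothetical oracle that knows the target video.

```python
from collections import Counter

# Toy corpus: hypothetical video ids mapped to captions (not from the paper).
VIDEOS = {
    "v1": "a man cooking pasta in a kitchen",
    "v2": "a woman cooking soup in a kitchen",
    "v3": "a man repairing a bike outdoors",
}


def tokens(text):
    return text.lower().split()


def retrieve(query, answers, k):
    """Rank videos by word overlap with the query plus confirmed words,
    penalizing words the user has denied. A stand-in for a learned retriever."""
    wanted = set(tokens(query)) | {w for w, yes in answers if yes}
    denied = {w for w, yes in answers if not yes}

    def score(video_id):
        caption = set(tokens(VIDEOS[video_id]))
        return len(wanted & caption) - len(denied & caption)

    # Sort by descending score, breaking ties by video id for determinism.
    return sorted(VIDEOS, key=lambda v: (-score(v), v))[:k]


def ask_question(candidates, query, answers):
    """Heuristic placeholder for the question generator: pick a word that
    separates the current candidates and has not been discussed yet."""
    seen = set(tokens(query)) | {w for w, _ in answers}
    counts = Counter(w for v in candidates for w in set(tokens(VIDEOS[v])))
    options = sorted(
        w for w, c in counts.items() if 0 < c < len(candidates) and w not in seen
    )
    return options[0] if options else None


def simulated_user(target, word):
    """Hypothetical user who answers 'does the video show <word>?' truthfully."""
    return word in tokens(VIDEOS[target])


def run_dialog(query, target, rounds=3, k=2):
    answers = []  # dialog history as (word asked about, yes/no answer) pairs
    for _ in range(rounds):
        candidates = retrieve(query, answers, k)
        word = ask_question(candidates, query, answers)
        if word is None:  # nothing left that distinguishes the candidates
            break
        answers.append((word, simulated_user(target, word)))
    return retrieve(query, answers, k)


if __name__ == "__main__":
    print("static:", retrieve("someone cooking", [], 2))
    print("dialog:", run_dialog("someone cooking", "v2"))
```

In this toy setting the ambiguous query "someone cooking" matches two captions equally well, so the static ranking depends on tie-breaking alone, while the answers collected over the dialog rounds move the intended video to the top.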
