Multimodal LLM-based Query Paraphrasing for Video Search
Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, Sheng-Hua Zhong, Xiao-Yong Wei, Qing Li

TL;DR
This paper introduces a multimodal LLM-based query paraphrasing approach for video search that enhances retrieval accuracy by addressing out-of-vocabulary issues and complex query reasoning through various transformations and a verification strategy.
Contribution
The paper proposes a novel multimodal LLM framework that paraphrases and decomposes queries to improve video search, along with a verification method to reduce hallucinations.
Findings
Improved retrieval performance on TRECVid datasets.
Effective handling of complex and out-of-vocabulary queries.
Insights into the benefits of query paraphrasing for video search.
Abstract
Text-to-video retrieval answers user queries through searches based on concepts and embeddings. However, due to limitations in the size of the concept bank and the amount of training data, answering queries in the wild is not always effective because of the out-of-vocabulary problem. Furthermore, neither concept-based nor embedding-based search can perform reasoning to consolidate search results for complex queries that include logical and spatial constraints. To address these challenges, we leverage large language models (LLMs) to paraphrase queries using text-to-text (T2T), text-to-image (T2I), and image-to-text (I2T) transformations. These transformations rephrase abstract concepts into simpler terms to mitigate the out-of-vocabulary problem. Additionally, complex relationships within a query can be decomposed into simpler sub-queries, improving retrieval performance by effectively…
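The idea of decomposing a complex query into simpler sub-queries and then consolidating their results can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the sub-queries are written by hand (in the paper an LLM produces them), `toy_search` is a hypothetical stand-in for a concept- or embedding-based search engine, and the fusion rule (minimum score for AND, threshold filtering for NOT) is one simple choice among many.

```python
from typing import Dict, List

Scores = Dict[str, float]

# Hypothetical toy index: video id -> concept tags detected in the video.
INDEX: Dict[str, List[str]] = {
    "v1": ["man", "guitar", "indoors"],
    "v2": ["man", "guitar", "outdoors"],
    "v3": ["woman", "piano", "indoors"],
}


def toy_search(sub_query: str, index: Dict[str, List[str]]) -> Scores:
    """Stand-in search engine: score = fraction of query words among the tags."""
    words = sub_query.lower().split()
    return {
        vid: sum(w in tags for w in words) / len(words)
        for vid, tags in index.items()
    }


def consolidate(include: List[Scores], exclude: List[Scores],
                threshold: float = 0.5) -> List[str]:
    """Fuse sub-query results (assumed rule, not the paper's).

    AND: a video's score is its minimum score over all included sub-queries.
    NOT: a video is dropped if any excluded sub-query scores it >= threshold.
    """
    if not include:
        return []
    videos = set.intersection(*(set(s) for s in include))
    ranked = []
    for vid in videos:
        if any(s.get(vid, 0.0) >= threshold for s in exclude):
            continue
        ranked.append((min(s[vid] for s in include), vid))
    # Highest score first; ties broken by video id for a stable order.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [vid for score, vid in ranked if score > 0]


def search_decomposed(include_queries: List[str], exclude_queries: List[str],
                      index: Dict[str, List[str]]) -> List[str]:
    """Run each sub-query separately, then consolidate the ranked lists."""
    include = [toy_search(q, index) for q in include_queries]
    exclude = [toy_search(q, index) for q in exclude_queries]
    return consolidate(include, exclude)


# "A man playing guitar, not indoors", decomposed by hand.
print(search_decomposed(["man", "guitar"], ["indoors"], INDEX))
```

In this sketch the negation is applied as a filter after the positive sub-queries are ranked, which is the kind of logical constraint that a single concept or embedding lookup cannot express on its own.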
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
