Multimodal LLM-based Query Paraphrasing for Video Search

Jiaxin Wu; Chong-Wah Ngo; Wing-Kwong Chan; Sheng-Hua Zhong; Xiong-Yong Wei; Qing Li

arXiv:2407.12341 · cs.MM · August 14, 2025 · 1 citation

Open Access

TL;DR

This paper introduces a multimodal LLM-based query paraphrasing approach for video search. It improves retrieval accuracy by addressing the out-of-vocabulary problem and complex query reasoning through text-to-text (T2T), text-to-image (T2I), and image-to-text (I2T) transformations, combined with a verification strategy.

Contribution

The paper proposes a multimodal LLM framework that paraphrases and decomposes queries to improve video search, together with a verification method that reduces hallucinations.

Findings

01 Improved retrieval performance on TRECVid datasets.

02 Effective handling of complex and out-of-vocabulary queries.

03 Insights into the benefits of query paraphrasing for video search.

Abstract

Text-to-video retrieval answers user queries through searches based on concepts and embeddings. However, due to limitations in the size of the concept bank and the amount of training data, answering queries in the wild is not always effective because of the out-of-vocabulary problem. Furthermore, neither concept-based nor embedding-based search can perform reasoning to consolidate search results for complex queries that include logical and spatial constraints. To address these challenges, we leverage large language models (LLMs) to paraphrase queries using text-to-text (T2T), text-to-image (T2I), and image-to-text (I2T) transformations. These transformations rephrase abstract concepts into simpler terms to mitigate the out-of-vocabulary problem. Additionally, complex relationships within a query can be decomposed into simpler sub-queries, improving retrieval performance by effectively…
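
As a rough illustration of why paraphrasing can help with out-of-vocabulary queries, the sketch below searches with the original query plus several simpler rewrites and averages the scores. It is not the authors' implementation: the toy index `VIDEOS`, the word-overlap scorer `toy_score`, the plain score averaging in `fuse`, and the hand-written paraphrases (standing in for LLM output) are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy index: video id -> concept words known to the search system.
VIDEOS: Dict[str, Set[str]] = {
    "v1": {"man", "riding", "bicycle", "street"},
    "v2": {"dog", "running", "beach"},
    "v3": {"person", "bicycle", "park"},
}


def toy_score(query: str, concepts: Set[str]) -> float:
    """Stand-in scorer: fraction of query words found among a video's concepts."""
    words = query.lower().split()
    if not words:
        return 0.0
    return sum(w in concepts for w in words) / len(words)


def search(query: str) -> Dict[str, float]:
    """Score every video in the toy index against a single query."""
    return {vid: toy_score(query, concepts) for vid, concepts in VIDEOS.items()}


def fuse(queries: List[str]) -> List[Tuple[str, float]]:
    """Average the per-query scores and rank videos by the fused score."""
    fused = {vid: 0.0 for vid in VIDEOS}
    for q in queries:
        for vid, score in search(q).items():
            fused[vid] += score / len(queries)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# "cyclist" is out of vocabulary for the toy index, so it matches nothing alone.
# The paraphrases are hand-written here; in practice an LLM would produce them.
queries = ["cyclist", "person riding bicycle", "man bicycle street"]
ranking = fuse(queries)
print(ranking)
```

In this toy setup the original query alone scores every video at zero, while the fused ranking places `v1` first, because the simpler rewrites use words the index knows.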

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications