TL;DR
T2VIndexer introduces a generative sequence-to-sequence model for text-video retrieval that significantly reduces retrieval time while maintaining or improving accuracy across multiple datasets.
Contribution
The paper presents a novel generative model-based video indexer that enables constant-time retrieval and introduces encoding and augmentation techniques for semantic video representation.
Findings
Cuts retrieval time to 30-50% of the original while improving accuracy.
Enhances retrieval efficiency on four standard datasets.
Maintains high retrieval performance with semantic video encoding.
Abstract
Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This approach matches every candidate video against the query, so it incurs a significant time cost that grows notably as the number of candidates increases. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, we introduce a model-based video indexer named T2VIndexer, a sequence-to-sequence generative model that directly generates video identifiers and retrieves candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high…
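The idea of generating a video identifier instead of scoring every candidate can be illustrated with a minimal sketch. This is not the paper's implementation: the identifier format, the prefix trie, greedy decoding, and the names `build_trie`, `decode_identifier`, and `make_toy_scorer` are all assumptions made for illustration, and the toy scorer stands in for a trained sequence-to-sequence model. The sketch shows only that the number of model calls equals the identifier length, regardless of how many videos are indexed.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Identifier = Tuple[int, ...]
Scorer = Callable[[List[int], List[int]], Dict[int, float]]


def build_trie(identifiers: Sequence[Identifier]) -> dict:
    """Build a nested-dict prefix trie over all indexed video identifiers."""
    root: dict = {}
    for ident in identifiers:
        node = root
        for token in ident:
            node = node.setdefault(token, {})
    return root


def decode_identifier(score_next: Scorer, trie: dict, id_length: int) -> Identifier:
    """Greedily decode one identifier, restricted to prefixes present in the trie.

    `score_next(prefix, allowed)` is a hypothetical stand-in for the generative
    model: it returns a score for each allowed next token. It is called exactly
    `id_length` times, independent of the number of indexed videos.
    """
    prefix: List[int] = []
    node = trie
    for _ in range(id_length):
        allowed = sorted(node)  # tokens that keep the prefix valid
        scores = score_next(prefix, allowed)
        best = max(allowed, key=lambda token: scores[token])
        prefix.append(best)
        node = node[best]
    return tuple(prefix)


def make_toy_scorer(target: Identifier) -> Scorer:
    """Toy scorer (assumption): prefers the token of `target` at each position."""
    def score_next(prefix: List[int], allowed: List[int]) -> Dict[int, float]:
        wanted = target[len(prefix)]
        return {token: 1.0 if token == wanted else 0.0 for token in allowed}
    return score_next


if __name__ == "__main__":
    video_ids = [(0, 1, 2), (0, 1, 3), (4, 5, 6)]  # made-up identifiers
    trie = build_trie(video_ids)
    print(decode_identifier(make_toy_scorer((0, 1, 3)), trie, id_length=3))
```

Because decoding is constrained to the trie, the output is always an identifier that exists in the index, even when the scorer prefers a token sequence that does not. A real system would typically use beam search rather than greedy decoding to return several candidates.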
