TL;DR
This paper introduces Hybrid Contrastive Quantization (HCQ), a novel method for efficient cross-view video retrieval that balances high retrieval performance with reduced storage and computational costs through multi-level quantization and contrastive learning.
Contribution
The authors present HCQ as the first quantized representation learning approach for cross-view video retrieval. It combines coarse-grained and fine-grained quantization with contrastive learning to improve efficiency and robustness.
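To make the contrastive-learning ingredient concrete, here is a minimal sketch of a generic symmetric InfoNCE loss over a batch of paired text and video embeddings. It is not the authors' exact objective: the function name `info_nce`, the temperature value of 0.07, and the use of cosine similarity are all assumptions made for illustration, and the quantization step is left out.

```python
import numpy as np

def info_nce(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each input is a matched pair.

    temperature=0.07 is an assumed, commonly used default, not a value from the paper.
    """
    # L2-normalise so the dot product is cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # shape: (batch, batch)

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
loss_aligned = info_nce(x, x)         # every pair matches
loss_shuffled = info_nce(x, x[::-1])  # every pair is mismatched
```

Matched pairs should yield a much lower loss than mismatched ones, which is the signal a contrastive objective uses to pull corresponding text and video representations together.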
Findings
HCQ achieves accuracy competitive with state-of-the-art non-compressed methods.
HCQ significantly reduces storage and computation requirements.
Extensive experiments validate HCQ's effectiveness on benchmark datasets.
Abstract
With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence queries has become an important demand and attracts increasing research attention. Despite their decent performance, existing text-video retrieval models in the vision and language communities are impractical for large-scale Web search because they adopt brute-force search based on high-dimensional embeddings. To improve efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS) to post-process the learned embeddings. Unfortunately, separating compression from feature encoding degrades the robustness of representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).…
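The post-processing baseline the abstract describes (compressing embeddings after they have been learned) can be sketched with plain product quantization in NumPy. This illustrates the separate-compression setup that HCQ is contrasted with, not HCQ itself. The function names, the 32-dimensional random data, and the choice of 4 sub-vectors with 16 codewords each are all hypothetical.

```python
import numpy as np

def train_pq(x, n_sub=4, n_codes=16, n_iter=10, seed=0):
    """Learn one k-means codebook per sub-vector (plain product quantization)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    assert d % n_sub == 0, "dimension must split evenly into sub-vectors"
    sub_d = d // n_sub
    codebooks = np.empty((n_sub, n_codes, sub_d), dtype=x.dtype)
    for m in range(n_sub):
        part = x[:, m * sub_d:(m + 1) * sub_d]
        # Initialise centers from randomly chosen data points.
        centers = part[rng.choice(n, n_codes, replace=False)].copy()
        for _ in range(n_iter):
            dist = ((part[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for k in range(n_codes):
                members = part[assign == k]
                if len(members):  # keep the old center if a cluster is empty
                    centers[k] = members.mean(0)
        codebooks[m] = centers
    return codebooks

def encode_pq(x, codebooks):
    """Replace each sub-vector with the index of its nearest codeword."""
    n_sub, _, sub_d = codebooks.shape
    codes = np.empty((x.shape[0], n_sub), dtype=np.uint8)
    for m in range(n_sub):
        part = x[:, m * sub_d:(m + 1) * sub_d]
        dist = ((part[:, None, :] - codebooks[m][None]) ** 2).sum(-1)
        codes[:, m] = dist.argmin(1)
    return codes

def search_pq(query, codes, codebooks, top_k=5):
    """Asymmetric distance search: exact query against quantized database."""
    n_sub, _, sub_d = codebooks.shape
    # table[m, k] = squared distance from query sub-vector m to codeword k
    table = np.stack([
        ((codebooks[m] - query[m * sub_d:(m + 1) * sub_d]) ** 2).sum(-1)
        for m in range(n_sub)
    ])
    dist = table[np.arange(n_sub), codes].sum(1)  # one table lookup per sub-vector
    return np.argsort(dist)[:top_k]

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 32)).astype(np.float32)  # stand-in embeddings
codebooks = train_pq(db)
codes = encode_pq(db, codebooks)
top = search_pq(db[0], codes, codebooks)
print(db.nbytes, codes.nbytes)  # 128000 bytes of float32 vs. 4000 bytes of codes
```

In this toy setting each vector shrinks from 128 bytes to 4 bytes, and search needs only table lookups. The codebooks, however, are fitted after the encoder is fixed, which is the decoupling the abstract identifies as the source of performance decay.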
Taxonomy
Methods: Contrastive Learning
