TL;DR
This paper explores using large language model encoders to incorporate prior knowledge and pseudo-events into video moment retrieval, improving inter-concept relation modeling and achieving state-of-the-art results.
Contribution
It introduces a framework that uses LLM encoders, rather than decoders, to refine multimodal embeddings in VMR. This avoids the limitation that decoders produce discrete text instead of continuous outputs, and the paper shows that the refinement capability transfers to other embeddings.
Findings
LLM encoders effectively refine inter-concept relations in multimodal embeddings.
Refinement capabilities transfer to embeddings like BLIP and T5 with similar inter-concept patterns.
The framework achieves state-of-the-art performance in video moment retrieval.
Abstract
In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP…
Taxonomy
Methods: Gated Linear Unit · Attention Is All You Need · Byte Pair Encoding · Inverse Square Root Schedule · SentencePiece · Dropout · Contrastive Language-Image Pre-training · Layer Normalization · Linear Layer
