ViLL-E: Video LLM Embeddings for Retrieval
Rohit Gupta, Jayakrishnan Unnikrishnan, Fan Fei, Sheng Liu, Son Tran, Mubarak Shah

TL;DR
ViLL-E is a unified VideoLLM architecture that improves retrieval by combining generative and contrastive training, achieving state-of-the-art zero-shot performance in composed video retrieval and long text retrieval.
Contribution
Introduces ViLL-E, a novel VideoLLM with a unique embedding mechanism and a three-stage training process for improved retrieval and localization.
Findings
Improves temporal localization by an average of 7% over other VideoLLMs.
Improves video retrieval accuracy by up to 4% over dual-encoder models.
Achieves state-of-the-art zero-shot performance in composed video retrieval and long text retrieval.
Abstract
Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text…
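As a rough illustration of the contrastive side of such training, the sketch below computes a symmetric InfoNCE-style loss over paired video and text embeddings, then ranks videos for a text query by cosine similarity. This is a generic formulation, not the loss ViLL-E actually uses, which this excerpt does not specify. The function names, the toy 2-D embeddings, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss; temperature is an assumed value."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    # Cosine similarity of every video with every text, scaled by temperature.
    sims = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
            for vi in v]
    sims_t = [list(col) for col in zip(*sims)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(sims_t))

def retrieve(query_emb, video_embs):
    """Return video indices sorted from most to least similar to the query."""
    q = l2_normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(v)))
              for v in video_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy data: two videos and their matching captions (hypothetical embeddings).
videos = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[1.0, 0.0], [0.0, 1.0]]
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(videos, captions_matched)
loss_swapped = symmetric_contrastive_loss(videos, captions_swapped)
ranking = retrieve([1.0, 0.1], videos)
```

On this toy data the correctly paired batch yields a much lower loss than the swapped one, which is the signal a contrastive objective uses to pull matching video and text embeddings together.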
