V-Agent: An Interactive Video Search System Using Vision-Language Models
SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju

TL;DR
V-Agent is a multi-agent system that uses vision-language models and multimodal retrieval to support context-aware video search and interactive conversations.
Contribution
The paper introduces V-Agent, a novel multi-agent platform that combines fine-tuned vision-language models with multimodal retrieval for improved video search capabilities.
Findings
Achieves state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark.
Effectively integrates visual and spoken content for context-aware search.
Demonstrates practical application potential with demo videos and open models.
Abstract
We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. The system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based…
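To make the shared-embedding idea in the abstract concrete, here is a minimal sketch of retrieval over fused frame and transcript vectors. It is not the authors' implementation: the abstract does not say how the two modalities are combined, so the averaging in `video_embedding` is an assumption, as are all function names and the toy three-dimensional vectors, which stand in for real VLM outputs.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def video_embedding(frame_vecs, transcript_vecs):
    """Fuse per-frame and per-transcript-segment vectors into one video vector.

    Assumption: each modality is pooled and normalized on its own, then the
    two results are averaged. The paper may combine them differently.
    """
    parts = []
    if frame_vecs:
        parts.append(normalize(mean_pool(frame_vecs)))
    if transcript_vecs:
        parts.append(normalize(mean_pool(transcript_vecs)))
    if not parts:
        raise ValueError("a video needs at least one frame or transcript vector")
    return normalize(mean_pool(parts))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def search(query_vec, index, top_k=2):
    """Return the ids of the top_k videos most similar to the query vector."""
    scored = [(cosine(query_vec, emb), vid) for vid, emb in index.items()]
    scored.sort(reverse=True)
    return [vid for _, vid in scored[:top_k]]


# Toy index: hypothetical clips with hand-written stand-in embeddings.
index = {
    "cooking_clip": video_embedding([[1.0, 0.0, 0.0]], [[0.9, 0.1, 0.0]]),
    "news_clip": video_embedding([[0.0, 1.0, 0.0]], [[0.0, 0.9, 0.1]]),
    "silent_clip": video_embedding([[0.0, 0.0, 1.0]], []),  # no ASR output
}

results = search([1.0, 0.1, 0.0], index, top_k=2)
print(results)
```

Because the query and every video live in the same vector space, a clip with no speech (`silent_clip`) can still be ranked from its frames alone. In the described system, a list like `results` would then be refined by the search agent and presented by the chat agent; that stage is not sketched here.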
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
