V-Agent: An Interactive Video Search System Using Vision-Language Models

SunYoung Park; Jong-Hyeon Lee; Youngjune Kim; Daegyu Sung; Younghyun Yu; Young-rok Cha; Jeongho Ju

arXiv:2512.16925 · cs.CV · January 8, 2026

TL;DR

V-Agent is a multi-agent system that uses vision-language models and multimodal retrieval to support context-aware video search and interactive conversations.

Contribution

The paper introduces V-Agent, a novel multi-agent platform that combines fine-tuned vision-language models with multimodal retrieval for improved video search capabilities.

Findings

01. Achieves state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark.

02. Integrates visual and spoken content for context-aware search.

03. Demonstrates practical application potential with demo videos and open models.

Abstract

We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based…
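
The abstract describes frames and ASR transcriptions being embedded independently into one shared space, so a text query can be compared against both. The following is a minimal sketch of that idea, not the authors' implementation. The toy 3-dimensional vectors stand in for real model embeddings, and the fusion rule (best-matching frame averaged with the transcript score, weighted by a hypothetical `alpha`) is an assumption, since the visible abstract does not say how the two signals are combined. The names `score_video` and `search` are likewise illustrative.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def score_video(query_vec, frame_vecs, transcript_vec, alpha=0.5):
    """Score one video against a query in the shared embedding space.

    ASSUMPTION: visual score = best-matching frame; the final score is a
    weighted average of visual and spoken similarity. The paper's actual
    fusion rule is not given in the abstract.
    """
    q = l2_normalize(query_vec)
    visual = float(np.max(l2_normalize(frame_vecs) @ q))
    spoken = float(l2_normalize(transcript_vec) @ q)
    return alpha * visual + (1.0 - alpha) * spoken

def search(query_vec, videos, top_k=2):
    """Rank videos by fused score and return the top_k (video_id, score) pairs."""
    scored = [
        (video_id, score_video(query_vec, v["frames"], v["transcript"]))
        for video_id, v in videos.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy embeddings (hypothetical); a real system would obtain these from the
# VLM-based retrieval model and an ASR module.
videos = {
    "cooking": {"frames": [[1, 0, 0], [0.9, 0.1, 0]], "transcript": [0.8, 0.2, 0]},
    "football": {"frames": [[0, 1, 0], [0, 0.9, 0.1]], "transcript": [0, 1, 0]},
}
query = [1, 0, 0]
results = search(query, videos, top_k=1)
print(results)
```

Because frames and transcripts are embedded separately, either modality alone can surface a video: a query matching only the spoken content still contributes through the transcript term.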

Code & Models

Models

NCSOFT/GME-VARCO-VISION-Embedding (Hugging Face model · 171 downloads · 12 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques