LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval
Minh-Chi Phung, Thien-Bao Le, Cam-Tu Tran-Thi, Thu-Dieu Nguyen-Thi, Vu-Hung Dao

TL;DR
LLandMark is a multi-agent framework for landmark-aware multimodal video retrieval that integrates landmark detection, reasoning, and retrieval to improve accuracy and cultural relevance, especially for Vietnamese scenes.
Contribution
The paper introduces a novel multi-agent system with LLM-assisted modules for landmark detection, query reformulation, and retrieval, enhancing multimodal video search capabilities.
Findings
Achieves culturally grounded, explainable retrieval results.
Uses LLMs to automate landmark detection and query generation.
Improves Vietnamese scene retrieval accuracy.
Abstract
The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries,…
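The four stages named in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `LANDMARK_DESCRIPTIONS` table and its entries, and the token-overlap score (a toy stand-in for CLIP similarity) are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical landmark knowledge base (name -> visual description).
# Entries are illustrative only and are not taken from the paper.
LANDMARK_DESCRIPTIONS: Dict[str, str] = {
    "ben thanh market": "a large market hall with a clock tower above its main entrance",
    "one pillar pagoda": "a small wooden pagoda standing on a single stone pillar in a pond",
}


@dataclass
class QueryPlan:
    original: str
    landmarks: List[str] = field(default_factory=list)
    visual_prompt: str = ""


def parse_query(query: str) -> QueryPlan:
    """Stage 1 (query parsing and planning): wrap the raw query in a plan."""
    return QueryPlan(original=query.strip())


def reason_about_landmarks(plan: QueryPlan) -> QueryPlan:
    """Stage 2 (landmark reasoning): replace known landmark names with
    descriptive visual prompts."""
    lowered = plan.original.lower()
    plan.landmarks = [name for name in LANDMARK_DESCRIPTIONS if name in lowered]
    prompt = lowered
    for name in plan.landmarks:
        prompt = prompt.replace(name, LANDMARK_DESCRIPTIONS[name])
    plan.visual_prompt = prompt
    return plan


def retrieve(plan: QueryPlan, index: Dict[str, str]) -> List[Tuple[int, str]]:
    """Stage 3 (multimodal retrieval): score each frame caption by token
    overlap with the visual prompt. A real system would compare embeddings."""
    query_tokens = set(plan.visual_prompt.split())
    return [
        (len(query_tokens & set(caption.lower().split())), frame_id)
        for frame_id, caption in index.items()
    ]


def rerank(scored: List[Tuple[int, str]], top_k: int = 1) -> List[str]:
    """Stage 4 (reranked answer synthesis): keep the best-scoring frames."""
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [frame_id for _, frame_id in ranked[:top_k]]


def search(query: str, index: Dict[str, str], top_k: int = 1) -> List[str]:
    plan = reason_about_landmarks(parse_query(query))
    return rerank(retrieve(plan, index), top_k)


if __name__ == "__main__":
    toy_index = {
        "f1": "a market hall with a clock tower",
        "f2": "a beach at sunset",
    }
    print(search("people walking near Ben Thanh Market", toy_index))
```

The point of the sketch is the second stage: a landmark name alone gives a generic matcher little to work with, while a visual description of it shares vocabulary with what actually appears in a frame.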
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
