LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval

Minh-Chi Phung; Thien-Bao Le; Cam-Tu Tran-Thi; Thu-Dieu Nguyen-Thi; Vu-Hung Dao

arXiv:2603.02888 · cs.CV · March 4, 2026

TL;DR

LLandMark is a multi-agent framework for landmark-aware multimodal video retrieval that integrates landmark detection, reasoning, and retrieval to improve accuracy and cultural relevance, especially for Vietnamese scenes.

Contribution

The paper introduces a novel multi-agent system with LLM-assisted modules for landmark detection, query reformulation, and retrieval, enhancing multimodal video search capabilities.

Findings

01. Achieves culturally grounded, explainable retrieval results.

02. Uses LLMs to automate landmark detection and query generation.

03. Improves Vietnamese scene retrieval accuracy.

Abstract

The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries,…
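To make the landmark reformulation step concrete, here is a minimal sketch of how a landmark name in a query could be swapped for a descriptive visual prompt before the query is passed to a CLIP-style text encoder. It is an illustration under stated assumptions, not the authors' implementation: the `LANDMARK_DESCRIPTIONS` table, its two entries, and the `reformulate_query` function are hypothetical, and the paper's agent may detect and describe landmarks quite differently (for example, with an LLM rather than a lookup table).

```python
import re

# Hypothetical knowledge base mapping landmark names to visual descriptions.
# The entries are illustrative and are not taken from the paper.
LANDMARK_DESCRIPTIONS = {
    "ben thanh market": "a large market building with a clock tower above its main entrance",
    "one pillar pagoda": "a small wooden pagoda standing on a single stone pillar in a lotus pond",
}


def reformulate_query(query: str) -> tuple[str, list[str]]:
    """Replace known landmark names with descriptive visual prompts.

    Returns the rewritten query and the list of landmark keys that were found.
    """
    detected = []
    rewritten = query
    for name, description in LANDMARK_DESCRIPTIONS.items():
        # Case-insensitive literal match on the landmark name.
        pattern = re.compile(re.escape(name), flags=re.IGNORECASE)
        if pattern.search(rewritten):
            detected.append(name)
            rewritten = pattern.sub(description, rewritten)
    return rewritten, detected


if __name__ == "__main__":
    rewritten, detected = reformulate_query("A crowd outside Ben Thanh Market at night")
    print(detected)
    print(rewritten)
```

The intuition behind this kind of rewrite is that a general-purpose vision-language model may not recognize a proper noun but can still match the visual attributes the description spells out. Queries without a known landmark pass through unchanged.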

Videos

LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques