ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance

Hyojin Park; Yi Li; Janghoon Cho; Sungha Choi; Jungsoo Lee; Taotao Jing; Shuai Zhang; Munawar Hayat; Dashan Gao; Ning Bi; Fatih Porikli

arXiv:2603.22872·cs.CV·March 25, 2026

TL;DR

ForeSea introduces a novel multimodal forensic search system and benchmark for long-horizon surveillance videos, enabling more accurate retrieval and temporal reasoning with image-and-text queries.

Contribution

The paper presents ForeSea, a new plug-and-play forensic search pipeline, and ForeSeaQA, a benchmark for evaluating multimodal video question answering with temporal grounding.

Findings

01

ForeSea improves retrieval accuracy by 3.5% over prior models.

02

Temporal IoU increases by 11.0 with ForeSea.

03

ForeSeaQA is the first benchmark supporting complex multimodal queries with precise temporal grounding.
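
For readers unfamiliar with the temporal IoU metric named in the second finding, here is a minimal sketch of the standard interval intersection-over-union. It assumes predictions and ground truth are single (start, end) spans in seconds. The function name `temporal_iou` and that representation are illustrative assumptions, since this summary does not give the paper's exact evaluation protocol.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals.

    Each interval is a (start, end) pair in seconds. This is the common
    textbook definition, not necessarily the exact protocol used by ForeSeaQA.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    if pred_end < pred_start or gt_end < gt_start:
        raise ValueError("each interval must satisfy start <= end")

    # Overlap is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    # Union = sum of lengths minus the overlap counted twice.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Predicted span 10s-20s vs. annotated span 15s-25s: 5s overlap, 15s union.
    print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```

A score of 1.0 means the predicted span matches the annotation exactly, and 0.0 means the two spans do not overlap at all.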

Abstract

Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods (tracking pipelines, CLIP-based models, and VideoRAG) require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., "When does this person join the fight?" with the person's image), yet this setting remains underexplored. There is also no proper benchmark for evaluating this setting, in which a video is queried with multimodal inputs. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization