ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance
Hyojin Park, Yi Li, Janghoon Cho, Sungha Choi, Jungsoo Lee, Taotao Jing, Shuai Zhang, Munawar Hayat, Dashan Gao, Ning Bi, Fatih Porikli

TL;DR
ForeSea introduces a novel multimodal forensic search system and benchmark for long-horizon surveillance videos, enabling more accurate retrieval and temporal reasoning with image-and-text queries.
Contribution
The paper presents ForeSea, a new plug-and-play forensic search pipeline, and ForeSeaQA, a benchmark for evaluating multimodal video question answering with temporal grounding.
Findings
ForeSea improves retrieval accuracy by 3.5% over prior models.
Temporal IoU increases by 11.0 with ForeSea.
ForeSeaQA is the first benchmark supporting complex multimodal queries with precise temporal grounding.
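For readers unfamiliar with the temporal IoU metric cited in the findings, the following is a minimal sketch of how intersection-over-union between a predicted and a ground-truth time interval is commonly computed. It is an illustration only: the function name `temporal_iou` and the `(start, end)` tuple format in seconds are assumptions, and this summary does not state how the paper computes, averages, or scales the score.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals.

    Both arguments are hypothetical (start, end) tuples in seconds;
    this format is assumed for illustration, not taken from the paper.
    """
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union is the sum of both lengths minus the shared overlap.
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


# A prediction of 10-20 s against a ground truth of 15-25 s
# overlaps for 5 s out of a 15 s union.
score = temporal_iou((10.0, 20.0), (15.0, 25.0))
```

A score of 1.0 means the predicted interval matches the annotated event exactly, and 0.0 means the two intervals do not overlap at all.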
Abstract
Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods -- tracking pipelines, CLIP-based models, and VideoRAG -- require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., "When does this person join the fight?" with the person's image), yet this setting remains underexplored. Moreover, no proper benchmark exists for evaluating this setting, in which a video is queried with multimodal inputs. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
