SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models

Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, and Sergio Escalera

arXiv:2601.04824 · cs.CV · January 12, 2026

Open Access

TL;DR

SOVABench is a new surveillance-focused video retrieval benchmark that tests vehicle action discrimination and temporal direction understanding. It shows that current vision and multimodal models struggle with these distinctions, and it demonstrates a training-free MLLM-based approach that produces interpretable embeddings.

Contribution

The paper introduces SOVABench, a novel real-world surveillance benchmark for vehicle action retrieval, and proposes a training-free MLLM-based framework for interpretable video and image embeddings.

Findings

01. State-of-the-art models struggle with cross-action discrimination.

02. MLLM-based embeddings outperform contrastive vision-language models on several benchmarks.

03. SOVABench provides a challenging testbed for surveillance action retrieval.

Abstract

Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language…
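
The abstract names two evaluation protocols but this page does not spell out how they are scored. As a rough illustration only, the sketch below shows one plausible reading: an inter-pair score that ranks a query against clips from other action pairs, and an intra-pair score that ranks it only against clips of the same action and its opposite. The function names, the toy embeddings, the action labels, and the use of cosine similarity with mean average precision are all assumptions made for this example, not details taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def average_precision(ranked_relevance):
    """Average precision over a ranked list of booleans (True = relevant)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def retrieval_map(items, protocol):
    """Mean average precision under a hypothetical 'inter' or 'intra' protocol.

    items: list of (embedding, pair_id, action) tuples.
    'inter' (assumed): gallery excludes the opposite action of the query's pair.
    'intra' (assumed): gallery is restricted to the query's own pair.
    """
    aps = []
    for qi, (q_emb, q_pair, q_action) in enumerate(items):
        gallery = []
        for gi, (g_emb, g_pair, g_action) in enumerate(items):
            if gi == qi:
                continue  # never retrieve the query itself
            if protocol == "intra" and g_pair != q_pair:
                continue
            if protocol == "inter" and g_pair == q_pair and g_action != q_action:
                continue
            relevant = g_pair == q_pair and g_action == q_action
            gallery.append((cosine(q_emb, g_emb), relevant))
        gallery.sort(key=lambda t: t[0], reverse=True)
        relevance = [rel for _, rel in gallery]
        if any(relevance):
            aps.append(average_precision(relevance))
    return sum(aps) / len(aps) if aps else 0.0

# Hypothetical toy embeddings: they separate the two pairs well,
# but barely encode which direction of the action is shown.
TOY_ITEMS = [
    ([1.0, 0.0], "door", "open"),
    ([0.9, 0.1], "door", "open"),
    ([1.0, 0.05], "door", "close"),
    ([0.95, 0.0], "door", "close"),
    ([0.0, 1.0], "park", "enter"),
    ([0.1, 0.9], "park", "enter"),
    ([0.05, 1.0], "park", "exit"),
    ([0.0, 0.95], "park", "exit"),
]

inter_map = retrieval_map(TOY_ITEMS, "inter")
intra_map = retrieval_map(TOY_ITEMS, "intra")
print(f"inter-pair mAP: {inter_map:.3f}")
print(f"intra-pair mAP: {intra_map:.3f}")
```

On this toy data the inter-pair score is high while the intra-pair score drops, which is the kind of gap such a split is meant to expose: an embedding can capture which scene or object is involved without capturing the temporal direction of the action.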

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Surveillance and Tracking Methods