SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models
Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, and Sergio Escalera

TL;DR
SOVABench is a surveillance video retrieval benchmark that tests vehicle action discrimination and temporal direction understanding. It shows that these distinctions remain challenging for current models, and it demonstrates a training-free MLLM-based approach that produces interpretable embeddings.
Contribution
The paper introduces SOVABench, a novel real-world surveillance benchmark for vehicle action retrieval, and proposes a training-free MLLM-based framework for interpretable video and image embeddings.
Findings
State-of-the-art models struggle with cross-action discrimination.
MLLM-based embeddings outperform contrastive vision-language models on several benchmarks.
SOVABench provides a challenging testbed for surveillance action retrieval.
Abstract
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language…
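To make the two protocols concrete, here is a minimal sketch of how embedding-based retrieval could be scored under each. Everything in it is an assumption for illustration: the abstract does not state the metric or how galleries are built, so the sketch uses cosine similarity with mean average precision, treats "inter-pair" as ranking a query against all other clips, and treats "intra-pair" as ranking it only against clips from its own opposite-action pair (for example, two actions that differ mainly in temporal direction). The names `mean_ap`, `action`, and `pair` are hypothetical and not taken from the paper.

```python
import numpy as np


def average_precision(scores, relevant):
    """Average precision for one query, given gallery scores and a boolean relevance mask."""
    order = np.argsort(-scores, kind="stable")  # highest similarity first
    rel = relevant[order]
    if rel.sum() == 0:
        return None  # no relevant item in the gallery, so the query is skipped
    hits = np.cumsum(rel)
    ranks = np.arange(1, len(rel) + 1)
    # Precision at each rank where a relevant item appears, averaged.
    return float((hits[rel] / ranks[rel]).mean())


def mean_ap(emb, action, pair, protocol):
    """Mean average precision under a hypothetical 'inter' or 'intra' protocol.

    emb:    (n, d) array of clip embeddings
    action: (n,) integer action label per clip
    pair:   (n,) integer id of the opposite-action pair each clip belongs to
    """
    if protocol not in ("inter", "intra"):
        raise ValueError("protocol must be 'inter' or 'intra'")
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T  # cosine similarity between all clips
    aps = []
    for q in range(len(emb)):
        gallery = np.arange(len(emb)) != q  # never retrieve the query itself
        if protocol == "intra":
            # Assumption: intra-pair restricts the gallery to the query's own pair.
            gallery &= pair == pair[q]
        ap = average_precision(sims[q, gallery], action[gallery] == action[q])
        if ap is not None:
            aps.append(ap)
    return float(np.mean(aps))
```

Under this reading, an embedding that captures scene appearance but not motion direction would rank the opposite action of the same pair close to the query, which lowers the intra-pair score in particular.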
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Surveillance and Tracking Methods
