Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video
Jiantang Huang

TL;DR
This paper introduces a two-pass, zero-shot method for localizing rare traffic accidents in CCTV footage without fine-tuning, reaching ACC^S = 0.539 on the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos).
Contribution
It proposes a coarse-to-fine two-pass pipeline combined with specialist role assignment (Qwen3-VL-Plus for grounding, Gemini 3.1 Flash-Lite for collision typing) for joint temporal-spatial accident grounding in surveillance videos.
Findings
Achieves an ACC^S score of 0.539 (95% CI [0.525, 0.553]) on the ACCIDENT@CVPR 2026 benchmark.
Outperforms the previous best baseline by 0.127 in accuracy.
Uses up to three API calls per video, costing approximately $20.
Abstract
Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a ±3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]):…
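The coarse-to-fine step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The abstract gives only the 1 fps and 5 fps rates, the 3 s half-window, and the existence of two deterministic gates, so everything else here is an assumption: the names `Estimate`, `refine_window`, `sample_times`, and `gate`, the thresholds `time_eps` and `edge_eps`, normalized coordinates in [0, 1], the reading of a "boundary hedge" as a refined time landing on the window boundary, and the choice to revert the whole estimate rather than one component. Model calls are omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    t: float      # event time in seconds
    x: float      # horizontal location, assumed normalized to [0, 1]
    y: float      # vertical location, assumed normalized to [0, 1]
    c: str = ""   # collision type label (placeholder values only)


def refine_window(t_coarse, duration, half_width=3.0):
    """Window of +/- half_width seconds around the coarse time, clipped to the video."""
    return max(0.0, t_coarse - half_width), min(duration, t_coarse + half_width)


def sample_times(start, end, fps=5.0):
    """Timestamps at which the second pass would sample frames inside the window."""
    n = int((end - start) * fps + 1e-9) + 1
    return [start + i / fps for i in range(n)]


def gate(coarse, refined, window, time_eps=0.2, edge_eps=1e-3):
    """Return the refined estimate unless one of two assumed gates fires.

    Gate 1 (assumed meaning of "boundary hedge"): the refined time sits
    within time_eps of either end of the refinement window.
    Gate 2 (edge-clamped coordinates): x or y sits within edge_eps of 0 or 1.
    If either fires, the whole coarse estimate is kept (an assumption).
    """
    start, end = window
    at_boundary = refined.t - start <= time_eps or end - refined.t <= time_eps
    clamped = any(v <= edge_eps or v >= 1.0 - edge_eps for v in (refined.x, refined.y))
    if at_boundary or clamped:
        return coarse
    # The collision type comes from a separate typing step, so carry it over.
    return Estimate(refined.t, refined.x, refined.y, coarse.c)


if __name__ == "__main__":
    coarse = Estimate(10.0, 0.5, 0.5, "type_a")    # hypothetical first-pass output
    window = refine_window(coarse.t, duration=30.0)
    print(window, len(sample_times(*window)))
    print(gate(coarse, Estimate(10.4, 0.42, 0.61), window))   # accepted
    print(gate(coarse, Estimate(13.0, 0.42, 0.61), window))   # reverted: boundary
    print(gate(coarse, Estimate(10.4, 0.0, 0.61), window))    # reverted: clamped
```

Because the gates are pure functions of the two estimates and the window, the fallback behaviour is deterministic and can be checked without calling any model.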
