GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization
Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi

TL;DR
GRAZE is a training-free pipeline for zero-shot localization of the First Point of Contact (FPOC) in American football practice videos, built for footage with camera motion, clutter, and multiple similarly equipped athletes.
Contribution
It combines Grounding DINO candidate discovery, motion-aware temporal reasoning, and SAM2 pixel-level verification to localize contact without labeled tackle-contact examples.
Findings
Achieves 97.4% valid outputs on 738 videos.
Localizes FPOC within ±10 frames in 77.5% of clips.
Requires no task-specific training or labeled tackle-contact examples.
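The two headline numbers above are a valid-output rate and a frame-tolerance accuracy. The sketch below shows one way such metrics could be computed. It is an illustration only: the function name `fpoc_metrics`, the use of `None` to mark a clip with no valid output, and the choice to divide both rates by the total number of clips are assumptions, since the summary does not state how the denominators are defined.

```python
from typing import Dict, Optional, Sequence


def fpoc_metrics(
    predicted: Sequence[Optional[int]],
    ground_truth: Sequence[int],
    tolerance: int = 10,
) -> Dict[str, float]:
    """Compute valid-output rate and within-tolerance rate for FPOC frames.

    predicted: predicted contact frame per clip, or None when no valid
        output was produced (hypothetical convention).
    ground_truth: annotated contact frame per clip.
    tolerance: maximum absolute frame error counted as a hit.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    total = len(ground_truth)
    if total == 0:
        raise ValueError("at least one clip is required")

    # Keep only clips for which a prediction exists.
    valid = [(p, g) for p, g in zip(predicted, ground_truth) if p is not None]
    # A hit is a valid prediction within +/- tolerance frames of the label.
    hits = sum(1 for p, g in valid if abs(p - g) <= tolerance)

    # Assumption: both rates use the total clip count as the denominator.
    return {
        "valid_rate": len(valid) / total,
        "within_tolerance_rate": hits / total,
    }


if __name__ == "__main__":
    example = fpoc_metrics([100, None, 50, 200], [105, 80, 70, 200])
    print(example)
```

In the toy call, three of four clips have a prediction and two of those fall within 10 frames of the label.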
Abstract
American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection…
