OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman

TL;DR
This paper introduces OVR, a large-scale dataset for open vocabulary temporal repetition counting in videos, along with a transformer-based baseline model, OVRCounter, capable of localizing and counting repetitions across diverse actions.
Contribution
The paper contributes the OVR dataset and OVRCounter, a transformer-based baseline model for open vocabulary repetition counting in videos.
Findings
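The OVR dataset contains annotations for over 72K videos, each giving the repetition count, the start and end time of the repetitions, and a free-form description of what repeats.
OVRCounter localizes and counts repetitions in videos up to 320 frames long.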
Model performance improves with text-based target class specification.
Abstract
We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced as over), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, the start and end time of the repetitions, and also a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both Exo and Ego viewing conditions, with a huge variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos that are up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance assessed with and without using text to specify the target class to count. The…
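As an illustration of the annotation contents described in the abstract (repetition count, start and end time, free-form description), here is a minimal Python sketch of one annotation record. The class name, field names, time unit (seconds), and the example values are all assumptions made for illustration; they are not the dataset's actual schema or file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RepetitionAnnotation:
    """Hypothetical record for one annotated video (field names are assumed)."""

    video_id: str      # identifier of the source video (hypothetical format)
    count: int         # number of repetitions
    start_sec: float   # time at which the repeating segment starts
    end_sec: float     # time at which the repeating segment ends
    description: str   # free-form text describing what is repeating

    def __post_init__(self) -> None:
        # Basic sanity checks on the annotated values.
        if self.count < 0:
            raise ValueError("count must be non-negative")
        if self.end_sec < self.start_sec:
            raise ValueError("end_sec must not precede start_sec")

    @property
    def duration_sec(self) -> float:
        """Length of the annotated repeating segment."""
        return self.end_sec - self.start_sec

    @property
    def mean_period_sec(self) -> Optional[float]:
        """Average time per repetition, or None when nothing repeats."""
        if self.count == 0:
            return None
        return self.duration_sec / self.count


# Made-up example values, not taken from the dataset.
example = RepetitionAnnotation(
    video_id="hypothetical_clip_0001",
    count=4,
    start_sec=2.0,
    end_sec=10.0,
    description="person doing push-ups",
)
```

Keeping the description as free text rather than a fixed class label reflects the open vocabulary setting: the target to count is named in words instead of being chosen from a closed set of action classes.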
