Weakly-Supervised Temporal Localization via Occurrence Count Learning
Julien Schroeter, Kirill Sidorov, David Marshall

TL;DR
This paper introduces a weakly-supervised deep learning model that learns to localize events in time using only occurrence counts, reducing annotation effort while achieving results comparable to fully-supervised methods.
Contribution
The paper presents a novel counting-based training framework that implicitly learns temporal localization without explicit location annotations.
Findings
Effective for drum hit and piano onset detection in audio and for digit detection in images
Achieves performance comparable to fully-supervised methods
Requires only occurrence counts as training labels, removing the need to annotate event locations
Abstract
We propose a novel model for temporal detection and localization which allows the training of deep neural networks using only counts of event occurrences as training labels. This powerful weakly-supervised framework alleviates the burden of the imprecise and time-consuming process of annotating event locations in temporal data. Unlike existing methods, in which localization is explicitly achieved by design, our model learns localization implicitly as a byproduct of learning to count instances. This unique feature is a direct consequence of the model's theoretical properties. We validate the effectiveness of our approach in a number of experiments (drum hit and piano onset detection in audio, digit detection in images) and demonstrate performance comparable to that of fully-supervised state-of-the-art methods, despite much weaker training requirements.
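The abstract does not spell out the training objective, so the following is only a minimal sketch of how counting can drive localization, under stated assumptions: a network is assumed to emit an independent event probability for each time step, the number of events is then the sum of those Bernoulli variables, and the loss is the negative log-likelihood of the annotated count. The names `count_distribution` and `count_loss` are hypothetical and are not taken from the paper.

```python
import numpy as np

def count_distribution(p):
    """Distribution over the total number of events, assuming each time
    step t fires independently with probability p[t] (illustrative
    assumption, not the paper's stated formulation)."""
    dist = np.zeros(len(p) + 1)
    dist[0] = 1.0  # before seeing any step, zero events with certainty
    for p_t in p:
        new = dist * (1.0 - p_t)       # no event at this step
        new[1:] += dist[:-1] * p_t     # one more event at this step
        dist = new
    return dist

def count_loss(p, count):
    """Negative log-likelihood of observing `count` events in total."""
    return -np.log(count_distribution(p)[count] + 1e-12)

# Two candidate outputs for a sequence annotated with a count of 2.
sharp = np.array([0.0, 1.0, 0.0, 1.0])    # confident, localized events
diffuse = np.array([0.5, 0.5, 0.5, 0.5])  # same expected count, no localization

loss_sharp = count_loss(sharp, 2)
loss_diffuse = count_loss(diffuse, 2)
```

Both outputs have an expected count of 2, but only the sharp one places all probability mass on exactly two events, so it receives the lower loss. Under these assumptions, this is one intuition for why a count-only objective can favor predictions that commit to specific time steps.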
Taxonomy
Topics: Music and Audio Processing · Anomaly Detection Techniques and Applications · Time Series Analysis and Forecasting
