Framework for evaluation of sound event detection in web videos
Rohan Badlani, Ankit Shah, Benjamin Elizalde, Anurag Kumar, Bhiksha Raj

TL;DR
This paper proposes a framework for large-scale sound event detection in web videos by using search queries as labels, demonstrating that search query-based labels closely approximate human annotations in performance.
Contribution
It introduces a novel approach to label web videos for sound event detection using search queries, enabling large-scale recognition without manual labeling.
Findings
Performance evaluated against search-query labels is within 10% of performance evaluated against human labels.
The framework produces predictions for 3.7 million web video segments.
Using search queries as labels is a viable preliminary method for sound event recognition.
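The comparison behind these findings can be illustrated with a small sketch. It assumes precision over the top-k highest-scoring segments as the metric, which this summary does not specify; the `Segment` type, the `precision_at_k` helper, and all scores and labels are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    score: float          # classifier score for the target sound event
    query: str            # search query that retrieved the source video
    human_positive: bool  # hypothetical human judgment for the target event


def precision_at_k(segments, target, k, ground_truth):
    """Precision over the k highest-scoring segments for `target`.

    ground_truth="query": a segment is correct if its video was retrieved
    by the query for `target` (the search query acts as the label).
    ground_truth="human": a segment is correct if a human marked it positive.
    """
    if k < 1 or not segments:
        raise ValueError("need k >= 1 and at least one segment")
    top = sorted(segments, key=lambda s: s.score, reverse=True)[:k]
    if ground_truth == "query":
        hits = sum(s.query == target for s in top)
    elif ground_truth == "human":
        hits = sum(s.human_positive for s in top)
    else:
        raise ValueError("ground_truth must be 'query' or 'human'")
    return hits / len(top)


# Made-up segments, scored by a hypothetical "dog bark" classifier.
segments = [
    Segment(0.95, "dog bark", True),
    Segment(0.90, "dog bark", False),
    Segment(0.85, "siren", False),
    Segment(0.80, "dog bark", True),
    Segment(0.40, "siren", False),
]

for k in (2, 4):
    p_query = precision_at_k(segments, "dog bark", k, "query")
    p_human = precision_at_k(segments, "dog bark", k, "human")
    print(k, p_query, p_human)
```

Evaluating both ground truths over the same ranked list, at several values of k, is one way to check whether the query-based score tracks the human-based score as more segments are evaluated.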
Abstract
The largest source of sound events is web videos. Most videos lack sound event labels at the segment level; however, a significant number of them do respond to text queries, through matches that search engines find in their metadata. In this paper we explore the extent to which a search query can be used as the true label for detection of sound events in videos. We present a framework for large-scale sound event recognition on web videos. The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets. The datasets are used to train three classifiers, and we obtain predictions on 3.7 million web video segments. We evaluate performance using the search query as the true label and compare it with human labeling. Both types of ground truth exhibited close performance, to within 10%, and a similar performance trend with increasing number of evaluated…
