AENet: Learning Deep Audio Features for Video Analysis
Naoya Takahashi, Michael Gygli, Luc Van Gool

TL;DR
This paper introduces AENet, a deep CNN for audio event recognition that captures long-term temporal structure. It outperforms previous audio event detection methods by 16%, and its features improve video analysis tasks such as action recognition and highlight detection.
Contribution
AENet is a novel deep audio feature extractor that operates on large temporal inputs, enabling end-to-end training and improved audio and video analysis performance.
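To make "large temporal inputs" concrete, the sketch below cuts a sequence of per-frame audio features into long, overlapping windows, each of which could then be fed to a network as a single input. The function name `long_windows` and the `window` and `hop` values are hypothetical choices for illustration; the summary does not give the window length or frame features AENet actually uses.

```python
def long_windows(frames, window, hop):
    """Split per-frame feature vectors into overlapping fixed-length windows.

    frames: list of feature vectors, one per audio frame.
    window: number of consecutive frames per network input (assumed value).
    hop:    number of frames to advance between windows (assumed value).
    Trailing frames that do not fill a whole window are dropped.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    last_start = len(frames) - window
    return [frames[start:start + window]
            for start in range(0, last_start + 1, hop)]


# Toy input: 10 frames, each a 3-dimensional feature vector.
frames = [[float(t)] * 3 for t in range(10)]
windows = long_windows(frames, window=4, hop=2)
```

A larger `window` lets each input span more of an event's temporal structure, at the cost of fewer and larger training examples per clip.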
Findings
AENet outperforms previous audio event detection methods by 16%.
Combining AENet features with visual features improves video action recognition.
AENet features enhance video highlight detection by over 8%.
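The second finding concerns combining audio and visual features. One common way to do this is early fusion: normalize each modality's feature vector and concatenate them before classification. The sketch below shows that idea only; the summary does not say which fusion scheme the paper uses, and the names `l2_normalize` and `fuse` and the toy vectors are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(audio_feat, visual_feat):
    """Early fusion: normalize each modality separately, then concatenate."""
    return l2_normalize(audio_feat) + l2_normalize(visual_feat)


# Toy vectors standing in for an audio descriptor and a visual descriptor.
audio_feat = [3.0, 4.0]
visual_feat = [0.0, 5.0, 0.0]
fused = fuse(audio_feat, visual_feat)
```

Normalizing each modality first keeps one feature set from dominating the other purely because of its scale.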
Abstract
We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period, because they lack the clear sub-word units that are present in speech. To incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works, this allows us to train an audio event detection system end-to-end. The combination of our network architecture and a novel data augmentation outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model has learnt generic audio features, similar to the way CNNs learn generic features on vision tasks. In video…
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Speech and Audio Processing
