AENet: Learning Deep Audio Features for Video Analysis

Naoya Takahashi; Michael Gygli; Luc Van Gool

arXiv:1701.00599 · cs.MM · January 5, 2017 · 5 cites



TL;DR

This paper introduces AENet, a deep CNN for audio event recognition that operates on a large temporal input to capture the long-term structure of audio events. It outperforms previous audio event detection methods by 16% and improves video analysis tasks such as action recognition and highlight detection.

Contribution

AENet is a novel deep audio feature extractor that operates on large temporal inputs, enabling end-to-end training and improved audio and video analysis performance.

Findings

01

AENet outperforms previous audio event detection methods by 16%.

02

Combining AENet features with visual features improves video action recognition.

03

AENet features enhance video highlight detection by over 8%.

Abstract

We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear sub-word units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works this allows us to train an audio event detection system end-to-end. The combination of our network architecture and a novel data augmentation outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model learnt generic audio features, similar to the way CNNs learn generic features on vision tasks. In video…
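To make the idea of a "large temporal input" concrete, the following is a minimal sketch of how a long audio clip could be cut into multi-second spectrogram patches before being passed to a CNN. It is not the authors' pipeline: the abstract does not state the input representation, window length, or hop size, so the log power spectrogram, the 25 ms / 10 ms framing, the 200-frame (roughly 2 s) context, and the helper names `frame_signal`, `log_spectrogram`, and `long_context_patches` are all assumptions chosen for illustration.

```python
import numpy as np


def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]


def log_spectrogram(signal, frame_len=400, hop=160):
    """Log power spectrogram; 400/160 samples = 25 ms / 10 ms at 16 kHz (assumed)."""
    frames = frame_signal(signal, frame_len, hop) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # shape: (n_frames, frame_len // 2 + 1)


def long_context_patches(spec, context_frames, stride):
    """Cut a spectrogram into patches that each span `context_frames` frames."""
    starts = range(0, spec.shape[0] - context_frames + 1, stride)
    return np.stack([spec[s:s + context_frames] for s in starts])


# Synthetic stand-in for a 5-second clip at an assumed 16 kHz sample rate.
sample_rate = 16000
audio = np.random.default_rng(0).standard_normal(sample_rate * 5)

spec = log_spectrogram(audio)
# 200 frames at a 10 ms hop is about 2 s of context per patch (assumed value).
patches = long_context_patches(spec, context_frames=200, stride=100)
print(spec.shape, patches.shape)
```

Each patch here covers far more time than the few-frame context typical of speech front ends, which is the property the abstract motivates; a network consuming such patches would still need to be defined and trained separately.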


Code & Models

Repositories

znaoya/aenet
Official


Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Speech and Audio Processing