A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features
Olga Slizovskaia, Emilia Gómez, Gloria Haro

TL;DR
This paper investigates the explainability of CNNs in music audio recognition by comparing learned activations with traditional hand-crafted audio features, revealing correlations between neural responses and classical audio descriptors.
Contribution
It introduces a method to measure similarity between CNN activation maps and traditional audio features, enhancing understanding of neural representations in music information retrieval.
Findings
Shallow layer activations correlate with harmonic and percussive features.
Deep layer activations relate to chromagrams, loudness, and onset rate.
Some neurons explicitly correspond to classical audio features.
Abstract
The explainability of Convolutional Neural Networks (CNNs) is a particularly challenging task in all areas of application, and it is notably under-researched in the music and audio domain. In this paper, we approach explainability by exploiting the knowledge we have of hand-crafted audio features. Our study focuses on a well-defined MIR task, the recognition of musical instruments from user-generated music recordings. We compute the similarity between a set of traditional audio features and representations learned by CNNs. We also propose a technique for measuring the similarity between activation maps and audio features that are typically presented in the form of a matrix, such as chromagrams or spectrograms. We observe that some neurons' activations correspond to well-known classical audio features. In particular, for shallow layers, we found similarities between activations and harmonic and…
Abstract
This work presents a method for analysis of the activations of audio convolutional neural networks by use of hand-crafted audio features. We analyse activations from three CNN architectures trained on different datasets and compare shallow-level activation maps with harmonic-percussive source separation and chromagrams, and deep-level activations with loudness and onset rate.
Machine Learning, Music, Convolutional Neural Networks, Explainability
1 Introduction
In this paper, we focus on feature analysis in the music domain. Our goal is to find similar patterns between the features (activations and activation maps) learned by a network and hand-crafted audio features, which are well understood in the literature. For that purpose, we analyse features from a dataset of user-generated recordings of different musical instrument performances. We address musical instrument recognition as it is a well-defined task and it can be objectively evaluated.
For feature attribution understanding, there are two major directions: (1) perturbation-based algorithms, such as LIME (Ribeiro et al., 2016), Axiomatic Attribution (Sundararajan et al., 2017) or Saliency Analysis (Montavon et al., 2017), and (2) gradient-based algorithms, such as Guided Backpropagation (Simonyan et al., 2013; Montavon et al., 2017), Class-Activation Mapping (CAM) (Zhou et al., 2016), and Network Dissection (Bau et al., 2017). In the music domain, the SoundLIME algorithm (Mishra et al., 2017) has been adapted from the original LIME. However, in most cases these techniques have only limited applicability to spectrograms because, unlike in a typical image, the two dimensions of a spectrogram represent different quantities, namely time and frequency.
Therefore, manual feature exploration remains popular. One could create a playlist that corresponds to a particular neuron and infer this neuron's ‘specialization’ by listening to the playlist. This approach was proposed by Dieleman (2014) and it provides valuable insights. However, it is not scalable because it requires an expert to listen to the playlist and guess the rationale behind it.
Also, we can take advantage of a number of well-established mid-level audio features that have been proposed and studied in the MIR literature (Schedl et al., 2014). We know that CNNs in computer vision learn boundaries in the first layer and more complex concepts in subsequent layers. We hypothesize that audio-based CNNs can occasionally learn some of the hand-crafted features in a similar manner. We try to identify those features in pre-trained neural networks.
2 Methodology
Hand-crafted audio features. We focus our study on a compact set of mid-level features related to different musical facets: onset rate, loudness and Harmonic Pitch Class Profile (HPCP) computed by Essentia (Bogdanov et al., 2013), and Harmonic/Percussive Sound Separation (HPSS) computed by librosa (McFee et al., 2015).
Network Architectures. We explore three state-of-the-art VGG-style architectures: CNN AudioTagger (CNN-AT) (Choi et al., 2016), VGGish (Hershey et al., 2017), and Musically Motivated CNN (MM-CNN) (Pons et al., 2017). All three receive a mel-spectrum as input and consist of blocks of convolutional and max-pooling layers followed by dense layers.
The differences between the architectures and their initializations include filter shape (square filters in CNN-AT and VGGish, and rectangular filters in MM-CNN), activation function and pre-training settings. We trained CNN-AT and MM-CNN on a subset of the FCVID dataset (Jiang et al., 2015). VGGish is initialized with weights provided by the authors. This network has been trained on the large-scale AudioSet dataset (Gemmeke et al., 2017) and potentially has stronger discriminative ability.
Similarity measures: individual activations. For high-level embeddings of a network, we consider each activation as an individual feature and compare it with onset rate and mean loudness. We consider two similarity metrics: (1) Pearson Correlation Coefficient and (2) Euclidean distance over the normalized vectors.
Similarity measures: activation maps. Activations of convolutional layers have the form of a matrix. They are slightly offset from the original input spectrum due to the padding, and scaled relative to the input because of max pooling. To some extent, we can think of them as pseudo-spectrograms or as filtered and aggregated spectrograms. In order to compare those activations with HPSS or HPCP, we need a method for fuzzy matrix comparison which is scale- and shift-invariant. We propose a vision-inspired similarity metric based on Scale-Invariant Feature Transform (SIFT) (Lowe, 2004) descriptors. SIFT descriptors are among the most recognized features in computer vision and a reasonable choice for similarity measurement (Hua et al., 2012).
To compute the similarity between a feature map and an activation map, we compute SIFT descriptors for both and then match the descriptors. An example of matching is shown in Figure 1. Each match is characterized by the indexes of the matched descriptors and a matching distance.
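The two metrics can be sketched in a few lines of NumPy. In this sketch, each input is assumed to be a vector holding one value per recording: one neuron's activation on the one hand and a hand-crafted feature such as mean loudness on the other. The text does not say which normalization is applied before the Euclidean distance, so scaling each vector to unit L2 norm is an assumption, and the function names are hypothetical.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    # A constant vector has no defined correlation; report 0.0 in that case.
    return float(xc @ yc / denom) if denom > 0 else 0.0


def normalized_euclidean(x, y):
    """Euclidean distance after scaling each vector to unit L2 norm (assumed normalization)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx = np.linalg.norm(x)
    ny = np.linalg.norm(y)
    xn = x / nx if nx > 0 else x
    yn = y / ny if ny > 0 else y
    return float(np.linalg.norm(xn - yn))
```

Applied column by column to a matrix of activations (recordings by neurons), these functions give one score per neuron, which can then be ranked to find the neurons closest to a given feature.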
3 Experiments and Results
High-level embeddings vs. onset rate and loudness. We explored three high-level activation layers of the VGGish model: an embedding layer with 128 neurons and two fully-connected layers with 4096 neurons each. For the embedding layer, we found statistically significant correlations with both onset rate and loudness; some examples of the corresponding features are shown in Figure 2. In the first fully-connected layer, we discovered that neuron #1964 has an outstanding correlation with loudness (with correlation coefficient ). For CNN-AT, we found that activation #259 corresponds to onset rate.
Low-level feature correspondences. We found a number of interesting activation maps which look similar to the HPSS decomposition in the first convolutional layer of the VGGish network. The histograms of similarity metrics with respect to activation maps can be found in the supplementary materials (high-resolution figures, code and more examples), located at https://goo.gl/jM3jZM. The second convolutional layer of the VGGish network does not have a strong correspondence to the HPSS decomposition, even though some linear combinations of activation maps could be similar.
For the CNN-AT network, we examine the second convolutional layer and observe that the similarity metric histograms for the HPSS decomposition are not consistent, which might be related to a higher false matching rate between decompositions and activation maps. Finally, the first layers of the MM-CNN architecture represent strongly filtered spectrograms, so we presume that the tall rectangular filters of this architecture are similar to band-pass filters.
4 Conclusion
Even though the models we investigate are complex and can construct features in a very different way from traditional methods, the correspondences between hand-crafted features and activations provide insights for a better understanding of the internal representations of CNNs. We believe that the proposed methodology can be applied to identify important neurons in other tasks and architectures.
5 Acknowledgement
This work has received funding from the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502) and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant 770376, TROMPA). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.
- Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017.
- Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P., Mayor, O., Roma, G., Salamon, J., Zapata, J. R., and Serra, X. ESSENTIA: an Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval Conference (ISMIR’13), pp. 493–498, 2013.
- Choi, K., Fazekas, G., and Sandler, M. Automatic Tagging Using Deep Convolutional Neural Networks. In International Society of Music Information Retrieval Conference. ISMIR, 2016.
- Dieleman, S. Recommending music on Spotify with deep learning, 2014. URL http://benanne.github.io/2014/08/05/spotify-cnns.html.
- Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
- Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R., and Wilson, K. CNN Architectures for Large-Scale Audio Classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
- Hua, S., Chen, G., Wei, H., and Jiang, Q. Similarity measure for image resizing using SIFT feature. EURASIP Journal on Image and Video Processing, 2012(1):6, 2012.
- Jiang, Y.-G., Wu, Z., Wang, J., Xue, X., and Chang, S.-F. Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks. arXiv preprint arXiv:1502.07209, 2015.
