Modeling Music Modality with a Key-Class Invariant Pitch Chroma CNN
Anders Elowsson, Anders Friberg

TL;DR
This paper introduces a CNN that predicts perceived minor/major modality in polyphonic music audio. Its first layer computes two pitch chromas, later layers perform harmony analysis across chroma and time scales, and max pooling across pitch makes the model invariant to key class.
Contribution
The paper presents a CNN architecture that achieves key-class invariance through max pooling across pitch, following harmony analysis across chroma and time scales, and thereby improves prediction of perceived modality in polyphonic music.
Findings
Achieved an R² of about 0.71 in modality prediction on 203 excerpts, each rated by around 20 listeners
Outperformed previous systems as well as individual human listeners
Demonstrated the importance of long-scale pitch processing and pooling
Abstract
This paper presents a convolutional neural network (CNN) that uses input from a polyphonic pitch estimation system to predict perceived minor/major modality in music audio. The pitch activation input is structured to allow the first CNN layer to compute two pitch chromas focused on different octaves. The following layers perform harmony analysis across chroma and time scales. Through max pooling across pitch, the CNN becomes invariant with regard to the key class (i.e., key disregarding mode) of the music. A multilayer perceptron combines the modality activation output with spectral features for the final prediction. The study uses a dataset of 203 excerpts rated by around 20 listeners each, a small and challenging dataset size that requires carefully designed parameter sharing. With an R² of about 0.71, the system clearly outperforms previous systems as well as individual human listeners. A…
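The key-class invariance described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the function names, the MIDI octave ranges (36-60 and 60-84), the random kernel weights, and the single-layer structure are all assumptions made for illustration. The sketch folds a pitch activation matrix into two chromas, applies one shared kernel at every pitch-class rotation, and max-pools across the 12 rotations. Transposing the music rotates the chromas, which only permutes the 12 responses, so the pooled output does not change.

```python
import numpy as np

def fold_to_chroma(pitch_activation, lo, hi):
    """Sum activations for MIDI pitches in [lo, hi) into 12 pitch classes.

    pitch_activation has shape (128, T), indexed by MIDI pitch and time frame.
    """
    chroma = np.zeros((12, pitch_activation.shape[1]))
    for midi in range(lo, hi):
        chroma[midi % 12] += pitch_activation[midi]
    return chroma

def key_invariant_response(chromas, kernels):
    """Apply shared kernels at all 12 rotations, then max-pool across pitch.

    chromas: (C, 12, T) stack of chromas; kernels: (C, 12) shared weights.
    Returns an array of shape (T,).
    """
    responses = np.stack([
        # Rotating the kernel by k semitones re-roots it at pitch class k.
        np.einsum("cp,cpt->t", np.roll(kernels, k, axis=1), chromas)
        for k in range(12)
    ])  # shape (12, T)
    return responses.max(axis=0)

rng = np.random.default_rng(0)
pitch_activation = rng.random((128, 50))  # placeholder for a pitch estimator's output

# Two chromas focused on different octave ranges (ranges are hypothetical).
chromas = np.stack([
    fold_to_chroma(pitch_activation, 36, 60),
    fold_to_chroma(pitch_activation, 60, 84),
])
kernels = rng.standard_normal((2, 12))  # stand-in for learned weights

original = key_invariant_response(chromas, kernels)
# Transposition by 5 semitones corresponds to rotating the pitch-class axis.
transposed = key_invariant_response(np.roll(chromas, 5, axis=1), kernels)
print(np.allclose(original, transposed))  # True
```

Because the same weights are reused at every rotation, the number of parameters stays small, which fits the abstract's remark that the limited dataset calls for carefully designed parameter sharing.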
Taxonomy
Topics: Music and Audio Processing · Neuroscience and Music Perception · Music Technology and Sound Studies
Methods: Max Pooling
