Dance Hit Song Prediction

Dorien Herremans; David Martens; Kenneth Sörensen

arXiv:1905.08076·cs.SD·May 21, 2019

TL;DR

This paper develops and evaluates machine learning models to predict whether dance songs will become top 10 hits, using a comprehensive dataset with musical features from 1985 to 2013.

Contribution

It introduces a new dataset of dance hits with advanced temporal features and compares multiple classifiers for hit song prediction.

Findings

01

Best model accurately predicts top 10 dance hits

02

Temporal features improve prediction performance

03

Multiple classifiers tested with varying success

Abstract

Record companies invest billions of dollars in new talent around the globe each year. Gaining insight into what actually makes a hit song would provide tremendous benefits for the music industry. In this research we tackle this question by focussing on the dance hit song classification problem. A database of dance hit songs from 1985 until 2013 is built, including basic musical features, as well as more advanced features that capture a temporal aspect. A number of different classifiers are used to build and test dance hit prediction models. The resulting best model has a good performance when predicting whether a song is a "top 10" dance hit versus a lower listed position.

Tables

Table 1: Hit listings overview.

	OCC	BB
Top	40	10
Date range	10/2009–3/2013	1/1985–3/2013
Hit listings	7,159	14,533
Unique songs	759	3,361

Table 2: Example of hit listings before adding musical features.

Song title	Artist	Position	Date	Peak position
Harlem Shake	Bauer	2	09/03/13	1
Are You Ready For Love	Elton John	40	08/12/12	34
The Game Has Changed	Daft Punk	32	18/12/10	32
…

Table 3: Datasets used for the dance hit prediction model.

Dataset	Hits	Non-hits	Size
D1	Top 10	Top 30-40	400
D2	Top 10	Top 20-40	550
D3	Top 20	Top 20-40	697

Table 4: The most commonly occurring features in D1, D2 and D3 after feature selection (FS).

Feature	Occurrence	Feature	Occurrence
Beatdiff (range)	3	Timbre 1 (mean)	2
Timbre 1 (80 perc)	3	Timbre 1 (median)	2
Timbre 1 (max)	3	Timbre 2 (max)	2
Timbre 1 (stdev)	3	Timbre 2 (mean)	2
Timbre 2 (80 perc)	3	Timbre 2 (range)	2
Timbre 3 (mean)	3	Timbre 3 (var)	2
Timbre 3 (median)	3	Timbre 4 (80 perc)	2
Timbre 3 (min)	3	Timbre 5 (mean)	2
Timbre 3 (stdev)	3	Timbre 5 (stdev)	2
Beatdiff (80 perc)	2	Timbre 6 (median)	2
Beatdiff (stdev)	2	Timbre 6 (range)	2
Beatdiff (var)	2	Timbre 6 (var)	2
Timbre 11 (80 perc)	2	Timbre 7 (var)	2
Timbre 11 (var)	2	Timbre 8 (Median)	2
Timbre 12 (kurtosis)	2	Timbre 9 (kurtosis)	2
Timbre 12 (Median)	2	Timbre 9 (max)	2
Timbre 12 (min)	2	Timbre 9 (Median)	2

Table 5: RIPPER ruleset.

(T1mean $\leq$ -0.020016) and (T3min $\leq$ -0.534123) and (T2max $\geq$ -0.250608) $\Rightarrow$ NoHit
(T8 80perc $\leq$ -0.405264) and (T3mean $\leq$ -0.075106) $\Rightarrow$ NoHit
$\Rightarrow$ Hit

Table 6: Results with 10-fold validation (accuracy).

Accuracy (%)	D1		D2		D3
	-	FS	-	FS	-	FS
C4.5	57.05	58.25	54.95	54.67	54.58	54.74
RIPPER	60.95	62.43	56.69	56.42	57.18	56.41
Naive Bayes	65	65	60.22	58.78	59.57	59.18
Logistic regression	64.65	64	62.64	60.6	60.12	59.75
SVM (Polynomial)	64.97	64.7	61.55	61.6	61.04	61.07
SVM (RBF)	64.7	64.63	59.8	59.89	60.8	60.76

Table 7: Results for 10 runs with 10-fold validation (AUC).

AUC	D1		D2		D3
	-	FS	-	FS	-	FS
C4.5	0.53	0.55	0.55	0.54	0.54	0.53
RIPPER	0.55	0.56	0.56	0.56	0.54	0.55
Naive Bayes	0.64	0.65	0.64	0.63	0.6	0.61
Logistic regression	0.65	0.65	0.67	0.64	0.61	0.63
SVM (Polynomial)	0.6	0.59	0.61	0.61	0.58	0.58
SVM (RBF)	0.56	0.56	0.59	0.6	0.57	0.57

Table 8: Results for 10 runs on D1 (FS) with 10-fold cross validation compared with the split test set.

	AUC		accuracy (%)
	split	10CV	split	10CV
C4.5	0.62	0.55	62.50	58.25
RIPPER	0.66	0.56	85	62.43
Naive Bayes	0.79	0.65	77.50	65
Logistic regression	0.81	0.65	80	64
SVM (Polynomial)	0.729	0.59	85	64.7
SVM (RBF)	0.57	0.56	82.5	64.63

Table 9: Confusion matrix logistic regression.

a	b	$\leftarrow$ classified as
209	44	a = hit
100	47	b = non-hit

Equations

P(\mathbf{x} \mid Y = y) = \prod_{j=1}^{M} P(x_{j} \mid Y = y),

P(Y \mid \mathbf{x}) = \frac{P(Y) \cdot \prod_{j=1}^{M} P(x_{j} \mid Y)}{P(\mathbf{x})}

f_{hit}(s_{i}) = \frac{1}{1 + e^{-s_{i}}} \quad \text{whereby} \quad s_{i} = b + \sum_{j=1}^{M} a_{j} \cdot x_{j}

\left\{\begin{array}{ll}\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_{i})+b\geq+1, & \text{if } y_{i}=+1\\ \mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_{i})+b\leq-1, & \text{if } y_{i}=-1\end{array}\right.

y_{i}[\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_{i})+b]\geq 1, \quad i=1,...,N.

y(\mathbf{x})=\mathrm{sign}[\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x})+b],

\min_{\mathbf{w},b,\boldsymbol{\xi}}\mathcal{J}(\mathbf{w},b,\boldsymbol{\xi})=\frac{1}{2}\mathbf{w}^{T}\mathbf{w}+C\sum_{i=1}^{N}\xi_{i}

\left\{\begin{array}{ll}y_{i}[\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_{i})+b]\geq 1-\xi_{i}, & i=1,...,N\\ \xi_{i}\geq 0, & i=1,...,N.\end{array}\right.

y(\mathbf{x}) = \mathrm{sign}\left[\sum_{i=1}^{N} \alpha_{i} y_{i} K(\mathbf{x}_{i}, \mathbf{x}) + b\right],

\begin{array}{ll}K(\mathbf{x},\mathbf{x}_{i})=(1+\mathbf{x}_{i}^{T}\mathbf{x}/c)^{d}, & \mathrm{(polynomial\ kernel)}\\ K(\mathbf{x},\mathbf{x}_{i})=\exp\{-\|\mathbf{x}-\mathbf{x}_{i}\|_{2}^{2}/\sigma^{2}\}, & \mathrm{(RBF\ kernel)}\end{array}

Taxonomy

Topics: Music and Audio Processing · Music History and Culture · Diverse Musicological Studies

Full text

Dance Hit Song Prediction

Dorien Herremans (a), David Martens (b) and Kenneth Sörensen (a)

(a) ANT/OR, University of Antwerp Operations Research Group

(b) Applied Data Mining Research Group, University of Antwerp

Prinsstraat 13, B-2000 Antwerp. Corresponding author. Email: [email protected]

1 Introduction

In 2011 record companies invested a total of 4.5 billion dollars in new talent worldwide (IFPI, 2012). Gaining insight into what actually makes a song a hit would provide tremendous benefits for the music industry. This idea is the main drive behind the new research field referred to as “Hit song science”, which Pachet (2012) defines as “an emerging field of science that aims at predicting the success of songs before they are released on the market”.

There is a large amount of literature available on song writing techniques (Braheny, 2007; Webb, 1999). Some authors even claim to teach the reader how to write hit songs (Leikin, 2008; Perricone, 2000). Yet very little research has been done on the task of automatic prediction of hit songs or detection of their characteristics.

The increase in the amount of digital music available online combined with the evolution of technology has changed the way in which we listen to music. In order to react to new expectations of listeners who want searchable music collections, automatic playlist suggestions, music recognition systems etc., it is essential to be able to retrieve information from music (Casey et al., 2008). This has given rise to the field of Music Information Retrieval (MIR), a multidisciplinary domain concerned with retrieving and analysing multifaceted information from large music databases (Downie, 2003).

Many MIR systems have been developed in recent years and applied to a range of different topics such as automatic classification per genre (Tzanetakis and Cook, 2002), cultural origin (Whitman and Smaragdis, 2002), mood (Laurier et al., 2008), composer (Herremans et al., 2013), instrument (Essid et al., 2006), similarity (Schnitzer et al., 2009), etc. An extensive overview is given by Fu et al. (2011). Yet, as it appears, the use of MIR systems for hit prediction remains relatively unexplored.

The first exploration into the domain of hit science is due to Dhanaraj and Logan (2005). They used acoustic and lyric-based features to build support vector machines (SVM) and boosting classifiers to distinguish top 1 hits from other songs in various styles. Although acoustic and lyric data were only available for 91 songs, their results seem promising. However, the study does not provide details about data gathering, features, applied methods and tuning procedures.

Based on the claim of the unpredictability of cultural markets made by Salganik et al. (2006), Pachet and Roy (2008) examined the validity of this claim on the music market. Based on a dataset they were not able to develop an accurate classification model for low, medium or high popularity based on acoustic and human features. They suggest that the acoustic features they used are not informative enough to be used for aesthetic judgements and suspect that the previously mentioned study (Dhanaraj and Logan, 2005) is based on spurious data or biased experiments.

Borg and Hokkanen (2011) reach conclusions similar to those of Pachet and Roy (2008). They tried to predict the popularity of music videos based on their YouTube view count by training support vector machines but were not successful.

Another experiment was set up by Ni et al. (2011), who claim to have proven that hit song science is once again a science. They were able to obtain more optimistic results by predicting if a song would reach a top 5 position on the UK top 40 singles chart compared to a top 30-40 position. The shifting perceptron model that they built was based on thus far novel audio features mostly extracted from The Echo Nest (echonest.com). Though they describe the features they used on their website (Jehan and DesRoches, 2012), the paper is very short and does not disclose many details about the research, such as data gathering, preprocessing, a detailed description of the technique used or its implementation.

In this research accurate models are built to predict if a song is a top 10 dance hit or not. For this purpose, a dataset of dance hits including some unique audio features is compiled. Based on this data different efficient models are built and compared. To the authors’ knowledge, no previous research has been done on the dance hit prediction problem.

In the next section, the dataset used in this paper is elaborately discussed. In Section 3 the data is visualized in order to detect some temporal patterns. Finally, the experimental setup is described and a number of models are built and tested.

2 Dataset

The dataset used in this research was gathered in a few stages. The first stage involved determining which songs can be considered as hit songs versus which songs cannot. Secondly, detailed information about musical features was obtained for both aforementioned categories.

2.1 Hit Listings

Two hit archives available online were used to create a database of dance hits (see Table 1). The first one is the singles dance archive from the Official Charts Company (OCC, officialcharts.com). The Official Charts Company is operated by both the British Phonographic Industry and the Entertainment Retailers Association (ERA). Their charts are produced based on sales data from retailers through market researcher Millward Brown. The second source is the singles dance archive from Billboard (BB, billboard.com). Billboard is one of the oldest magazines in the world devoted to music and the music industry.

The information was parsed from both websites using the open-source Java HTML parser library JSoup (Houston, 2013) and resulted in a dataset of 21,692 (7,159 + 14,533) listings with 4 features: song title, artist, position and date. A very small number of hit listings could not be parsed and these were left out of the dataset. The peak chart position for each song was computed and added to the dataset as a fifth feature. Table 2 shows an example of the dataset at this point.
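The peak-position step can be sketched as follows. This is a minimal illustration in Python, not the authors' Java code; the tuple layout and the sample listings are assumptions made for the example, and the peak is taken to be the lowest chart position a song reached.

```python
from collections import defaultdict


def add_peak_positions(listings):
    """Append the best (numerically lowest) chart position of each song.

    Each listing is assumed to be (title, artist, position, date).
    """
    peak = defaultdict(lambda: float("inf"))
    for title, artist, position, _date in listings:
        key = (title, artist)
        peak[key] = min(peak[key], position)
    # Add the peak as a fifth field to every listing of the song.
    return [(t, a, p, d, peak[(t, a)]) for t, a, p, d in listings]


# Hypothetical listings, for illustration only.
listings = [
    ("Song A", "Artist X", 12, "02/03/13"),
    ("Song A", "Artist X", 5, "09/03/13"),
    ("Song B", "Artist Y", 40, "09/03/13"),
]
rows = add_peak_positions(listings)
```

Every listing of a song receives the same peak value, which matches the layout of Table 2.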

2.2 Feature Extraction And Calculation

The Echo Nest (echonest.com) was used in order to obtain musical characteristics for the song titles obtained in the previous subsection. The Echo Nest is the world’s leading music intelligence company and has over a trillion data points on over 34 million songs in its database. Its services are used by industry leaders such as Spotify, Nokia, Twitter, MTV, EMI and more (EchoNest, 2013). Bertin-Mahieux et al. (2011) used The Echo Nest to build the Million Song Dataset, a very large freely available dataset that offers a collection of audio features and meta-information for a million contemporary popular songs.

In this research The Echo Nest was used to build a new database mapped to the hit listings. The open-source Java client library jEN for the Echo Nest developer API was used to query the songs (Lamere, 2013). Based on the song title and artist name, The Echo Nest database and Analyzer were queried for each of the parsed hit songs. After some manual and Java-based corrections for spelling irregularities (e.g., Featuring, Feat, Ft.), data was retrieved for 697 out of 759 unique songs from the OCC hit listings and 2,755 out of 3,361 unique songs from the BB hit listings. The songs with missing data were removed from the dataset. The extracted features can be divided into three categories: meta-information, basic features from The Echo Nest Analyzer and temporal features.

2.2.1 Meta-Information

The first category is meta-information such as artist location, artist familiarity, artist hotness, song hotness etc. This is descriptive information about the song, often not related to the audio signal itself. One could follow the statement made by IBM’s Bob Mercer in 1985: “There is no data like more data” (Jelinek, 2005). Yet, for this research, the meta-information is discarded when building the classification models. In this way, the model can work with unknown songs, based purely on audio signals.

2.2.2 Basic Analyzer Features

The next category consists of basic features extracted by The Echo Nest Analyzer (Jehan and DesRoches, 2012). Most of these features are self-explanatory, except for energy and danceability, for which The Echo Nest has not yet released the formula.

Duration

Length of the track in seconds.

Tempo

The average tempo expressed in beats per minute (bpm).

Time signature

A symbolic representation of how many beats there are in each bar.

Mode

Describes if a song’s modality is major (1) or minor (0).

Key

The estimated key of the track, represented as an integer.

Loudness

The loudness of a track in decibels (dB), which correlates to the psychological perception of strength (amplitude).

Danceability

Calculated by The Echo Nest, based on beat strength, tempo stability, overall tempo, and more.

Energy

Calculated by The Echo Nest, based on loudness and segment durations.

A more detailed description of these Echo Nest features is given by Jehan and DesRoches (2012).

2.2.3 Temporal Features

A third category of features was added to incorporate the temporal aspect of the following basic features offered by the Analyzer:

Timbre

A 12-dimensional vector which captures the tone colour for each segment of a song. A segment is a sound entity (typically under a second) relatively uniform in timbre and harmony.

Beatdiff

The time difference between subsequent beats.

Timbre is a very perceptual feature that is sometimes referred to as tone colour. In The Echo Nest, 13 basis vectors are available that are derived from the principal components analysis (PCA) of the auditory spectrogram (Jehan, 2005). The first vector of the PCA is referred to as loudness, as it is related to the amplitude. The following 12 basis vectors are referred to as the timbre vectors. The first one can be interpreted as brightness, as it emphasizes the ratio of high frequencies versus low frequencies, a measure typically correlated to the “perceptual” quality of brightness. The second timbre vector has to do with flatness and narrowness of sound (attenuation of lowest and highest frequencies). The next vector represents the emphasis of the attack (sharpness) (EchoNest, 2013). The timbre vectors after that are harder to label, but can be understood by the spectral diagrams given by Jehan (2005).

In order to capture the temporal aspect of timbre throughout a song Schindler and Rauber (2012) introduce a set of derived features. They show that genre classification can be significantly improved by incorporating the statistical moments of the 12 segment timbre descriptors offered by The Echo Nest. In this research the statistical moments were calculated together with some extra descriptive statistics: mean, variance, skewness, kurtosis, standard deviation, 80th percentile, min, max, range and median.
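The ten descriptive statistics listed above can be sketched for a single timbre dimension as follows. The paper does not state which estimator conventions were used, so the population variance, the moment-based skewness and kurtosis, and the nearest-rank percentile in this sketch are all assumptions; the function name `describe` is hypothetical.

```python
import statistics


def describe(values):
    """Descriptive statistics of one per-segment series (e.g. timbre 1)."""
    n = len(values)
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)  # population variance (assumed)
    # Central moments for skewness and (non-excess) kurtosis.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / var ** 1.5 if var > 0 else 0.0
    kurt = m4 / var ** 2 if var > 0 else 0.0
    ordered = sorted(values)
    rank = (80 * n + 99) // 100  # ceil(0.8 * n), nearest-rank percentile
    return {
        "mean": mean,
        "var": var,
        "skewness": skew,
        "kurtosis": kurt,
        "stdev": var ** 0.5,
        "80perc": ordered[rank - 1],
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "median": statistics.median(ordered),
    }


stats = describe([1.0, 2.0, 3.0, 4.0, 5.0])
```

Applying such a function to each of the 12 timbre dimensions yields 120 temporal timbre features per song.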

Ni et al. (2013) introduce a variable called Beat CV in their model, which refers to the variation of the time between the beats in a song. In this research, the temporal aspect of the time between beats (beatdiff) is taken into account in a more complete way, using all the descriptive statistics from the previous paragraph.
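As a rough illustration, beatdiff can be derived from a list of beat onset times. The onset values below are hypothetical, and the two summary statistics shown are only a subset of those used in the paper.

```python
import statistics


def beat_differences(beat_times):
    """Time difference between each pair of subsequent beats (seconds)."""
    return [later - earlier for earlier, later in zip(beat_times, beat_times[1:])]


# Hypothetical beat onsets in seconds.
beats = [0.50, 1.00, 1.52, 2.01, 2.50]
diffs = beat_differences(beats)

beatdiff_range = max(diffs) - min(diffs)
beatdiff_stdev = statistics.pstdev(diffs)  # population form, an assumption
```

A perfectly steady tempo would give a range and standard deviation of zero, so these features capture how much the beat spacing fluctuates within a song.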

After discarding the meta-information, the resulting dataset contained 139 usable features. In the next section, these features are analysed to discover their evolution over time.

3 Evolution Over Time

The dominant music that people listen to in a certain culture changes over time. It is no surprise that a hit song from the 60s will not necessarily fit in the contemporary charts. Even if we limit ourselves to one particular style of hit songs, namely dance music, a strong evolution can be distinguished between popular 90s dance songs and this week’s hit. In order to verify this statement and gain insight into how characteristics of dance music have changed, the Billboard dataset (BB) with top 10 dance hits from 1985 until now was analysed.

A dynamic chart was used to represent the evolution of four features over time (Google, 2013). Figure 1 shows a screenshot of the Google motion chart (interactive motion chart available at http://antor.ua.ac.be/dance) that was used to visualize the time series data. This graph integrates data mining and information visualization in one discovery tool as it reveals interesting patterns and allows the user to control the visual presentation, thus following the recommendation made by Shneiderman (2002). The x-axis shows the duration and the y-axis is the average loudness per year in Figure 1. Additional dimensions are represented by the size of the bubbles (brightness) and the colour of the bubbles (tempo).

Since a motion chart is a dynamic tool that should be viewed on a computer, a selection of features was extracted to more traditional 2-dimensional graphs with linear regressions (see Figure 2). Since the BB dataset contains 3,361 unique songs, the selected features from these songs were averaged per year in order to limit the number of data points on the graph. A rising trend can be detected for the loudness, tempo and 1st aspect of timbre (brightness). The correlation between loudness and tempo is in line with the rule proposed by Todd (1992): “The faster the louder, the softer the slower”. Not all features have an apparent relationship with time. Energy, for instance (see Figure 2(e)), does not seem to be correlated with time. It is also remarkable that the danceability feature computed by The Echo Nest decreases over time for dance hits. Since no detailed formula was given by The Echo Nest for danceability, this trend cannot be explained.

The next sections describe an experiment which compares several hit prediction models built in this research.

4 Dance Hit Prediction

In this section the experimental setup and preprocessing techniques are described for the classification models built in Section 5.

4.1 Experiment Setup

Figure 3 shows a graphical representation of the setup of the experiment described in Section 6.1. The dataset used for the hit prediction models in this section is based on the OCC listings. The reason for this is that this data contains top 40 songs, not just top 10. This allows us to create a “gap” between the two classes. Since the previous section showed that the characteristics of hit songs evolve over time, it is not representative to use data from 1985 for predicting contemporary hits. The dataset used for building the prediction models consists of dance hit songs from 2009 until 2013.

The peak chart position of each song was used to determine if it is a dance hit or not. Three datasets were made, each with a different gap between the two classes (see Table 3). In the first dataset (D1), hits are considered to be songs with a peak position in the top 10. Non-hits are those that only reached a position between 30 and 40. In the second dataset (D2), the gap between hits and non-hits is smaller, as songs reaching a top position of 20 are still considered to be non-hits. Finally, the original dataset is split in two at position 20, without a gap, to form the third dataset (D3). The reason for not comparing a top 10 hit with a song that did not appear in the charts is to avoid doing accidental genre classification. If a hit dance song were compared to a song that does not occur in the hit listings, a second classification model would be needed to ensure that this non-hit song is in fact a dance song. If not, the developed model might distinguish songs based on whether or not they are a dance song instead of a hit. However, it should be noted that not all songs on the dance hit lists are in fact the same type of dance song; there might be subgenres. Still, they will probably share more common attributes than songs from a random style, thus reducing the noise in the hit classification model. The sizes of the three datasets are listed in Table 3; the difference in size can be explained by the fact that songs are excluded in D1 and D2 to form the gap. In the next sections, models are built and the performance of the classifiers is compared on these three datasets.
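The class definitions of Table 3 can be sketched as a labelling function on the peak position. The exact boundary handling is an assumption: here position 20 counts as a non-hit in D3, by analogy with D2, and the function name `label_song` is hypothetical.

```python
def label_song(peak, dataset):
    """Return 'hit', 'non-hit', or None when the song falls in the gap."""
    if dataset == "D1":
        is_hit, is_non_hit = peak <= 10, 30 <= peak <= 40
    elif dataset == "D2":
        is_hit, is_non_hit = peak <= 10, 20 <= peak <= 40
    elif dataset == "D3":
        # Split at position 20 without a gap (boundary side assumed).
        is_hit, is_non_hit = peak < 20, 20 <= peak <= 40
    else:
        raise ValueError(f"unknown dataset: {dataset}")
    if is_hit:
        return "hit"
    if is_non_hit:
        return "non-hit"
    return None  # excluded from this dataset to form the gap
```

Songs that return `None` are the ones dropped from D1 and D2, which is why those datasets are smaller than D3.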

The Open Source software Weka was used to create the models (Witten and Frank, 2005). Weka’s toolbox and framework is recognized as a landmark system in the data mining and machine learning field (Hall et al., 2009).

4.2 Preprocessing

The class distribution of the three datasets used in the experiment is displayed in Figure 4. Although the distribution is not heavily skewed, it is not completely balanced either. Accuracy is therefore not a suitable measure for evaluating the results, and the area under the receiver operating characteristic curve (AUC) (Fawcett, 2004) was used instead (see Section 6).
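To make the measure concrete, the AUC can be computed as the probability that a randomly chosen hit receives a higher score than a randomly chosen non-hit, with ties counted as one half. The sketch below uses this pairwise formulation; it is an illustration, not the Weka implementation used in the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (1 = hit, 0 = non-hit)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

Because the value depends only on how hits are ranked relative to non-hits, it is not inflated by simply predicting the majority class, which is the concern with accuracy on unbalanced data.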

All of the features in the datasets were standardized using statistical normalization, and feature selection was done (see Figure 3) using the procedure CfsSubsetEval from Weka with GeneticSearch. This procedure uses the individual predictive ability of each feature and the degree of redundancy between them to evaluate the worth of a subset of features (Hall, 1999). Feature selection was done in order to avoid the “curse of dimensionality” caused by a very sparse feature set. McKay and Fujinaga (2006) point to the fact that having a limited number of features allows for thorough testing of the model with limited instances and can thus improve the quality of the classification model. Added benefits are the improved comprehensibility of a model with a limited number of highly predictive variables (Hall, 1999) and better performance of the learning algorithm (Piramuthu, 2004).
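The standardization step can be sketched as a z-score transform per feature column. Whether the population or the sample standard deviation was used is not stated in the text, so the population form below is an assumption, and `standardize` is a hypothetical helper name.

```python
import statistics


def standardize(column):
    """Rescale one feature column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumed)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


z = standardize([1.0, 2.0, 3.0])
```

After this transform all features share a common scale, which is why the thresholds in the ruleset of Table 5 are small values around zero.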

The feature selection procedure in Weka reduces the data to 35–50 attributes, depending on the dataset. The most commonly occurring features after feature selection are listed in Table 4. It is interesting to note that the features danceability and energy both disappear from the reduced datasets, with the exception of danceability, which stays in the D3 dataset. This could be explained by the fact that these features are calculated by The Echo Nest based on other features.

5 Classification Techniques

A total of five models were built for each dataset using diverse classification techniques. The first two models (decision tree and ruleset) can be considered the easiest classification models to understand due to their linguistic nature (Martens, 2008). The other three models focus on accurate prediction. In the following subsections, the individual algorithms are briefly discussed together with their main parameters and settings, followed by a comparison in Section 6. The AUC values mentioned in this section are based on 10-fold cross validation performance (Witten and Frank, 2005). The shown models are built on the entire dataset.

5.1 C4.5 Tree

A decision tree for dance hit prediction was built with J48, Weka’s implementation of the popular C4.5 algorithm (Witten and Frank, 2005).

The tree data structure consists of decision nodes and leaves. The class value is specified by the leaves, in this case hit or non-hit, and the nodes specify a test of one of the features. When a path from the root node to a leaf is followed based on the feature values of a particular song, a predictive rule can be derived (Ruggieri, 2002).

A “divide and conquer” approach is used by the C4.5 algorithm to build trees recursively (Quinlan, 1993). This is a top down approach, in which a feature is sought that best separates the classes, followed by pruning of the tree (Wu et al., 2008). This pruning is performed by a subtree raising operation in an inner cross-validation loop (3 folds by default in Weka) (Witten and Frank, 2005).

Decision trees have been used in a broad range of fields such as credit scoring (Hand and Henley, 1997), land cover mapping (Friedl and Brodley, 1997), medical diagnosis (Wolberg and Mangasarian, 1990), estimation of toxic hazards (Cramer et al., 1976), predicting customer behaviour changes (Kim et al., 2005) and others.

For the comparative tests in Section 6, Weka’s default settings were kept for J48. In order to create a simple abstracted model on dataset D1 (FS) for visual insight into the important features, a less accurate model (AUC 0.54) was created by pruning the tree to depth four. The resulting tree is displayed in Figure 5. It is noticeable that time differences between the third, fourth and ninth timbre vectors seem to be important features for classification.

5.2 RIPPER Ruleset

Much like trees, rulesets are a useful tool to gain insight in the data. They have been used in other fields to gain insight in diagnosis of technical processes (Isermann and Balle, 1997), credit scoring (Baesens et al., 2003), medical diagnosis (Kononenko, 2001), customer relationship management (Ngai et al., 2009) and more.

In this section JRip, Weka’s implementation of the propositional rule learner RIPPER (Cohen, 1995), was used to inductively build “if-then” rules. The “Repeated Incremental Pruning to Produce Error Reduction” algorithm (RIPPER) uses sequential covering to generate the ruleset. In a first step of this algorithm, one rule is learned and the training instances that are covered by this rule are removed. This process is then repeated (Hall et al., 2009).

The ruleset displayed in Table 5 was generated with Weka’s default parameters for number of data instances (2) and folds (3) (AUC = 0.56 on dataset D1, see Table 7). It is notable that the third timbre vector is an important feature again. It would appear that this feature should not be underestimated when composing dance songs.
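Read as an ordered list with a default class, the ruleset of Table 5 can be applied to a song as in the sketch below. The dictionary keys are hypothetical names for the standardized features (for instance `T8_80perc` for the 80th percentile of the eighth timbre vector); the thresholds are copied from Table 5.

```python
def ripper_classify(song):
    """Apply the two NoHit rules of Table 5 in order, defaulting to Hit."""
    if (song["T1mean"] <= -0.020016
            and song["T3min"] <= -0.534123
            and song["T2max"] >= -0.250608):
        return "NoHit"
    if song["T8_80perc"] <= -0.405264 and song["T3mean"] <= -0.075106:
        return "NoHit"
    return "Hit"  # default rule: no NoHit rule fired


# Hypothetical standardized feature values for one song.
example = {"T1mean": 0.5, "T3min": -1.0, "T2max": 0.0,
           "T8_80perc": 0.0, "T3mean": 0.0}
prediction = ripper_classify(example)
```

The first rule that fires determines the class, so a song is labelled a hit only when neither NoHit rule covers it.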

5.3 Naive Bayes

The naive Bayes classifier estimates the probability of a hit or non-hit based on the assumption that the features are conditionally independent. This conditional independence assumption is represented by equation (1) given class label $y$ (Tan et al., 2007).

$$P(\mathbf{x} \mid Y = y) = \prod_{j=1}^{M} P(x_{j} \mid Y = y), \qquad (1)$$

whereby each attribute set $\mathbf{x}=\{x_{1},x_{2},\dots,x_{M}\}$ consists of $M$ attributes.

Because of the conditional independence assumption, the class-conditional probability for every combination of $\mathbf{x}$ does not need to be calculated. Only the conditional probability of each $x_{j}$ given $Y$ has to be estimated. This offers a practical advantage since a good estimate of the probability can be obtained without the need for a very large training set.

Naive Bayes classifies a test record by calculating the posterior probability for each class Y (Lewis, 1998):

$$P(Y \mid \mathbf{x}) = \frac{P(Y) \cdot \prod_{j=1}^{M} P(x_{j} \mid Y)}{P(\mathbf{x})} \qquad (2)$$

Although this independence assumption is generally a poor assumption in practice, numerous studies show that naive Bayes competes well with more sophisticated classifiers (Rish, 2001). In particular, naive Bayes seems to be resistant to isolated noise points and robust to irrelevant attributes, but its performance can be degraded by correlated attributes (Tan et al., 2007). Table 7 confirms that naive Bayes performs very well, with an AUC of 0.65 on dataset D1 (FS).
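The posterior computation can be sketched as follows. Modelling each numeric feature with a per-class Gaussian is an assumption made for this illustration, and the parameter layout (`priors`, `params`) is hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used as P(x_j | Y)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def posterior(x, priors, params):
    """Posterior P(Y | x) per class under the independence assumption.

    priors: {class: P(Y)}; params: {class: [(mean, std) per feature]}.
    """
    joint = {}
    for label, prior in priors.items():
        likelihood = 1.0
        for x_j, (mean, std) in zip(x, params[label]):
            likelihood *= gaussian_pdf(x_j, mean, std)  # product over features
        joint[label] = prior * likelihood
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {label: value / evidence for label, value in joint.items()}


priors = {"hit": 0.5, "non-hit": 0.5}
params = {"hit": [(1.0, 1.0), (1.0, 1.0)], "non-hit": [(-1.0, 1.0), (-1.0, 1.0)]}
probs = posterior([1.0, 1.0], priors, params)
```

Only one mean and one standard deviation per feature and class have to be estimated, which illustrates why a modest training set can suffice.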

5.4 Logistic Regression

The SimpleLogistic function in Weka was used to build a logistic regression model (Witten and Frank, 2005).

Equation (3) shows the output of a logistic regression, whereby $f_{hit}(s_{i})$ represents the probability that a song $i$ with $M$ features $x_{j}$ is a dance hit. This probability follows a logistic curve, as can be seen in Figure 6. The cut-off point of 0.5 will determine if a song is classified as a hit or a non-hit. With AUC = 0.65 for dataset D1 and AUC = 0.67 for dataset D2 (see Table 7), logistic regression performs best for this particular classification problem.

$$f_{hit}(s_{i}) = \frac{1}{1 + e^{-s_{i}}} \quad \text{whereby} \quad s_{i} = b + \sum_{j=1}^{M} a_{j} \cdot x_{j} \qquad (3)$$
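A minimal sketch of this scoring step is given below. The coefficient values used in any call are hypothetical, and treating a probability of exactly 0.5 as a hit is an assumption.

```python
import math


def hit_probability(x, a, b):
    """Logistic output f_hit(s) with s = b + sum_j a_j * x_j."""
    s = b + sum(a_j * x_j for a_j, x_j in zip(a, x))
    return 1.0 / (1.0 + math.exp(-s))


def classify(x, a, b, cutoff=0.5):
    """Label a song using the cut-off point on the predicted probability."""
    return "hit" if hit_probability(x, a, b) >= cutoff else "non-hit"
```

For example, `classify([1.0, 0.0], [2.0, 0.0], -1.0)` gives a linear score of 1 and a probability of about 0.73, so the song is labelled a hit.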

Logistic regression models generally require limited computing power and are less prone to overfitting than other models such as neural networks (Tu, 1996). Like the previously mentioned models, they are also used in a number of domains, such as the creation of habitat models for animals (Pearce and Ferrier, 2000), medical diagnosis (Kurt et al., 2008), credit scoring (Wiginton, 1980) and others.

5.5 Support Vector Machines

Weka’s sequential minimal optimization algorithm (SMO) was used to build two support vector machine classifiers. The support vector machine (SVM) is a learning procedure based on statistical learning theory (Vapnik, 1995). Given a training set of $N$ data points $\{({\mathbf{x}}_{i},y_{i})\}_{i=1}^{N}$ with input data ${\mathbf{x}}_{i}\in\mathbb{R}^{n}$ and corresponding binary class labels $y_{i}\in\{-1,+1\}$, the SVM classifier should fulfil the following conditions (Cristianini and Shawe-Taylor, 2000; Vapnik, 1995):

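$$\begin{cases}\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_{i})+b\geq+1,&\text{if }y_{i}=+1\\ \mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_{i})+b\leq-1,&\text{if }y_{i}=-1\end{cases} \tag{4}$$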

which is equivalent to

$$y_{i}[\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_{i})+b]\geq 1,\quad i=1,\ldots,N. \tag{5}$$

The non-linear function $\boldsymbol{\varphi}(\cdot)$ maps the input space to a high (possibly infinite) dimensional feature space. In this feature space, the above inequalities construct a hyperplane ${\mathbf{w}}^{T}\boldsymbol{\varphi}({\mathbf{x}})+b=0$ discriminating between the two classes. By minimizing ${\mathbf{w}}^{T}{\mathbf{w}}$, the margin between both classes is maximized.

In primal weight space the classifier then takes the form

$$y(\mathbf{x})=\mathrm{sign}[\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x})+b], \tag{6}$$

but, on the other hand, is never evaluated in this form. One defines the convex optimization problem:

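$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\mathcal{J}(\mathbf{w},b,\boldsymbol{\xi})=\frac{1}{2}\mathbf{w}^{T}\mathbf{w}+C\sum_{i=1}^{N}\xi_{i} \tag{7}$$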

subject to

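$$\begin{cases}y_{i}[\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_{i})+b]\geq 1-\xi_{i},&i=1,\ldots,N\\ \xi_{i}\geq 0,&i=1,\ldots,N.\end{cases} \tag{8}$$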

The variables $\xi_{i}$ are slack variables, which are needed to allow misclassifications in the set of inequalities (e.g., due to overlapping distributions). The first part of the objective function maximizes the margin between both classes in the feature space and acts as a regularisation mechanism that penalizes large weights, whereas the second part minimizes the misclassification error. The positive real constant $C$ is the regularisation coefficient and should be considered a tuning parameter of the algorithm.

This leads to the following classifier (Cristianini and Shawe-Taylor, 2000):

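$$y(\mathbf{x})=\mathrm{sign}\left[\sum_{i=1}^{N}\alpha_{i}\,y_{i}\,K(\mathbf{x}_{i},\mathbf{x})+b\right], \tag{9}$$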

whereby $K({\mathbf{x}}_{i},{\mathbf{x}})=\boldsymbol{\varphi}({\mathbf{x}}_{i})^{T}\boldsymbol{\varphi}({\mathbf{x}})$ is taken with a positive definite kernel satisfying the Mercer theorem. The Lagrange multipliers $\alpha_{i}$ are then determined by optimizing the dual problem. The following kernel functions $K(\cdot,\cdot)$ were used:

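$$\begin{aligned}K(\mathbf{x},\mathbf{x}_{i})&=(1+\mathbf{x}_{i}^{T}\mathbf{x}/c)^{d}&&\text{(polynomial kernel)}\\ K(\mathbf{x},\mathbf{x}_{i})&=\exp\{-\|\mathbf{x}-\mathbf{x}_{i}\|_{2}^{2}/\sigma^{2}\}&&\text{(RBF kernel)}\end{aligned} \tag{10}$$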

where $d$ , $c$ and $\sigma$ are constants.

For low-noise problems, many of the $\alpha_{i}$ will typically be equal to zero (sparseness property). The training observations corresponding to non-zero $\alpha_{i}$ are called support vectors and are located close to the decision boundary.
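To make the role of the kernel and of the support vectors concrete, the following sketch evaluates a kernel-based SVM decision function on toy values. It is not the SMO algorithm: the multipliers $\alpha_{i}$, the bias and the support vectors are hypothetical rather than learnt, the kernel forms $(1+\mathbf{x}_{i}^{T}\mathbf{x}/c)^{d}$ and $\exp(-\|\mathbf{x}-\mathbf{x}_{i}\|^{2}/\sigma^{2})$ are assumed, and a sum of exactly zero is arbitrarily assigned to the positive class.

```python
import math

def polynomial_kernel(x_i, x, c=1.0, d=2):
    """Polynomial kernel (1 + x_i . x / c)^d with constants c and d."""
    dot = sum(a * b for a, b in zip(x_i, x))
    return (1.0 + dot / c) ** d

def rbf_kernel(x_i, x, sigma=1.0):
    """RBF kernel exp(-||x - x_i||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-sq_dist / sigma ** 2)

def svm_decision(x, support_vectors, labels, alphas, b, kernel):
    """Sign of the kernel expansion over the support vectors (non-zero alphas only)."""
    total = sum(alpha * y_i * kernel(x_i, x)
                for x_i, y_i, alpha in zip(support_vectors, labels, alphas))
    return 1 if total + b >= 0 else -1

# Hypothetical support vectors, labels and multipliers (not learnt from data).
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [+1, -1]
alphas = [0.5, 0.5]
bias = 0.0

print(svm_decision([0.9, 1.2], support_vectors, labels, alphas, bias, rbf_kernel))
print(svm_decision([-1.0, -0.5], support_vectors, labels, alphas, bias, rbf_kernel))
```

Training points with $\alpha_{i}=0$ drop out of the sum, so only the support vectors need to be stored to classify a new song.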

As equation (9) shows, the SVM classifier with a non-linear kernel is a complex, non-linear function. Trying to comprehend the logic of the classifications made is quite difficult, if not impossible (Martens et al., 2009; Martens and Provost, 2014).

In this research, the polynomial kernel and the RBF kernel were used to build the models. Although Weka’s default settings were used for the previous models, the hyperparameters of the SVM models were optimized. GridSearch in Weka was used to determine the optimal settings for the regularisation parameter $C$ (1, 3, 5,…21), the $\sigma$ of the RBF kernel ( $\frac{1}{\sigma^{2}}$ = 0.00001, 0.0001,…10) and the exponent $d$ of the polynomial kernel (1, 2). The choice of hyperparameter values to test was inspired by the settings suggested by Weka (2013b). GridSearch first performs 2-fold cross validation on the initial grid, which is determined by the two input parameters ( $C$ and $\sigma$ for the RBF kernel, $C$ and $d$ for the polynomial kernel). 10-fold cross validation is then performed on the best point of the grid, selected on the AUC weighted by class size, and on its adjacent points. If a better pair is found, the procedure is repeated on its neighbours until no better pair is found or the border of the grid is reached (Weka, 2013a). This hyperparameter optimization is performed in the “classification model” box in Figure 3. The resulting AUC is 0.59 for the SVM with polynomial kernel and 0.56 for the SVM with RBF kernel on D1 (FS) (see the AUC results table).
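The two-stage search described above can be sketched as a simple hill climb over the grid. This is a simplified illustration and not Weka's GridSearch code: `coarse_score` and `fine_score` are hypothetical stand-ins for the 2-fold and 10-fold cross-validated AUC, and the toy score functions are invented so that the example runs without any data.

```python
import math

def grid_search(xs, ys, coarse_score, fine_score):
    """Hill-climbing search over a 2-D grid; returns (x, y, score) of the best point found."""
    # Step 1: evaluate every point of the initial grid with the cheap score.
    points = [(i, j) for i in range(len(xs)) for j in range(len(ys))]
    best = max(points, key=lambda p: coarse_score(xs[p[0]], ys[p[1]]))
    best_score = fine_score(xs[best[0]], ys[best[1]])
    # Step 2: evaluate the neighbours of the current best point with the
    # expensive score, and repeat for as long as a better pair is found.
    improved = True
    while improved:
        improved = False
        i0, j0 = best
        for i in range(max(0, i0 - 1), min(len(xs), i0 + 2)):      # stop at the grid border
            for j in range(max(0, j0 - 1), min(len(ys), j0 + 2)):
                score = fine_score(xs[i], ys[j])
                if score > best_score:
                    best, best_score, improved = (i, j), score, True
    return xs[best[0]], ys[best[1]], best_score

# Grids matching the ranges given in the text for the RBF kernel.
c_values = list(range(1, 22, 2))                  # 1, 3, 5, ..., 21
inv_sigma_sq = [10.0 ** e for e in range(-5, 2)]  # 0.00001, 0.0001, ..., 10

# Toy scores (assumptions): the cheap estimate peaks at C = 7, the careful one at C = 5.
def coarse(c, g):
    return -(c - 7) ** 2 - (math.log10(g) + 2) ** 2

def fine(c, g):
    return -(c - 5) ** 2 - (math.log10(g) + 2) ** 2

best_c, best_g, best_score = grid_search(c_values, inv_sigma_sq, coarse, fine)
print(best_c, best_g)
```

In the toy run, the coarse pass selects $C=7$ and the neighbourhood search then moves to $C=5$, which shows how the second stage can correct a noisy first estimate.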

6 Results

In this section, two experiments are described. The first one builds models for all of the datasets (D1, D2 & D3), both with and without feature selection. The evaluation is done by taking the average of 10 runs, each with a 10-fold cross validation procedure. In the second experiment, the performance of the classifiers on the best dataset is compared with an out-of-time test set.

6.1 Full Experiment With Cross-validation

A comparison of the accuracy and the AUC of all of the above mentioned classifiers is displayed in the accuracy and AUC results tables. The tests were run 10 times, each time with stratified 10-fold cross validation (10CV), both with and without feature selection (FS). This process is depicted in Figure 3. As mentioned in Section 4.2, AUC is the more suitable measure since the datasets are not entirely balanced (Fawcett, 2004), yet both measures are displayed for completeness. During the cross validation procedure, the dataset is divided into 10 folds; 9 of them are used for model building and 1 for testing. This procedure is repeated 10 times, so that each fold serves as the test set once. The displayed AUC and accuracy in this subsection are the average results over the 10 test sets and the 10 runs. The resulting model is built on the entire dataset and can be expected to have a performance which is at least as good as the 10CV performance. A Wilcoxon signed-rank test is conducted to compare the performance of the models with the best performing model. The null hypothesis of this test states: “There is no difference in the performance of a model with the best model”.
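The evaluation protocol (10 runs of stratified 10-fold cross validation, averaged) can be sketched as follows. This is a minimal sketch, not Weka's implementation: `evaluate` is a hypothetical callback that would train a classifier on the training indices and return its AUC or accuracy on the test indices, and the class sizes in the toy example are made up.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign indices to k folds so that each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)  # deal indices round-robin
        offset += len(indices)
    return folds

def repeated_cv(labels, evaluate, runs=10, k=10):
    """Average evaluate(train_idx, test_idx) over all folds of all runs."""
    scores = []
    for run in range(runs):
        folds = stratified_folds(labels, k, seed=run)  # a new shuffle for every run
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)

# Toy labels with made-up class sizes.
labels = ["hit"] * 250 + ["non-hit"] * 150

def share_of_hits(train_idx, test_idx):
    """Stand-in for a real model evaluation: the fraction of hits in the test fold."""
    return sum(labels[i] == "hit" for i in test_idx) / len(test_idx)

mean_score = repeated_cv(labels, share_of_hits)
print(mean_score)
```

Because the folds are stratified, every test fold in this toy example contains the same share of hits as the full dataset.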

As described in the previous section, decision trees and rulesets do not always offer the most accurate classification results, but their main advantage is their comprehensibility (Craven and Shavlik, 1996). It is rather surprising that support vector machines do not perform very well on this particular problem. The overall best technique seems to be the logistic regression, closely followed by naive Bayes. Another conclusion from the table is that feature selection seems to have a positive influence on the AUC for D1 and D3. As expected, the overall best results when taking into account both AUC and accuracy can be obtained using the dataset with the biggest gap, namely D1.

Since logistic regression appears to be the best model overall, its receiver operating characteristic (ROC) curve is displayed in Figure 8. The ROC curve displays the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of the logistic classifier with 10-fold cross validation for D1 (FS). The model clearly scores better than a random classification, which is represented by the diagonal through the origin.
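How an ROC curve and its AUC follow from predicted probabilities can be sketched as follows. The scores and labels are toy values, not outputs of the models in this paper; the AUC is computed with the trapezoidal rule, which is one common choice.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the cut-off; labels are 1 (hit) or 0."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all instances that share the same score in one step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy predicted probabilities and true classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(auc(points))                        # about 0.889 for this toy example
print(auc([(0.0, 0.0), (1.0, 1.0)]))      # 0.5, the random-classification diagonal
```

The AUC equals the share of (hit, non-hit) pairs in which the hit receives the higher score: 8 of 9 pairs in this example.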

The confusion matrix of the logistic regression shows that 209 hits (i.e., 83% of the actual hits) were correctly classified as hits, whereas only 47 non-hits (i.e., 32% of the actual non-hits) were correctly classified as non-hits. The model is thus considerably better at recognising hits than non-hits. Overall, however, it makes a fairly good distinction between the classes, which suggests that dance hit prediction can be tackled as a realistic top 10 versus top 30-40 classification problem with logistic regression.
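The per-class rates can be recomputed from the confusion matrix counts, as in the sketch below. Only the two correctly classified counts (209 and 47) and the rounded percentages come from the text; the class totals are backed out from those rounded percentages and are therefore approximations, not figures reported in the paper.

```python
def rates(tp, fn, tn, fp):
    """True positive rate, true negative rate and accuracy from confusion matrix counts."""
    return {
        "tpr": tp / (tp + fn),
        "tnr": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

tp, tn = 209, 47                 # reported in the text
hits = round(tp / 0.83)          # approximate number of actual hits (inferred)
non_hits = round(tn / 0.32)      # approximate number of actual non-hits (inferred)

summary = rates(tp, hits - tp, tn, non_hits - tn)
print(hits, non_hits, summary)
```

Under these inferred totals the accuracy is roughly 64%, driven mainly by the high recall on the larger hit class.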

6.2 Experiment With Out-of-time Test Set

A second experiment was conducted with an out-of-time test set based on D1 with feature selection. The instances were first ordered by date, and then split into a 90% training set and a 10% test set. The results table for this experiment confirms the good performance of the logistic regression. A peculiar observation from this table is that the model seems to predict better for newer songs (AUC: 0.81 versus 0.65). This may be due to coincidence, to a different class distribution between the training and test set (see Figure 9), or to the structure of the dataset. One speculation of the authors is that the oldest instances of the dataset might be “lingering” hits, meaning that they were top 10 hits on a date before the earliest entry in the dataset and were still present at a low position in the hit listings used. These songs would be falsely seen as non-hits, which might cause the model to predict less well for older songs.
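The out-of-time split itself can be sketched in a few lines. The field names (`title`, `date`) and the toy records are hypothetical; the only elements taken from the text are the ordering by date and the 90%/10% proportion.

```python
from datetime import date

def out_of_time_split(instances, train_fraction=0.9):
    """Order instances by date, then train on the oldest part and test on the newest."""
    ordered = sorted(instances, key=lambda inst: inst["date"])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Ten toy instances with made-up dates, deliberately not in chronological order.
songs = [{"title": f"song {n}", "date": date(2010 + n % 3, 1 + n, 1)} for n in range(10)]

train, test = out_of_time_split(songs)
print(len(train), len(test))
print(max(s["date"] for s in train), min(s["date"] for s in test))
```

Unlike cross validation, every test instance here is at least as recent as every training instance, which mimics predicting future hits from past data.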

7 Conclusion

Multiple models were built that can successfully predict if a dance song is going to be a top 10 hit versus a lower positioned dance song. In order to do this, hit listings from two chart magazines were collected and mapped to audio features provided by The Echo Nest. Standard audio features were used, as well as more advanced features that capture the temporal aspect. This resulted in a model that could accurately predict top 10 dance hits.

This research shows that the popularity of dance songs can be learnt from the analysis of music signals. Previous, less successful studies in this field speculate that their results could be due to features that are not informative enough (Pachet and Roy, 2008). The positive results from this paper could indeed be due to the use of more advanced temporal features. A second cause might be the use of “recent” songs only, which limits the effect of hit music evolving over time. It might also be due to the nature of dance music, or to the fact that focussing on one particular style of music reduces any noise created by classifying hits of different genres. Finally, by comparing different classifiers that differ significantly in performance, the best model could be selected.

This model was implemented in an online application where users can upload their audio data and get the probability of it being a hit (http://antor.ua.ac.be/dance). An interesting future expansion would be to improve the accuracy of the model by including more features such as lyrics, social network information and others. The model could also be expanded to predict hits of other musical styles. In line with research on automatic composition systems (Herremans and Sörensen, 2013), it would also be interesting to see if the classification models from this paper could be included in an optimization function (e.g., a type of fitness function) and used to generate new dance hits or improve existing ones.

Acknowledgement

This research has been partially supported by the Interuniversity Attraction Poles (IUAP) Programme initiated by the Belgian Science Policy Office (COMEX project).

Bibliography

Baesens et al. [2003] B. Baesens, R. Setiono, C. Mues, and J. Vanthienen. Using neural network rule extraction and decision tables for credit-risk evaluation. Management Science, 49(3):312–329, 2003.
Bertin-Mahieux et al. [2011] T. Bertin-Mahieux, D.P.W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
Borg and Hokkanen [2011] N. Borg and G. Hokkanen. What makes for a hit pop song? What makes for a pop song? 2011. URL http://cs229.stanford.edu/proj2011/BorgHokkanen-WhatMakesForAHitPopSong.pdf.
Braheny [2007] J. Braheny. Craft and Business of Songwriting 3rd Edition (Craft & Business of Songwriting). F & W Publications, 2007.
Casey et al. [2008] M.A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. Content-based music information retrieval: Current directions and future challenges. Proceedings of the IEEE, 96(4):668–696, 2008.
Cohen [1995] W. Cohen. Fast effective rule induction. In Armand Prieditis and Stuart Russell, editors, Proceedings of the 12th International Conference on Machine Learning, pages 115–123, Tahoe City, CA, 1995. Morgan Kaufmann Publishers.
Cramer et al. [1976] G.M. Cramer, R.A. Ford, and R.L. Hall. Estimation of toxic hazard—a decision tree approach. Food and Cosmetics Toxicology, 16(3):255–276, 1976.
Craven and Shavlik [1996] M.W. Craven and J.W. Shavlik. Extracting tree-structured representations of trained networks. Advances in Neural Information Processing Systems, 8:24–30, 1996.