Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering
Aris J. Aristorenas

TL;DR
This paper introduces a machine learning framework that extracts MFCC, Chroma, Spectral Contrast, and Temporal features from audio to assess similarity between YouTube music covers and their original songs, and to predict comment-derived sentiment scores on a 0-100 scale.
Contribution
It combines audio feature extraction with regression modeling to predict sentiment scores for audio content, and contributes a new dataset of YouTube music covers paired with their original songs.
Findings
Achieved RMSE values of 3.420 (MFCC), 5.482 (Chroma), 2.783 (Spectral Contrast), and 4.212 (Temporal) in sentiment score prediction on a 0-100 scale
Demonstrated the effectiveness of MFCC, Chroma, Spectral Contrast, and Temporal features
Improved performance over a baseline model based on absolute difference metrics
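For readers unfamiliar with the RMSE metric named in the findings above, the following is a minimal sketch of its standard definition in plain Python. The function name and the sample scores are hypothetical and chosen only for illustration; they are not taken from the paper or its dataset.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences of scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sentiment scores on a 0-100 scale (illustrative values only).
example_predicted = [72.0, 65.0, 80.0, 58.0]
example_actual = [70.0, 68.0, 77.0, 62.0]
example_rmse = rmse(example_predicted, example_actual)
```

Because RMSE is expressed in the same units as the target, a value of about 3 on a 0-100 scale means the predictions are typically off by roughly three points.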
Abstract
This study presents a machine learning framework for assessing similarity between audio content and predicting sentiment score. We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments, serving as proxy labels for content quality. Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations through Mel-Frequency Cepstral Coefficients (MFCC), Chroma, Spectral Contrast, and Temporal characteristics. Leveraging these features, we train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212, respectively. Improvements over a baseline model based on absolute difference metrics are observed. These results…
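The segmentation step mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say whether the 30-second windows overlap or how a short trailing remainder is handled, so this sketch uses non-overlapping windows and drops the remainder. The function name, the toy sample rate, and the input signal are all hypothetical. Feature extraction (MFCC, Chroma, Spectral Contrast, Temporal) would then run on each window and is not shown here.

```python
def segment_into_windows(samples, sample_rate, window_seconds=30.0):
    """Split a 1-D sequence of audio samples into consecutive windows.

    Assumptions (not confirmed by the paper): windows do not overlap,
    and a trailing remainder shorter than one window is discarded.
    """
    window_length = int(round(window_seconds * sample_rate))
    if window_length <= 0:
        raise ValueError("window length must be at least one sample")
    n_windows = len(samples) // window_length
    return [
        samples[i * window_length:(i + 1) * window_length]
        for i in range(n_windows)
    ]


# Toy example: 75 seconds of silence at a deliberately tiny sample rate
# of 100 Hz, so that the list stays small.
toy_sample_rate = 100
toy_signal = [0.0] * (75 * toy_sample_rate)
toy_windows = segment_into_windows(toy_signal, toy_sample_rate)
```

With these toy inputs, the 75-second signal yields two full 30-second windows, and the final 15 seconds are discarded under the stated assumption.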
Taxonomy
Topics: Music and Audio Processing
