# The Cadenza lyric intelligibility prediction (CLIP) dataset

**Authors:** Gerardo Roa-Dabike, Trevor J. Cox, Jon P. Barker, Bruno M. Fazenda, Simone Graetzer, Rebecca R. Vos, Michael A. Akeroyd, Jennifer Firth, William M. Whitmer, Scott Bannister, Alinka Greasley

PMC · DOI: 10.1016/j.dib.2026.112466 · Data in Brief · 2026-01-14

## TL;DR

This paper introduces CLIP, a large dataset for predicting lyric intelligibility in music using machine learning, designed for the Cadenza ICASSP 2026 Signal Processing Grand Challenge.

## Contribution

The paper introduces CLIP, which the authors describe as currently the only publicly available large-scale dataset for predicting lyric intelligibility in music.

## Key findings

- CLIP contains 11,072 music signals with ground truth lyrics and intelligibility scores.
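- A hearing loss simulation was applied to the stereo audio, giving signals with no, mild or moderate hearing loss so that more diverse hearing is represented.
- Ground truth transcriptions were produced by seven native English speakers; listening test responses came from participants who self-reported as normal-hearing native English speakers.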

## Abstract

This paper presents CLIP, a dataset of 11,072 popular western music signals sourced from independent artists, accompanied by ground truth lyrics and lyric intelligibility scores from listening tests. The dataset is designed to facilitate music information retrieval (MIR) research using machine learning. It was created to allow the development of algorithms to predict lyric intelligibility for the Cadenza ICASSP 2026 Signal Processing Grand Challenge. Currently, it is the only publicly available large-scale dataset for this task. The music was sourced from the Free Music Archive (FMA) dataset and is unlikely to be familiar to listeners. We excluded tracks whose license did not allow derivative works and those that did not have English singing. Ground truth transcriptions were generated by seven native English speakers, resulting in 3700 excerpts of 5 to 10 words each from 1452 different songs. A hearing loss simulation was also applied to the stereo audio, resulting in 11,100 music signals with no, mild or moderate hearing loss, so that more diverse hearing is represented in the dataset. Human transcriptions were then collected via an online listening experiment. Participants self-reported as having normal hearing and being native English speakers. They listened to each music signal twice before transcribing each line. Final intelligibility scores were the ratio of matching words between the listening test responses and the ground truth transcriptions. The final dataset consists of audio, ground truth lyrics, intelligibility scores and associated metadata.
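The scoring rule in the abstract (the ratio of matching words between a listener's response and the ground truth transcription) can be illustrated with a minimal Python sketch. The abstract does not say how words are normalised or matched, so everything here is an assumption for illustration: the function names `normalise` and `intelligibility_score` are hypothetical, text is lowercased and stripped of punctuation, and words are matched as an order-insensitive multiset. The paper's actual scoring procedure may differ.

```python
import re
from collections import Counter


def normalise(text: str) -> list[str]:
    """Lowercase and split into word tokens (assumed normalisation)."""
    # Keep letters, digits and apostrophes; drop all other punctuation.
    return re.findall(r"[a-z0-9']+", text.lower())


def intelligibility_score(reference: str, response: str) -> float:
    """Fraction of ground truth words that also appear in the response.

    Assumption: matching ignores word order and counts each ground truth
    word at most as often as it occurs in the response.
    """
    ref_words = normalise(reference)
    if not ref_words:
        raise ValueError("reference lyric has no words")
    ref_counts = Counter(ref_words)
    resp_counts = Counter(normalise(response))
    # Multiset intersection keeps the minimum count of each shared word.
    matched = sum((ref_counts & resp_counts).values())
    return matched / len(ref_words)


# Made-up lyric line and response, not taken from the dataset.
score = intelligibility_score(
    "I walked along the river at night",
    "i walked along a river tonight",
)
print(f"{score:.3f}")
```

In this made-up example four of the seven ground truth words ("i", "walked", "along", "river") appear in the response, so the score is 4/7. An alignment-based match that respects word order would be a stricter alternative to the multiset match assumed here.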

## Full-text entities

- **Diseases:** hearing loss (MESH:D034381)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12861282/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12861282/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/PMC12861282/full.md

---
Source: https://tomesphere.com/paper/PMC12861282