# On the representation of speech and music

**Authors:** David N. Levin

arXiv: 1905.03278 · 2019-05-10

## TL;DR

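This paper argues that an "inner" time series, derived from the sensor measurements of an audio signal, encodes speech independently of the speaker and music independently of the instrument. The argument holds when different speakers (or instruments) produce sensor states that are invertibly mapped onto one another. Under that condition, automatic recognition can be trained on the inner time series of a single speaker.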

## Contribution

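The paper shows that the inner time series of speech is speaker-independent when speakers mimic one another through invertible mappings of their sensor states, and it suggests by a similar argument that the inner time series of music is instrument-independent. As a result, ASR training can be simplified to collecting and labelling the inner time series of one speaker's utterances.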

## Key findings

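- The inner time series is unaffected by any invertible transformation of the sensor measurements, so speakers whose sensor states are invertibly mapped onto one another produce the same inner time series when they say the same words.
- ASR training can be simplified by labelling the inner time series of one speaker's utterances instead of the sensor time series of many speakers.
- Experiments on monophonic electronic music demonstrate that the inner time series of music is instrument-independent.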

## Abstract

In most automatic speech recognition (ASR) systems, the audio signal is processed to produce a time series of sensor measurements (e.g., filterbank outputs). This time series encodes semantic information in a speaker-dependent way. An earlier paper showed how to use the sequence of sensor measurements to derive an "inner" time series that is unaffected by any previous invertible transformation of the sensor measurements. The current paper considers two or more speakers, who mimic one another in the following sense: when they say the same words, they produce sensor states that are invertibly mapped onto one another. It follows that the inner time series of their utterances must be the same when they say the same words. In other words, the inner time series encodes their speech in a manner that is speaker-independent. Consequently, the ASR training process can be simplified by collecting and labelling the inner time series of the utterances of just one speaker, instead of training on the sensor time series of the utterances of a large variety of speakers. A similar argument suggests that the inner time series of music is instrument-independent. This is demonstrated in experiments on monophonic electronic music.
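The invariance argument in the abstract can be illustrated with a toy example. The sketch below is not the paper's construction of the inner time series, which this summary does not describe. It substitutes a much simpler stand-in, the rank of each sample within its own series, which is unchanged by any strictly increasing (and hence invertible) map of a one-dimensional signal. The names `rank_series`, `speaker_a`, and `speaker_b`, the synthetic data, and the choice of map are all assumptions made for illustration.

```python
import math
import random


def rank_series(samples):
    """Toy stand-in for an 'inner' series (NOT the paper's method):
    replace each sample by its rank within the whole series."""
    order = sorted(range(len(samples)), key=lambda i: samples[i])
    ranks = [0] * len(samples)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


random.seed(0)

# Hypothetical 1-D sensor time series for a first speaker.
speaker_a = [random.gauss(0.0, 1.0) for _ in range(200)]

# A second speaker who "mimics" the first: the same utterance, but every
# sensor state is passed through a strictly increasing, invertible map.
speaker_b = [math.exp(0.5 * s) + 3.0 for s in speaker_a]

inner_a = rank_series(speaker_a)
inner_b = rank_series(speaker_b)

# The raw sensor series differ, but the derived series coincide.
series_match = inner_a == inner_b
print("sensor series identical:", speaker_a == speaker_b)
print("derived series identical:", series_match)
```

In this toy setting, a recognizer trained on `inner_a` would see exactly the same input for the second speaker, which is the sense in which the abstract argues that labelling one speaker's inner time series could suffice. The rank transform handles only monotone one-dimensional maps. The paper's inner time series is described as invariant under any invertible transformation of the sensor measurements.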

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.03278/full.md

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/1905.03278/full.md

## References

4 references — full list in the complete paper: https://tomesphere.com/paper/1905.03278/full.md

---
Source: https://tomesphere.com/paper/1905.03278