Fast computation of loudness using a deep neural network
Josef Schlittenlacher, Richard E. Turner, Brian C. J. Moore

TL;DR
This paper presents a deep neural network that predicts instantaneous loudness from sound waveforms, achieving real-time performance with high accuracy by approximating a complex loudness model.
Contribution
The authors develop a DNN that accurately and rapidly predicts loudness, enabling real-time applications and demonstrating the potential of neural networks to simulate complex perceptual models.
Findings
DNN predicts loudness with less than 0.5 phon deviation.
DNN performs over 100,000 computations per second.
Approach can be applied to other perceptual models.
Abstract
The present paper introduces a deep neural network (DNN) for predicting the instantaneous loudness of a sound from its time waveform. The DNN was trained using the output of a more complex model, called the Cambridge loudness model. While a modern PC can perform a few hundred loudness computations per second using the Cambridge loudness model, it can perform more than 100,000 per second using the DNN, allowing real-time calculation of loudness. The root-mean-square deviation between the predictions of instantaneous loudness level using the two models was less than 0.5 phon for unseen types of sound. We think that the general approach of simulating a complex perceptual model by a much faster DNN can be applied to other perceptual models to make them run in real time.
Fast computation of loudness using a deep neural network
Josef Schlittenlacher, Richard E. Turner, Brian C. J. Moore
(University of Cambridge; js2251,ret26,[email protected])
1 Introduction
Accurate models for predicting perceptual attributes of a sound (such as loudness) from its physical characteristics can have a high computational cost, often making it hard or impossible to run them in real time. For example, in the auditory domain, a good model needs to estimate the input to the auditory nerve, which is determined by the excitation pattern in the cochlea. The excitation at a given place in the cochlea is a non-linear function of the sound’s momentary spectrum and depends not only on frequency, but also on level and on interactions between adjacent frequencies. In addition, transformations are needed between various scales, some of which are neither linear nor logarithmic.
One of the most advanced loudness models [1, 2] (see [3] for an overview, or [4] for the most recent update), which we call the Cambridge loudness model, uses the time waveform of a sound as the input and calculates three quantities: (1) Instantaneous loudness, which is the momentary loudness calculated from a given frame of the sound and which is assumed not to be available for conscious perception; (2) Short-term loudness, which is the loudness of a short segment of the sound, such as a single syllable in a sentence; (3) Long-term loudness, which is the overall loudness impression of a longer segment of sound, such as a whole sentence. The most computationally intensive step is the calculation of quantity (1). In the model, the instantaneous loudness is updated every 1 ms, a rate that is necessary to accommodate the temporal resolution of the auditory system [5]. However, on a modern PC it is only possible to calculate instantaneous loudness a few hundred times per second. This means that the Cambridge loudness model cannot be run in real time. Furthermore, it would be desirable to have a computation speed that is much faster than real time, for example when calculating the time-varying loudness of long recordings of sound (sometimes durations of days or weeks are needed to evaluate environmental noise), or when estimating individual model parameters in an active-learning test [6], where as many evaluations as possible are desired within an acceptable inter-trial interval of less than about two seconds.
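As a rough worked example of this timing argument: one update per millisecond means that real-time operation needs 1000 computations per second of audio. The sketch below assumes 300 computations per second for the Cambridge loudness model (an illustrative stand-in for "a few hundred") and uses 100,000 per second for the DNN. The function name is hypothetical.

```python
FRAMES_PER_SECOND_OF_AUDIO = 1000  # instantaneous loudness is updated every 1 ms


def processing_time_s(audio_duration_s, computations_per_second):
    """Seconds of compute needed to process a recording of the given duration."""
    n_frames = audio_duration_s * FRAMES_PER_SECOND_OF_AUDIO
    return n_frames / computations_per_second


one_week_s = 7 * 24 * 3600  # 604,800 s of recorded sound

# Assumed rates: 300/s is illustrative for "a few hundred";
# 100,000/s is the lower bound quoted for the DNN.
slow = processing_time_s(one_week_s, 300)
fast = processing_time_s(one_week_s, 100_000)

print(f"Cambridge model (assumed 300/s): {slow / 86400:.1f} days")
print(f"DNN (100,000/s): {fast / 3600:.2f} hours")
```

Under these assumptions, a one-week recording would take about 23.3 days with the slower rate and about 1.68 hours with the faster one.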
For this reason we developed a deep neural network (DNN) for predicting instantaneous loudness from a given input spectrum, using instantaneous loudness calculated from the Cambridge loudness model as a reference for training. Predicted values were expressed as loudness level in phon; the loudness level of a given sound is defined as the sound pressure level of an equally loud 1-kHz tone presented in free field with frontal incidence. After training, the root-mean-square (RMS) difference between the loudness level predicted by the DNN and by the Cambridge loudness model was less than 0.5 phon for sounds of unseen categories. This error is somewhat below the just noticeable difference. Our implementation in Keras/TensorFlow can calculate instantaneous loudness more than 100,000 times per second on a CPU (i7 6700k).
2 Model
Apart from accuracy, computation speed was the main consideration when designing the DNN. The Cambridge loudness model estimates the short-term spectrum in each frame using six Fourier Transforms in parallel, each being used to estimate the spectrum in a limited frequency region. For the DNN, the input was a simplified spectrum with 61 components covering the frequency range up to 8 kHz (constant-width bins up to 200 Hz, nine bins per octave above 200 Hz). The limit of 8 kHz was chosen due to the limited sampling rate of the training material.
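The paper does not spell out how the 61 components are formed. The sketch below shows one layout that happens to give 61 components, under these assumptions: a 16-kHz sampling rate, a 1024-point DFT, one component per DFT bin below 200 Hz, and power summed in ninth-octave bands above 200 Hz. The function name, the pooling rule and the level floor are hypothetical, and the levels are relative decibels rather than calibrated dB SPL.

```python
import numpy as np

FS = 16_000   # assumed sampling rate in Hz (Nyquist frequency = 8 kHz)
N_FFT = 1024  # DFT length


def simplify_spectrum(power, fs=FS, n_fft=N_FFT, f_split=200.0, bins_per_octave=9):
    """Pool a one-sided power spectrum into constant-width and 1/9-octave bands (dB)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)   # DFT bin centre frequencies
    low = power[freqs < f_split]                 # constant-width part: one DFT bin each
    # Number of 1/9-octave bands needed to reach the Nyquist frequency.
    n_high = int(np.ceil(bins_per_octave * np.log2((fs / 2) / f_split)))
    edges = f_split * 2.0 ** (np.arange(n_high + 1) / bins_per_octave)
    high = np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])
    band_power = np.concatenate([low, high])
    return 10.0 * np.log10(np.maximum(band_power, 1e-12))  # floor avoids log(0)


rng = np.random.default_rng(0)
frame = rng.standard_normal(N_FFT)                           # dummy frame of noise
power = np.abs(np.fft.rfft(frame * np.hanning(N_FFT))) ** 2  # Hann window is an assumption
levels = simplify_spectrum(power)
print(levels.shape)  # (61,)
```

With these settings there are 13 DFT bins below 200 Hz and 48 ninth-octave bands between 200 Hz and 8 kHz, which gives 61 values per frame.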
The output was a single loudness level estimate in phon. This scale was chosen because of its similarity to the input scale, which was measured in decibels. The two scales range roughly from 0 to 100 (between the detection threshold and the point at which sounds become uncomfortably loud), and the just noticeable difference in loudness is roughly constant on these scales. This made it easier for the DNN to develop a mapping from input to output without the need for the scale transformations and summations across frequency that are required in the Cambridge loudness model. Furthermore, the use of the phon scale as output made it possible to use simple ReLU activations [7]. When operating an auditory DNN on other scales, for example the waveform, the combination of sigmoid and hyperbolic tangent can give better results [8].
The DNN was a multilayer perceptron (MLP) that consisted of an input layer with 61 units, three hidden layers with 150 units each, and a single output unit with linear activation. It was optimized with regard to the mean square difference between the DNN and the Cambridge loudness model. The Adam optimizer [9] was used with its default parameters. All weights were initialized randomly.
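To make the size of this network concrete, here is a minimal NumPy sketch of the forward pass and the mean-square objective. It is not the Keras/TensorFlow implementation used in the paper: the weight scaling, the zero biases and all function names are assumptions, and the weights are untrained, so the outputs are meaningless as loudness levels.

```python
import numpy as np

LAYER_SIZES = [61, 150, 150, 150, 1]  # input, three hidden layers, output


def init_params(rng, sizes=LAYER_SIZES):
    """Random weights, zero biases. The He-style scaling is an assumption;
    the text only says that weights were initialized randomly."""
    return [(rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def predict_phon(params, spectra_db):
    """Forward pass: ReLU in the hidden layers, linear output unit."""
    h = np.asarray(spectra_db, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU activation
    w, b = params[-1]
    return (h @ w + b)[:, 0]            # one loudness level per input spectrum


def mse(pred, target):
    """Mean square difference between predictions and reference values."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))


rng = np.random.default_rng(0)
params = init_params(rng)
n_params = sum(w.size + b.size for w, b in params)
batch = rng.uniform(0.0, 100.0, size=(4, 61))  # four dummy input spectra in dB
out = predict_phon(params, batch)
print(n_params, out.shape)  # 54751 (4,)
```

The network has 54,751 trainable parameters, so one prediction costs four small matrix products, which is consistent with the high throughput reported.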
Alternative architectures were also evaluated. Convolutional neural networks did not achieve the same accuracy. A likely reason for this is that the input scale (logarithmic frequency) differs from the ERB-number scale, which is a perceptually relevant frequency scale based on estimates of the bandwidths of the auditory filters [10], and thus filters for low and high frequencies need considerably different shapes.
The training data consisted of three different types of sounds. First, 500,000 spectra were calculated from the LibriSpeech corpus [11], using the “clean” development set. The sounds were scaled to have an overall RMS level of 60 dB SPL. Spectra were calculated every 35 ms (560 samples) using a 1024-point discrete Fourier Transform (DFT). Second, about 700,000 pure tones with levels ranging from -15 to 110 dB SPL and various levels of background noise were generated. Each component of the background noise was at least 10 dB lower than the level of the pure tone. Third, about 500,000 spectra of band-limited noises and noises with spectral notches were generated. They had various overall levels, bandwidths, notch widths and spectral gradients. The DNN was first trained for 220 epochs, then for a further 780 (1000 in total), and then for a further 4000 (5000 in total).
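The level calibration and framing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the waveform is taken to be in pascals, the reference pressure is the standard 20 µPa, the sampling rate of 16 kHz is inferred from 560 samples per 35 ms, and the Hann window and both function names are hypothetical.

```python
import numpy as np

P_REF = 20e-6  # reference pressure for dB SPL, in pascals
FS = 16_000    # implied by 560 samples per 35 ms
HOP = 560      # one spectrum every 35 ms
N_FFT = 1024   # DFT length


def scale_to_spl(x, target_db_spl):
    """Scale a waveform (assumed to be in pascals) to a target overall RMS level."""
    rms = np.sqrt(np.mean(x ** 2))
    target_rms = P_REF * 10.0 ** (target_db_spl / 20.0)
    return x * (target_rms / rms)


def frame_power_spectra(x, n_fft=N_FFT, hop=HOP):
    """One-sided power spectra of windowed frames taken every `hop` samples."""
    n_frames = 1 + (len(x) - n_fft) // hop
    window = np.hanning(n_fft)  # window choice is an assumption
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1)) ** 2


rng = np.random.default_rng(0)
x = scale_to_spl(rng.standard_normal(FS), 60.0)  # 1 s of noise at 60 dB SPL
level = 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) / P_REF)
spectra = frame_power_spectra(x)
print(round(level, 6), spectra.shape)  # 60.0 (27, 513)
```

Because the hop of 560 samples is shorter than the 1024-sample frame, successive frames overlap by 464 samples.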
3 Experiments
Loudness was predicted for two further sets of data from the LibriSpeech corpus, “clean” test and “other” test. Each of them consisted of 500,000 spectra and they were calibrated to have an RMS level of 60 dB SPL. Loudness was also predicted for 250,000 spectra from the ESC corpus [12]. This corpus contains 50 categories of environmental sounds, for example rain, animals, aircraft, keyboard typing or a washing machine. The sounds were again scaled to have an RMS level of 60 dB SPL. Furthermore, loudness was predicted for 100,000 spectra from 20 popular songs of the 1960s, which were scaled to have an RMS level of 70 dB SPL. The predicted loudness distributions are shown in Figure 1. For the speech sets, only results for the “clean” test are shown since the distributions were virtually the same for the “other” test. All loudness calculations of the Cambridge loudness model were based on the 1024-point DFT, while predictions of the DNN were based on the simplified 61-point input spectrum, which in turn was obtained from the DFT spectrum.
Table 1 shows the RMS difference in phon between the predictions of the Cambridge loudness model and the predictions of the DNN, which is referred to as the error. The RMS error for clean speech, 0.27 phon after 1000 epochs, is virtually the same as the training error. The RMS error is also less than 0.5 phon for the “other” speech (which, according to the developers of the corpus, is somewhat noisier), for the music and, most notably, for the environmental sounds. The value of 0.5 phon is similar to or below the just noticeable difference for loudness, i.e. most predictions deviate from the reference by less than the amount a human listener would need to tell them apart.
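The error measure used here is the ordinary RMS difference between two sets of loudness levels. A minimal sketch follows; the function name is hypothetical and the four values are made up for illustration, not taken from the paper.

```python
import numpy as np


def rms_error_phon(reference_phon, predicted_phon):
    """RMS difference between two sets of loudness levels, in phon."""
    diff = np.asarray(predicted_phon, dtype=float) - np.asarray(reference_phon, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Made-up values for illustration only; not data from the paper.
reference = np.array([60.0, 62.5, 58.0, 61.0])  # stand-in for Cambridge model output
predicted = np.array([60.3, 62.1, 58.2, 60.9])  # stand-in for DNN output
err = rms_error_phon(reference, predicted)
print(f"{err:.3f} phon")  # 0.274 phon
```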
Figure 2 shows the predicted loudness level of pure tones in quiet as a function of input sound level. The lowest loudness level predicted by the Cambridge loudness model was limited to 0 phon, since the threshold in quiet corresponds to about 2 phon for a normal-hearing listener. The loudness level is systematically higher for the 3-kHz tone than for the 1-kHz tone because 3 kHz is close to the resonant frequency of the ear canal. The threshold is about 20 dB higher at 100 Hz than it is at 1 kHz, but the difference in loudness decreases with increasing level. All these predicted effects correspond well to loudness judgments obtained from human listeners.
Figure 3 shows the loudness level of bandpass-filtered pink noise centered at 1 kHz, plotted as a function of bandwidth, as predicted by the Cambridge loudness model and the DNN. The predictions of the DNN are a little below those of the Cambridge loudness model, especially for small bandwidths. These deviations are probably due to the fact that the DNN does not sum the loudness density across frequency at any stage, but rather performs a regression from the 61 input levels to the output loudness level. It is of interest, however, that the results predicted by the DNN are more consistent with recent psychophysical results [13]. Note that the sounds used for Figures 2 and 3 were presented to the DNN during training. The predictions for these sounds are shown because the effects of frequency and spectral summation are fundamental aspects of loudness.
4 Conclusions
The predictions for the environmental sounds and music are remarkably accurate given that the DNN was trained using speech and synthetic sounds only. This suggests that the DNN generalizes well to real-world sounds. The predictions for music with slightly higher loudness levels showed that the DNN also works well for levels to which it has been exposed less frequently. Training using pure tones and noises ensured that the effects of level, frequency and spectral loudness summation would be represented adequately, and probably led to better generalization than training solely using speech. Using an adversarial example [14], it might be possible to find spectra for which predictions of the Cambridge loudness model and the DNN deviate more. We leave this for a future study and conclude for now that the DNN generalizes well to a variety of real-world sounds.
In summary, we developed and evaluated a DNN that was trained using the predictions of a computationally more expensive model, the Cambridge loudness model. The gain in computational speed was a factor of more than 100, enabling computation much faster than real time, while the predictions were almost the same. This allows real-time prediction of loudness with accuracy comparable to that of the Cambridge loudness model. The DNN would also be useful for the analysis of large amounts of pre-recorded data. The approach of using a DNN to approximate a perceptual model could readily be extended to searches for individual model parameters in efficient hearing tests. Another extension could be in devices like hearing aids and cochlear implants to allow hearing to be restored more nearly to normal.
Acknowledgments
The work was supported by the Engineering and Physical Sciences Research Council (UK, grant number RG78536).
References
- [1] ISO 532-2, “Acoustics – methods for calculating loudness – part 2: Moore-Glasberg method,” 2016.
- [2] ISO 532-3, “Acoustics – methods for calculating loudness – part 3: Moore-Glasberg-Schlittenlacher method for time varying sounds,” 2019.
- [3] B. C. Moore, “Development and current status of the Cambridge loudness models,” Trends in Hearing, vol. 18, pp. 1–29, 2014.
- [4] B. C. Moore, M. Jervis, L. Harries, and J. Schlittenlacher, “Testing and refining a loudness model for time-varying sounds incorporating binaural inhibition,” The Journal of the Acoustical Society of America, vol. 143, no. 3, pp. 1504–1513, 2018.
- [5] E. Zwicker, “Die Zeitkonstanten (Grenzdauern) des Gehörs (Time constants (characteristic durations) of hearing),” Zeitschrift für Hörgeräte-Akustik, vol. 13, pp. 82–102, 1974.
- [6] J. Schlittenlacher, R. E. Turner, and B. C. Moore, “A hearing-model-based active-learning test for the determination of dead regions,” Trends in Hearing, vol. 22, pp. 1–13, 2018.
- [7] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
- [8] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
