# Multi-Frame Cross-Entropy Training for Convolutional Neural Networks in Speech Recognition

**Authors:** Tom Sercu, Neil Mallinar

arXiv: 1907.13121 · 2019-08-01

## TL;DR

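This paper proposes Multi-Frame Cross-Entropy (MFCE) training for CNN acoustic models in speech recognition. The network takes as input a part of an utterance long enough that several adjacent frame labels are predicted at once, so each pass receives cross-entropy loss from multiple frames. The authors report large word error rate (WER) improvements on hub5 and rt02 after training on the 2000-hour Switchboard benchmark.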

## Contribution

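It introduces a training method for CNN acoustic models that treats the CNN as a sequence model over variable-length input: the input covers enough of the utterance to predict multiple adjacent labels at once, which drastically increases the label information per pass for a small marginal computational cost.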

## Key findings

- Large WER improvements on hub5 and rt02
- Drastically more label information per pass at small marginal computational cost
- Results obtained after training on the 2000-hour Switchboard benchmark
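
The low-cost claim for multi-frame training can be illustrated with a toy cost model. The sketch below is not the paper's architecture: it assumes a hypothetical stack of unpadded, stride-1 1D convolutions over time, and it counts cost only as the number of output positions computed per layer, ignoring channel widths and the frequency axis. The names `layer_output_lengths` and `relative_cost`, and the defaults of 10 layers with kernel size 3, are illustrative assumptions.

```python
def layer_output_lengths(input_len, num_layers, kernel_size):
    """Output length in time after each unpadded, stride-1 conv layer."""
    lengths = []
    length = input_len
    for _ in range(num_layers):
        # A valid (unpadded) convolution shortens the sequence by kernel_size - 1.
        length -= kernel_size - 1
        if length < 1:
            raise ValueError("input is shorter than the receptive field")
        lengths.append(length)
    return lengths


def relative_cost(num_labels, num_layers=10, kernel_size=3):
    """Cost of one multi-frame pass divided by the cost of one single-frame pass.

    Assumption: cost is the total number of output positions computed
    across all layers of the hypothetical stack.
    """
    receptive_field = num_layers * (kernel_size - 1) + 1
    single = sum(layer_output_lengths(receptive_field, num_layers, kernel_size))
    # Extending the input by num_labels - 1 frames yields num_labels outputs.
    multi = sum(
        layer_output_lengths(receptive_field + num_labels - 1, num_layers, kernel_size)
    )
    return multi / single


if __name__ == "__main__":
    for n in (1, 2, 8, 32):
        print(n, "labels ->", round(relative_cost(n), 2), "x single-frame cost")
```

Under these toy assumptions, predicting 32 adjacent labels in one pass costs about 4.1 times a single-frame pass, rather than 32 times, because neighbouring output frames share most of their intermediate activations.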

## Abstract

We introduce Multi-Frame Cross-Entropy training (MFCE) for convolutional neural network acoustic models. Recognizing that similar to RNNs, CNNs are in nature sequence models that take variable length inputs, we propose to take as input to the CNN a part of an utterance long enough that multiple labels are predicted at once, therefore getting cross-entropy loss signal from multiple adjacent frames. This increases the amount of label information drastically for small marginal computational cost. We show large WER improvements on hub5 and rt02 after training on the 2000-hour Switchboard benchmark.
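
The loss described in the abstract can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: it assumes the network has already produced one vector of class scores per output frame, and it averages the per-frame cross-entropy (whether the paper averages or sums is not stated in this summary). The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's class scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def frame_cross_entropy(logits, label):
    """Cross-entropy for a single frame given its target class index."""
    return -log_softmax(logits)[label]


def multi_frame_cross_entropy(frame_logits, frame_labels):
    """Mean cross-entropy over all adjacent frames predicted in one pass.

    Assumption: averaging over frames; a sum would differ only by a constant
    factor for a fixed number of frames.
    """
    if not frame_logits:
        raise ValueError("need at least one output frame")
    if len(frame_logits) != len(frame_labels):
        raise ValueError("need exactly one label per output frame")
    losses = [
        frame_cross_entropy(logits, label)
        for logits, label in zip(frame_logits, frame_labels)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Three adjacent output frames, four hypothetical acoustic classes.
    logits = [[2.0, 0.1, -1.0, 0.3], [0.2, 1.5, 0.0, -0.4], [0.0, 0.0, 3.0, 0.5]]
    labels = [0, 1, 2]
    print(multi_frame_cross_entropy(logits, labels))
```

With a single output frame this reduces to ordinary frame-level cross-entropy, so the multi-frame loss differs only in how many labels contribute to each pass.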

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.13121/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/1907.13121/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/1907.13121/full.md

---
Source: https://tomesphere.com/paper/1907.13121