TL;DR
This paper introduces quaternion-valued convolutional neural networks for end-to-end speech recognition, leveraging quaternion algebra to process multidimensional features more effectively than traditional real-valued models.
Contribution
It proposes integrating multiple feature views into quaternion CNNs for sequence-to-sequence speech recognition with CTC, demonstrating improved performance with fewer parameters.
Findings
Lower phoneme error rate (PER) on the TIMIT corpus
Fewer learnable parameters than real-valued CNNs
Effective processing of multidimensional speech features
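The parameter saving listed above follows from a standard counting argument for quaternion layers, sketched here under stated assumptions. The helper names and layer sizes are hypothetical and are not taken from the paper, and a dense layer is used for simplicity; the same ratio holds per kernel position in a convolution.

```python
def real_dense_params(in_features: int, out_features: int) -> int:
    """Weights in a real-valued dense layer (bias omitted)."""
    return in_features * out_features


def quaternion_dense_params(in_features: int, out_features: int) -> int:
    """Weights in a quaternion dense layer spanning the same real dimensions.

    Each quaternion weight stores 4 real numbers and links one input
    quaternion (4 real values) to one output quaternion (4 real values).
    """
    if in_features % 4 or out_features % 4:
        raise ValueError("feature sizes must be divisible by 4")
    return 4 * (in_features // 4) * (out_features // 4)


# Hypothetical layer sizes, chosen only for illustration.
n_in, n_out = 160, 256
print(real_dense_params(n_in, n_out))        # 40960
print(quaternion_dense_params(n_in, n_out))  # 10240
```

For equal input and output widths, the quaternion layer stores a quarter of the weights, because the four real components of each weight are reused across all four output components.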
Abstract
Recently, the connectionist temporal classification (CTC) model, coupled with recurrent (RNN) or convolutional (CNN) neural networks, has made it easier to train speech recognition systems in an end-to-end fashion. However, in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency in processing multidimensional inputs as entities, in encoding internal dependencies, and in solving many tasks with fewer learning parameters than real-valued models. This…
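To make the idea of processing grouped features with quaternion algebra concrete, here is a minimal sketch of the Hamilton product, the multiplication rule of that algebra. The `Quaternion` type, the function name, and the way one frame's features are laid out across the four components are illustrative assumptions, not the paper's implementation.

```python
from typing import NamedTuple


class Quaternion(NamedTuple):
    r: float  # real part
    i: float  # first imaginary part
    j: float  # second imaginary part
    k: float  # third imaginary part


def hamilton_product(a: Quaternion, b: Quaternion) -> Quaternion:
    """Multiply two quaternions; note that the product is not commutative."""
    return Quaternion(
        a.r * b.r - a.i * b.i - a.j * b.j - a.k * b.k,
        a.r * b.i + a.i * b.r + a.j * b.k - a.k * b.j,
        a.r * b.j - a.i * b.k + a.j * b.r + a.k * b.i,
        a.r * b.k + a.i * b.j - a.j * b.i + a.k * b.r,
    )


# Assumed layout for one time-frequency bin: a static value with its first
# and second derivatives, and the fourth component left at zero. The values
# are made up for illustration.
x = Quaternion(r=0.5, i=0.1, j=-0.2, k=0.0)

# A hypothetical weight; the identity quaternion leaves the input unchanged.
w = Quaternion(1.0, 0.0, 0.0, 0.0)
print(hamilton_product(w, x) == x)  # True
```

Because every output component mixes all four input components, a single quaternion weight ties the grouped views together instead of treating them as independent real-valued inputs.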
