TL;DR
This paper introduces H3-Conformer, a hybrid model that combines the Hungry Hungry Hippos (H3) state-space model with multi-head self-attention (MHSA) to improve the efficiency and robustness of long-form speech recognition, outperforming traditional Transformer-based models on the CSJ and LibriSpeech datasets.
Contribution
The paper proposes a novel hybrid H3-Conformer model that replaces or complements self-attention with H3 for better long-form speech processing, including a parallel H3-MHSA configuration.
Findings
H3-Conformer achieves efficient and robust long-form speech recognition.
A hybrid that uses H3 in higher layers and MHSA in lower layers significantly improves online recognition.
Parallel H3 and MHSA use yields the best results.
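The layer placement reported in the abstract (MHSA in lower layers, H3 in higher layers) can be pictured as a simple per-layer plan. This is only an illustrative sketch: the function name `hybrid_layer_plan`, the layer count, and the split point are hypothetical and not taken from the paper, and index 0 is assumed to be the layer closest to the input.

```python
from typing import List


def hybrid_layer_plan(num_layers: int, num_mhsa_lower: int) -> List[str]:
    """Return the sequence mixer used in each encoder layer, bottom to top.

    Index 0 is assumed to be the lowest layer (closest to the input).
    The lowest `num_mhsa_lower` layers use MHSA; the remaining, higher
    layers use H3. Both arguments are hypothetical illustration values.
    """
    if not 0 <= num_mhsa_lower <= num_layers:
        raise ValueError("num_mhsa_lower must be between 0 and num_layers")
    return ["mhsa"] * num_mhsa_lower + ["h3"] * (num_layers - num_mhsa_lower)


if __name__ == "__main__":
    # Hypothetical 12-layer encoder with MHSA in the lowest 6 layers.
    for index, mixer in enumerate(hybrid_layer_plan(12, 6)):
        print(index, mixer)
```

Setting `num_mhsa_lower` to 0 gives an H3-only stack and setting it to `num_layers` gives an MHSA-only stack, so the same helper covers the "replace" and "complement" cases at the level of layer assignment. It does not model the parallel H3-MHSA configuration, whose combination rule is not described in this summary.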
Abstract
Recently, Conformer has achieved state-of-the-art performance in many speech recognition tasks. However, Transformer-based models show significant deterioration for long-form speech, such as lectures, because the self-attention mechanism becomes unreliable and its computation grows with the square of the input length. To solve this problem, we incorporate a kind of state-space model, Hungry Hungry Hippos (H3), to replace or complement the multi-head self-attention (MHSA). H3 allows for efficient modeling of long-form sequences with linear-order computation. In experiments on two datasets, CSJ and LibriSpeech, our proposed H3-Conformer model performs efficient and robust recognition of long-form speech. Moreover, we propose a hybrid of H3 and MHSA and show that using H3 in higher layers and MHSA in lower layers provides significant improvement in online recognition. We also…
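To make the square-order versus linear-order contrast concrete, here is a minimal Python sketch. It is not the H3 layer from the paper: the recurrence is a single-channel diagonal state-space scan with a fixed, hypothetical `decay`, and the function names and operation counts are illustrative assumptions that ignore constant factors.

```python
from typing import List


def attention_score_ops(length: int, dim: int) -> int:
    """Multiply-adds needed to form the length x length score matrix Q K^T."""
    return length * length * dim


def ssm_scan_ops(length: int, dim: int) -> int:
    """Multiply-adds for one diagonal recurrence step per frame and channel."""
    return length * dim


def diagonal_ssm_scan(xs: List[float], decay: float = 0.9) -> List[float]:
    """Run h_t = decay * h_{t-1} + x_t over a 1-D sequence in one pass.

    A simplified stand-in for a state-space layer: each output depends on
    all earlier inputs, yet the cost is one update per input frame.
    """
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


if __name__ == "__main__":
    dim = 64  # hypothetical feature dimension
    for length in (1_000, 2_000, 4_000):
        print(length, attention_score_ops(length, dim), ssm_scan_ops(length, dim))
```

Doubling the input length quadruples the attention score count but only doubles the scan count, which is the gap that matters for lecture-length inputs.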
