Coupling Speech Encoders with Downstream Text Models
Ciprian Chelba, Johan Schalkwyk

TL;DR
This paper introduces a modular cascade speech translation model with an 'exporter' layer that aligns ASR encoder embeddings with MT token embeddings, so the model performs no worse than the 1-best cascade baseline while letting gradients flow from the MT model back into the ASR components.
Contribution
The novel 'exporter' layer trained with L2-loss allows coupling speech encoders with text models while maintaining baseline performance and facilitating end-to-end training.
Findings
Significant improvement over 1-best cascade in certain scenarios
Performance gain diminishes with incremental MT training
Approach applicable to coupling ASR encoders with large language models
Abstract
We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task. Our novel contribution is the use of an 'exporter' layer that is trained under L2-loss to ensure a strong match between ASR embeddings and the MT token embeddings for the 1-best sequence. The 'exporter' output embeddings are fed directly to the MT model in lieu of 1-best token embeddings, thus guaranteeing that the resulting model performs no worse than the 1-best cascade baseline, while allowing back-propagation gradient to flow from the MT model into the ASR components. The matched-embeddings cascade architecture provides a significant improvement over its 1-best counterpart in scenarios where…
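The exporter idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the exporter is modeled as a single linear layer, the ASR embeddings are assumed to be already aligned one-to-one with the 1-best tokens, and the names (export, l2_loss, gradient_step, mt_embedding_table) and all numeric values are hypothetical. The sketch only shows how an L2 loss pulls exported embeddings toward the MT token embeddings of the 1-best sequence.

```python
# Hypothetical toy setup: 3-dim ASR embeddings, 2-dim MT token embeddings.
# Assumes one ASR embedding per 1-best token (alignment is taken as given).

def export(asr_states, weight, bias):
    """Linear 'exporter': project each ASR embedding into the MT embedding space."""
    return [
        [sum(w * x for w, x in zip(row, state)) + b for row, b in zip(weight, bias)]
        for state in asr_states
    ]

def l2_loss(exported, targets):
    """Mean squared L2 distance between exported and target MT embeddings."""
    total = 0.0
    for e, t in zip(exported, targets):
        total += sum((ei - ti) ** 2 for ei, ti in zip(e, t))
    return total / len(exported)

def gradient_step(asr_states, targets, weight, bias, lr=0.1):
    """One full-batch gradient descent step on the L2 loss (updates in place)."""
    exported = export(asr_states, weight, bias)  # computed before any update
    n = len(asr_states)
    for j in range(len(weight)):
        for e, t, x in zip(exported, targets, asr_states):
            g = 2.0 * (e[j] - t[j]) / n
            bias[j] -= lr * g
            for i in range(len(x)):
                weight[j][i] -= lr * g * x[i]

# Illustrative data only.
asr_states = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0]]
mt_embedding_table = {"a": [0.2, -0.1], "b": [-0.3, 0.4], "c": [0.5, 0.5]}
one_best = ["a", "b", "c"]
targets = [mt_embedding_table[tok] for tok in one_best]

weight = [[0.0] * 3 for _ in range(2)]
bias = [0.0, 0.0]

initial_loss = l2_loss(export(asr_states, weight, bias), targets)
for _ in range(200):
    gradient_step(asr_states, targets, weight, bias)
exported = export(asr_states, weight, bias)
final_loss = l2_loss(exported, targets)

# In the coupled model, `exported` would be fed to the MT model in place of
# `targets` (the 1-best token embeddings).
print(initial_loss, final_loss)
```

In this toy setting, the closer the exported embeddings are to the 1-best token embeddings, the closer the MT model's input is to what it would receive in the 1-best cascade. Because the exporter is differentiable, a loss computed on the MT output could in principle also be back-propagated through it into the ASR side.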
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
