Coupling Speech Encoders with Downstream Text Models

Ciprian Chelba; Johan Schalkwyk

arXiv:2407.17605·cs.CL·July 26, 2024

Open Access

TL;DR

This paper introduces a modular cascade speech translation model with an 'exporter' layer that aligns speech encoder embeddings with text model embeddings. The resulting model is guaranteed to perform no worse than the 1-best cascade baseline, and gradients can flow from the text model back into the speech components.

Contribution

The novel 'exporter' layer, trained with an L2 loss, couples speech encoders with text models while performing no worse than the 1-best cascade baseline and allowing end-to-end training.

Findings

01. Significant improvement over the 1-best cascade in certain scenarios

02. Performance gain diminishes with incremental MT training

03. Approach applicable to coupling ASR encoders with large language models

Abstract

We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task. Our novel contribution is the use of an "exporter" layer that is trained under L2-loss to ensure a strong match between ASR embeddings and the MT token embeddings for the 1-best sequence. The "exporter" output embeddings are fed directly to the MT model in lieu of 1-best token embeddings, thus guaranteeing that the resulting model performs no worse than the 1-best cascade baseline, while allowing back-propagation gradient to flow from the MT model into the ASR components. The matched-embeddings cascade architecture provides a significant improvement over its 1-best counterpart in scenarios where…
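To make the L2-matching idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes the exporter is a single linear layer and that the ASR side yields exactly one embedding per 1-best token; the paper's actual exporter architecture and alignment are not stated on this page. All names (`exporter_forward`, `mt_embedding_table`, `one_best_tokens`, the dimensions, and the random data) are hypothetical and chosen only for illustration.

```python
import random


def exporter_forward(W, b, asr_embs):
    """Apply a (hypothetical) linear exporter: one MT-sized vector per ASR embedding."""
    return [
        [sum(w * x_j for w, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]
        for x in asr_embs
    ]


def l2_loss(exported, targets):
    """Squared Euclidean distance to the target embeddings, averaged over positions."""
    total = 0.0
    for y, t in zip(exported, targets):
        total += sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, t))
    return total / len(targets)


def sgd_step(W, b, asr_embs, targets, lr):
    """One full-batch gradient-descent step on the L2 loss with respect to W and b."""
    exported = exporter_forward(W, b, asr_embs)
    n = len(targets)
    grad_W = [[0.0] * len(W[0]) for _ in W]
    grad_b = [0.0] * len(b)
    for x, y, t in zip(asr_embs, exported, targets):
        for i in range(len(W)):
            err = 2.0 * (y[i] - t[i]) / n  # d(loss)/d(y_i) for this position
            grad_b[i] += err
            for j in range(len(x)):
                grad_W[i][j] += err * x[j]
    new_W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad_W)]
    new_b = [b_i - lr * g for b_i, g in zip(b, grad_b)]
    return new_W, new_b


# Toy data (assumed sizes; real models use far larger dimensions and vocabularies).
random.seed(0)
D_ASR, D_MT, VOCAB = 4, 3, 5
mt_embedding_table = [
    [random.uniform(-1, 1) for _ in range(D_MT)] for _ in range(VOCAB)
]
one_best_tokens = [2, 0, 4]  # hypothetical 1-best ASR token ids
asr_embs = [[random.uniform(-1, 1) for _ in range(D_ASR)] for _ in one_best_tokens]

# Training targets: the MT token embeddings of the 1-best sequence.
targets = [mt_embedding_table[tok] for tok in one_best_tokens]

W = [[0.0] * D_ASR for _ in range(D_MT)]
b = [0.0] * D_MT
initial_loss = l2_loss(exporter_forward(W, b, asr_embs), targets)
for _ in range(500):
    W, b = sgd_step(W, b, asr_embs, targets, lr=0.1)
final_loss = l2_loss(exporter_forward(W, b, asr_embs), targets)

print(f"L2 loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In this sketch, the closer the exporter output gets to the MT token embeddings, the closer the MT model's input is to what it would receive in the 1-best cascade. Because the exporter is differentiable, a loss computed on the MT side could in principle be back-propagated through it into the ASR components, which is the property the abstract highlights.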

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems