Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Tsz Kin Lam, Marco Gaido, Sara Papi, Luisa Bentivogli, Barry Haddow

TL;DR
This paper empirically compares dense feature prepending and cross-attention architectures for speech-to-text tasks, finding no clear advantage of DFP over cross-attention across various configurations and datasets.
Contribution
It provides a controlled, comprehensive comparison of DFP and cross-attention architectures for speech-to-text, trained from scratch on multiple datasets.
Findings
DFP shows no clear advantage over cross-attention in speech-to-text tasks.
The experiments are controlled, with comparable data and parameters across architectures.
The evaluation covers monolingual, bilingual, and multilingual models.
Abstract
Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech, the most common form of communication. The most widespread approach to integrating speech into LLMs is dense feature prepending (DFP), which prepends the projected speech representations to the textual representations, allowing end-to-end training with a speech encoder. This raises questions about the need for a sophisticated speech encoder for DFP and how its performance compares with a standard encoder-decoder (i.e., cross-attention) architecture. We compare DFP and cross-attention under a variety of configurations, such as CTC compression and sequence-level knowledge distillation, on monolingual, bilingual, and multilingual models. To perform a controlled architectural comparison, we train all models from scratch rather than using large…
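To make the architectural contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes toy dimensions and random features. All names (`prepend_speech`, `cross_attend`, `projection`, `d_model`) are hypothetical, and the cross-attention is a single head without learned query/key/value projections. It only illustrates how the two approaches combine speech and text: DFP lengthens the decoder input sequence, while cross-attention keeps the text length and reads speech through attention.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Toy stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def prepend_speech(speech, text, projection):
    """DFP sketch: project speech frames to the text width, then
    concatenate them in front of the text embeddings along time."""
    projected = matmul(speech, projection)   # (T_speech, d_model)
    return projected + text                  # (T_speech + T_text, d_model)

def cross_attend(text, speech):
    """Cross-attention sketch: text positions act as queries over
    speech positions, which serve as both keys and values."""
    d = len(text[0])
    speech_t = [list(col) for col in zip(*speech)]    # (d_model, T_speech)
    scores = matmul(text, speech_t)                   # (T_text, T_speech)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, speech)                    # (T_text, d_model)

# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)
T_speech, T_text, d_speech, d_model = 6, 3, 4, 5
speech = random_matrix(T_speech, d_speech, rng)
text = random_matrix(T_text, d_model, rng)
projection = random_matrix(d_speech, d_model, rng)

dfp_input = prepend_speech(speech, text, projection)
cross_out = cross_attend(text, matmul(speech, projection))

print(len(dfp_input), len(dfp_input[0]))   # 9 5: length is T_speech + T_text
print(len(cross_out), len(cross_out[0]))   # 3 5: length stays at T_text
```

In this sketch the DFP sequence grows with the number of speech frames, which is one reason a length-reduction step such as the CTC compression mentioned in the abstract is a natural configuration to study alongside it.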
Taxonomy
Topics: Speech and dialogue systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
