Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison

Tsz Kin Lam; Marco Gaido; Sara Papi; Luisa Bentivogli; Barry Haddow

arXiv:2501.02370 · cs.CL · February 11, 2025

TL;DR

This paper empirically compares dense feature prepending (DFP) and cross-attention architectures for speech-to-text tasks and finds no clear advantage of DFP over cross-attention across the configurations and datasets tested.

Contribution

The paper provides a controlled, comprehensive comparison of DFP and cross-attention architectures for speech-to-text, with all models trained from scratch on multiple datasets.

Findings

01. DFP shows no clear advantage over cross-attention in speech-to-text tasks.

02. The experiments are controlled, using comparable data and parameters for both architectures.

03. The evaluation covers monolingual, bilingual, and multilingual models.

Abstract

Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech, the most common form of communication. The most widespread approach to integrating speech into LLMs is dense feature prepending (DFP), which prepends the projected speech representations to the textual representations, allowing end-to-end training with a speech encoder. This raises questions about the need for a sophisticated speech encoder for DFP and how its performance compares with a standard encoder-decoder (i.e., cross-attention) architecture. We compare DFP and cross-attention under a variety of configurations, such as CTC compression and sequence-level knowledge distillation, on monolingual, bilingual, and multilingual models. To perform a controlled architectural comparison, we train all models from scratch rather than using large…
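
The difference between the two architectures named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a single-head, single-layer toy in plain Python, and the function names (`dfp_layer`, `cross_attention_layer`), the fully causal mask over the speech prefix, and the omission of learned query/key/value weights are all assumptions made for illustration. In DFP the projected speech frames are prepended to the text embeddings and one self-attention pass covers the joint sequence; in the encoder-decoder variant the text attends to itself causally and then attends to the speech through a separate cross-attention step.

```python
import math
import random


def linear(x, w):
    """Multiply a [T, d_in] sequence by a [d_in, d_out] weight matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, allowed):
    """Scaled dot-product attention; allowed(i, j) says whether query i may see key j."""
    d = len(keys[0])
    out = []
    for i, q in enumerate(queries):
        idx = [j for j in range(len(keys)) if allowed(i, j)]
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) for j in idx]
        weights = softmax(scores)
        out.append([sum(w * values[j][k] for w, j in zip(weights, idx))
                    for k in range(len(values[0]))])
    return out


def dfp_layer(speech, text, proj):
    """DFP (toy): project speech, prepend it to text, self-attend over the joint sequence."""
    joint = linear(speech, proj) + text          # list concatenation: speech first
    mixed = attend(joint, joint, joint, lambda i, j: j <= i)  # assumed causal mask
    return mixed[len(speech):]                   # keep only the text positions


def cross_attention_layer(encoder_out, text):
    """Encoder-decoder (toy): causal self-attention on text, then cross-attention to speech."""
    self_out = attend(text, text, text, lambda i, j: j <= i)
    return attend(self_out, encoder_out, encoder_out, lambda i, j: True)


random.seed(0)
speech = [[random.random() for _ in range(3)] for _ in range(5)]  # 5 frames, dim 3
text = [[random.random() for _ in range(2)] for _ in range(4)]    # 4 tokens, dim 2
proj = [[random.random() for _ in range(2)] for _ in range(3)]    # hypothetical projection

dfp_out = dfp_layer(speech, text, proj)
xattn_out = cross_attention_layer(linear(speech, proj), text)
print(len(dfp_out), len(xattn_out))  # one output vector per text token in both cases
```

Both functions return one vector per text token. In the DFP sketch, self-attention runs over the speech and text positions together, while in the cross-attention sketch the speech sequence is only ever used as keys and values.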

Taxonomy

Topics: Speech and dialogue systems

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings