SparQLe: Speech Queries to Text Translation Through LLMs
Amirbek Djanibekov, Hanan Aldarmaki

TL;DR
This paper presents SparQLe, a method that connects self-supervised speech representations to instruction-tuned LLMs through a modality adapter trained on English speech data, enabling speech-to-text translation that preserves the semantic content of the input speech.
Contribution
It introduces a modality adapter that aligns extracted self-supervised speech features with an instruction-tuned LLM, so the LLM can be used for speech-to-text translation and other speech understanding tasks.
Findings
Effective preservation of semantic content in speech-to-text translation
Successful integration of self-supervised speech models with instruction-tuned LLMs
Potential for improved multi-modal speech understanding applications
Abstract
With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that combines self-supervised speech representations with instruction-tuned LLMs for speech-to-text translation. The proposed approach leverages a modality adapter to align extracted speech features with instruction-tuned LLMs using English speech data. Our experiments demonstrate that this method effectively preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising approach for various speech understanding applications.
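The abstract does not specify how the modality adapter is built, so the following is only a minimal sketch of the general idea, assuming a common generic design: consecutive speech frames are stacked to shorten the sequence, then linearly projected into the LLM's embedding dimension and concatenated with the embedded instruction. It is not the authors' architecture. Every name here (`stack_frames`, `linear`, `init_adapter`, `adapt`) and every dimension is hypothetical, the weights are random rather than trained, and plain Python lists stand in for tensors.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames, shortening the sequence k-fold."""
    usable = len(frames) - len(frames) % k  # drop the trailing remainder
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_adapter(in_dim, out_dim, seed=0):
    """Randomly initialise the projection (assumption: in practice it is trained)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def adapt(frames, weight, bias, k):
    """Map speech frames to a shorter sequence of LLM-sized embeddings."""
    return [linear(x, weight, bias) for x in stack_frames(frames, k)]


# Toy dimensions (hypothetical): 4-dim speech features, 6-dim LLM embeddings.
SPEECH_DIM, LLM_DIM, STACK = 4, 6, 3

rng = random.Random(1)
# Stand-in for the output of a frozen self-supervised speech encoder.
speech_features = [[rng.gauss(0, 1) for _ in range(SPEECH_DIM)]
                   for _ in range(10)]

weight, bias = init_adapter(SPEECH_DIM * STACK, LLM_DIM)
speech_tokens = adapt(speech_features, weight, bias, STACK)

# Stand-in for the embedded text instruction (e.g. a translation prompt).
prompt_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

# The LLM would consume the adapted speech tokens alongside the instruction.
llm_input = speech_tokens + prompt_embeddings
print(len(speech_tokens), len(speech_tokens[0]))  # prints "3 6"
```

In this sketch the adapter is the only component with parameters to learn: 10 input frames become 3 speech tokens in the LLM's embedding space, and neither the speech encoder nor the LLM is modified.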
Taxonomy
Topics: Natural Language Processing Techniques · Library Science and Information Systems · Mathematics, Computing, and Information Processing
Methods: Adapter · ALIGN
