A Textless Metric for Speech-to-Speech Comparison
Laurent Besacier, Swen Ribeiro, Olivier Galibert, Ioan Calapodescu

TL;DR
This paper presents a novel textless speech comparison metric using speech2unit encoders, enabling effective evaluation of speech translation without relying on transcriptions, especially useful for low-resource languages.
Contribution
It introduces a simple neural architecture for speech-to-speech comparison that aligns closely with text-based metrics, bypassing the need for ASR transcriptions.
Findings
The proposed metric correlates well with text-based BLEU scores.
ASR-BLEU is shown to be a poor proxy for actual translation quality.
The method is applicable to languages lacking reliable ASR systems.
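As a rough illustration of what "correlates well" means in the first finding, the sketch below computes a Pearson correlation between two lists of per-utterance scores. The choice of Pearson correlation, the function name `pearson`, and the score values are assumptions made for illustration; they are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-utterance scores, invented for this example only.
text_bleu = [0.12, 0.35, 0.50, 0.71, 0.90]
textless_metric = [0.15, 0.30, 0.55, 0.65, 0.88]
print(f"Pearson r = {pearson(text_bleu, textless_metric):.3f}")
```

A value near 1 would indicate that the textless scores rank and scale utterances much like text-based BLEU does.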
Abstract
In this paper, we introduce a new and simple method for comparing speech utterances without relying on text transcripts. Our speech-to-speech comparison metric utilizes state-of-the-art speech2unit encoders like HuBERT to convert speech utterances into discrete acoustic units. We then propose a simple and easily replicable neural architecture that learns a speech-based metric that closely corresponds to its text-based counterpart. This textless metric has numerous potential applications, including evaluating speech-to-speech translation for oral languages or languages without dependable ASR systems, or avoiding the need for ASR transcription altogether. This paper also shows that for speech-to-speech translation evaluation, ASR-BLEU (which consists of automatically transcribing both the speech hypothesis and the reference and computing sentence-level BLEU between the transcripts) is a poor proxy to…
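The scoring step of ASR-BLEU described in the abstract (sentence-level BLEU between two transcripts) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the add-one smoothing on every n-gram order, the whitespace tokenization, the function names, and the example transcripts are all assumptions, and the ASR step that would produce the transcripts is left out.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4, smooth=1.0):
    """Sentence-level BLEU with add-`smooth` smoothing (an assumed choice)."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hypothesis, n)
        ref_counts = ngram_counts(reference, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        log_precision_sum += math.log((matches + smooth) / (total + smooth))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


# Hypothetical transcripts standing in for ASR output of the two utterances.
reference_transcript = "the cat sat on the mat".split()
hypothesis_transcript = "the cat sat on a mat".split()
print(f"BLEU = {sentence_bleu(hypothesis_transcript, reference_transcript):.3f}")
```

In an ASR-BLEU pipeline, both token lists would come from an ASR system, so any transcription error feeds directly into the score.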
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
