How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
Sara Papi, Peter Polak, Ondřej Bojar, Dominik Macháček

TL;DR
This paper critically reviews the current state of real-time speech-to-text translation, highlighting gaps between research and real-world applications, and proposes standardized terminology, analysis, and future directions to improve the field.
Contribution
It introduces a standardized taxonomy for SimulST, analyzes community trends, and provides concrete recommendations to enhance research relevance and practical deployment.
Findings
Identified inconsistencies and limitations in current SimulST research
Proposed a standardized terminology and taxonomy for the field
Offered actionable recommendations for future research and system development
Abstract
Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker's speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. We 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
