MeetDot: Videoconferencing with Live Translation Captions
Arkady Arkhangorodsky, Christopher Chu, Scot Fang, Yiqi Huang, Denglin Jiang, Ajay Nagesh, Boliang Zhang, Kevin Knight

TL;DR
MeetDot is a videoconferencing system that provides live translation captions in multiple languages, aiming to improve multilingual communication by integrating ASR and MT with user-friendly features and evaluation tools.
Contribution
The paper introduces MeetDot, a modular, open-source videoconferencing system with real-time translation captions, optimized for low latency and user experience, and includes novel evaluation metrics.
Findings
Supports 4 languages with integrated ASR and MT
Features smooth scrolling and flicker reduction for better user experience
Includes an innovative cross-lingual word-guessing game for system evaluation
Abstract
We present MeetDot, a videoconferencing system with live translation captions overlaid on screen. The system aims to facilitate conversation between people who speak different languages, thereby reducing communication barriers between multilingual participants. Currently, our system supports speech and captions in 4 languages and combines automatic speech recognition (ASR) and machine translation (MT) in a cascade. We use the re-translation strategy to translate the streamed speech, resulting in caption flicker. Additionally, our system has very strict latency requirements to have acceptable call quality. We implement several features to enhance user experience and reduce their cognitive load, such as smooth scrolling captions and reducing caption flicker. The modular architecture allows us to integrate different ASR and MT services in our backend. Our system provides an integrated…
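The re-translation strategy and the flicker it causes can be illustrated with a small sketch. Each time the ASR emits a longer partial transcript, the whole transcript is translated again, so words already on screen may be rewritten. The abstract does not say how MeetDot reduces flicker; the rule below (hiding the last k words of every non-final hypothesis) is a technique known from the re-translation literature and is used here only as an assumption. The names `mask_last_k`, `count_flicker`, `retranslate_stream`, `toy_translate`, and the lookup table `TOY_MT` are hypothetical and stand in for real ASR and MT services.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical lookup table standing in for an MT service. The outputs are
# chosen so that later partial inputs rewrite earlier output words.
TOY_MT: Dict[str, str] = {
    "i": "yo",
    "i like": "me gusta",
    "i like green": "me gusta verde",
    "i like green apples": "me gustan las manzanas verdes",
}

# Growing partial transcripts, as a streaming ASR might emit them.
PARTIALS: List[str] = list(TOY_MT.keys())


def toy_translate(text: str) -> str:
    """Translate a partial transcript by table lookup (illustration only)."""
    return TOY_MT[text]


def mask_last_k(words: List[str], k: int, final: bool) -> List[str]:
    """Hide the last k words of an unstable hypothesis; show all once final."""
    if final or k <= 0:
        return words
    return words[:-k] if len(words) > k else []


def count_flicker(prev: List[str], curr: List[str]) -> int:
    """Count displayed words that the new caption rewrites or removes."""
    common = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        common += 1
    return len(prev) - common


def retranslate_stream(
    partials: List[str], translate: Callable[[str], str], k: int
) -> Tuple[List[str], int]:
    """Re-translate every partial transcript and track total caption flicker."""
    shown: List[str] = []
    captions: List[str] = []
    total_flicker = 0
    for i, text in enumerate(partials):
        final = i == len(partials) - 1
        hypothesis = translate(text).split()  # full re-translation each step
        caption = mask_last_k(hypothesis, k, final)
        total_flicker += count_flicker(shown, caption)
        shown = caption
        captions.append(" ".join(caption))
    return captions, total_flicker


unmasked_captions, unmasked_flicker = retranslate_stream(PARTIALS, toy_translate, k=0)
masked_captions, masked_flicker = retranslate_stream(PARTIALS, toy_translate, k=2)
```

In this toy run, showing every hypothesis in full (k=0) rewrites three displayed words, while masking the last two words (k=2) rewrites none, at the cost of captions appearing later. That trade-off between stability and latency is why the masking width would need tuning in any real system.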
Taxonomy
Topics: Speech and dialogue systems · Video Analysis and Summarization · Multimodal Machine Learning Applications
