BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti, Enes Yavuz Ugan, Alexander Waibel, Jan Niehues

TL;DR
BOOM is a multimodal multilingual lecture system that translates and localizes audio and slides, providing synchronized outputs across text, slides, and speech to enhance accessibility and downstream educational tasks.
Contribution
It introduces an end-to-end multimodal translation system that preserves all lecture modalities, enabling comprehensive multilingual access to educational content.
Findings
Slide-aware transcripts improve summarization accuracy
The system effectively localizes slides with visual preservation
End-to-end approach benefits downstream tasks like question answering
Abstract
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Speech and dialogue systems
