SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus
Haoxu Wang, Fan Yu, Xian Shi, Yuezhang Wang, Shiliang Zhang, Ming Li

TL;DR
SlideSpeech is a large-scale audio-visual dataset with synchronized slides, enabling new research in multi-modal speech recognition by leveraging slide text and images to improve accuracy.
Contribution
This paper introduces SlideSpeech, a novel large-scale corpus with synchronized slides and transcripts, and proposes baseline methods to incorporate slide text into speech recognition.
Findings
Incorporating slide text improves speech recognition accuracy.
The corpus contains 1,705 videos and over 1,000 hours of audio, including 473 hours of high-quality transcribed speech, together with a significant amount of synchronized slides.
The baseline methods, which use text from the slides as context, demonstrate the potential of multi-modal integration.
Abstract
Multi-modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the utilization of supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos, 1,000+ hours, with 473 hours of high-quality transcribed speech. Moreover, the corpus contains a significant amount of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. Through the…
Taxonomy
Topics: Speech and Audio Processing · Video Analysis and Summarization · Music and Audio Processing
