OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

Jeongkyun Park; Jung-Wook Hwang; Kwanghee Choi; Seung-Hyun Lee; Jun Hwan Ahn; Rae-Hong Park; Hyung-Min Park

arXiv:2301.06375 · cs.MM · August 29, 2025

TL;DR

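OLKAVS is a publicly available Korean audio-visual speech dataset, the largest among public audio-visual speech datasets. It contains 1,150 hours of transcribed audio from 1,107 speakers, recorded in a studio from nine viewpoints under various noise conditions, and comes with pre-trained baseline models for audio-visual speech recognition and lip reading.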

Contribution

The paper introduces OLKAVS, a large-scale Korean audio-visual speech dataset with multi-view recordings and noise variations, and provides pre-trained baseline models for audio-visual speech recognition and lip reading.

Findings

01 Multi-modal and multi-view training improves performance.

02 OLKAVS enables research in Korean speech and speaker recognition.

03 Baseline models demonstrate the dataset's effectiveness.

Abstract

Inspired by the way humans comprehend speech in a multi-modal manner, various audio-visual datasets have been constructed. However, most existing datasets focus on English, introduce dependencies on various prediction models during dataset preparation, and have only a small number of multi-view videos. To mitigate these limitations, we recently developed the Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset, which is the largest among publicly available audio-visual speech datasets. The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations. We also provide the pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading. We conducted experiments based on the models to verify the effectiveness of multi-modal and multi-view training over uni-modal and…

Code & Models

Repositories

iip-sogang/olkavs-avspeech (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Subtitles and Audiovisual Media