ODSQA: Open-domain Spoken Question Answering Dataset

Chia-Hsuan Lee; Shang-Ming Wang; Huan-Cheng Chang; Hung-Yi Lee

arXiv:1808.02280·cs.CL·August 8, 2018

TL;DR

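This paper introduces ODSQA, a spoken question answering (SQA) dataset with more than three thousand questions that the authors believe to be the largest real SQA dataset. It studies how ASR errors degrade machine comprehension of spoken content and how subword units and data augmentation can mitigate that degradation.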

Contribution

The paper releases ODSQA, which to the authors' knowledge is the largest real spoken QA dataset, and investigates two ways to mitigate ASR errors: subword units and data augmentation on text-based QA training examples.

Findings

01 ASR errors severely impact spoken QA performance

02 Subword units improve robustness across models

03 Data augmentation enhances spoken QA accuracy

Abstract

Reading comprehension by machine has been widely studied, but machine comprehension of spoken content is still a less investigated problem. In this paper, we release Open-Domain Spoken Question Answering Dataset (ODSQA) with more than three thousand questions. To the best of our knowledge, this is the largest real SQA dataset. On this dataset, we found that ASR errors have catastrophic impact on SQA. To mitigate the effect of ASR errors, subword units are involved, which brings consistent improvements over all the models. We further found that data augmentation on text-based QA training examples can improve SQA.
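As a rough illustration of why subword units might soften the effect of ASR errors, the sketch below compares a reference word with a hypothetical misrecognized form at the word level and at the subword level. Character trigrams are used here purely as a stand-in; this page does not say which subword units the paper uses, and the function names and the example word pair are assumptions made for illustration.

```python
def char_ngrams(token, n=3):
    """Return the set of character n-grams of a token, with boundary markers.

    Character n-grams are an assumed stand-in for the paper's subword units.
    """
    padded = "<" + token + ">"
    if len(padded) <= n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def word_match(reference, hypothesis):
    """Word-level comparison: any ASR error makes the tokens fully different."""
    return 1.0 if reference == hypothesis else 0.0


def subword_similarity(reference, hypothesis, n=3):
    """Subword-level comparison: shared n-grams survive a partial ASR error."""
    return jaccard(char_ngrams(reference, n), char_ngrams(hypothesis, n))


if __name__ == "__main__":
    reference = "comprehension"
    asr_output = "comprehensions"  # hypothetical ASR error, not from the dataset
    print("word-level match:   ", word_match(reference, asr_output))
    print("subword similarity: ", subword_similarity(reference, asr_output))
```

At the word level the two tokens share nothing, while most of their trigrams overlap. This is only an intuition for the robustness the abstract reports, not a reproduction of the paper's models.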

Code & Models

Repositories

chiahsuan156/ODSQA (Official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques