Many uses, many annotations for large speech corpora: Switchboard and TDT as case studies
David Graff, Steven Bird

TL;DR
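This paper examines the challenges of managing diverse annotations in large speech corpora through two case studies, Switchboard and TDT2, and describes a general framework for addressing issues such as the propagation of repairs, consistency of references, and integration of annotations with different formats and levels of detail.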
Contribution
It introduces a general framework for handling multiple, diverse annotations in large speech corpora, demonstrated through two detailed case studies.
Findings
Identified key challenges in propagating repairs and keeping references consistent across annotations.
Described a general framework for integrating annotations that differ in format and level of detail.
Illustrated these issues with case studies on Switchboard and TDT2.
Abstract
This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully.
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
