SemEval 2022 Task 12: Symlink - Linking Mathematical Symbols to their Descriptions
Viet Dac Lai, Amir Pouran Ben Veyseh, Franck Dernoncourt, Thien Huu Nguyen

TL;DR
This paper introduces BehancePR, a human-annotated corpus for punctuation restoration in livestreaming video transcripts, and shows that popular NLP toolkits cannot detect sentence boundaries in such non-punctuated transcripts.
Contribution
The paper contributes the BehancePR corpus and demonstrates that popular NLP toolkits are inadequate for detecting sentence boundaries in non-punctuated livestreaming transcripts.
Findings
Popular NLP toolkits fail to detect sentence boundaries in non-punctuated livestreaming transcripts.
Experiments on the BehancePR corpus demonstrate how challenging punctuation restoration is in this domain.
The results call for more research effort to develop robust models for this setting.
Abstract
Given the increasing number of livestreaming videos, automatic speech recognition and post-processing for livestreaming video transcripts are crucial for efficient data management as well as knowledge mining. A key step in this process is punctuation restoration, which restores fundamental text structures such as phrase and sentence boundaries from the video transcripts. This work presents a new human-annotated corpus, called BehancePR, for punctuation restoration in livestreaming video transcripts. Our experiments on BehancePR demonstrate the challenges of punctuation restoration for this domain. Furthermore, we show that popular natural language processing toolkits are incapable of detecting sentence boundaries on non-punctuated transcripts of livestreaming videos, calling for more research effort to develop robust models for this area.
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
