Leveraging Newswire Treebanks for Parsing Conversational Data with Argument Scrambling
Riyaz Ahmad Bhat, Irshad Ahmad Bhat, Dipti Misra Sharma

TL;DR
This paper addresses the challenge of parsing conversational Hindi, a morphologically-rich language with frequent argument scrambling, by augmenting newswire training data with all theoretically possible alternative word orders of each clause. This improves parsing of conversational data by around 9% LAS over the newswire baseline.
Contribution
It introduces a data augmentation method, inspired by Transformational Generative Grammar, that improves dependency parsing of conversational data in morphologically-rich languages.
Findings
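Training on canonical and transformed structures improves performance on conversational data by around 9% LAS over the newswire baseline.
A parser trained only on a newswire treebank is strongly biased towards canonical structures and degrades when applied to conversational data.
Generating alternative word orders from the treebank's kernel structures mitigates this sampling bias.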
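The findings are reported in LAS (labeled attachment score), the fraction of tokens whose predicted head and dependency label both match the gold tree. The snippet below is a minimal sketch of that metric, not the paper's evaluation script; the function name `attachment_scores`, the tuple layout, and the labels in the example are illustrative assumptions.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for one or more sentences flattened into token lists.

    gold, pred: lists of (head_index, relation_label), one entry per token.
    This layout is an assumption made for illustration.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    n = len(gold)
    # UAS counts tokens whose head is correct, regardless of label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head AND label are both correct.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / n, las_hits / n


# Hypothetical three-token example: the last token has the right head
# but the wrong label, so it counts for UAS and not for LAS.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nsubj")]
uas, las = attachment_scores(gold, pred)
```

Because LAS requires the label to match as well as the head, it is never higher than UAS on the same data.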
Abstract
We investigate the problem of parsing conversational data of morphologically-rich languages such as Hindi where argument scrambling occurs frequently. We evaluate a state-of-the-art non-linear transition-based parsing system on a new dataset containing 506 dependency trees for sentences from Bollywood (Hindi) movie scripts and Twitter posts of Hindi monolingual speakers. We show that a dependency parser trained on a newswire treebank is strongly biased towards the canonical structures and degrades when applied to conversational data. Inspired by Transformational Generative Grammar, we mitigate the sampling bias by generating all theoretically possible alternative word orders of a clause from the existing (kernel) structures in the treebank. Training our parser on canonical and transformed structures improves performance on conversational data by around 9% LAS over the baseline newswire…
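The augmentation idea in the abstract, generating alternative word orders of a clause from a kernel structure, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single-rooted clause, treats each direct dependent of the verb together with its subtree as one movable chunk, and keeps the verb clause-final; all of these are simplifying assumptions. The token tuple layout, the function names, the transliterated example sentence, and its labels are hypothetical.

```python
from itertools import permutations


def subtree_ids(tokens, root_id):
    """Return the sorted ids of root_id and all of its descendants."""
    ids = {root_id}
    changed = True
    while changed:
        changed = False
        for tid, _form, head, _rel in tokens:
            if head in ids and tid not in ids:
                ids.add(tid)
                changed = True
    return sorted(ids)


def scramble_clause(tokens):
    """Generate reordered copies of a clause with re-indexed heads.

    tokens: list of (id, form, head, deprel); ids run 1..n, head 0 = root.
    Assumption: argument chunks are permuted and the verb stays final.
    """
    verb_id = next(tid for tid, _f, head, _r in tokens if head == 0)
    arg_ids = [tid for tid, _f, head, _r in tokens if head == verb_id]
    chunks = [subtree_ids(tokens, a) for a in arg_ids]
    by_id = {t[0]: t for t in tokens}
    variants = []
    for order in permutations(chunks):
        new_order = [tid for chunk in order for tid in chunk] + [verb_id]
        # Map old token ids to their new positions; the root marker 0 is fixed.
        remap = {old: new for new, old in enumerate(new_order, start=1)}
        remap[0] = 0
        variants.append([
            (remap[tid], by_id[tid][1], remap[by_id[tid][2]], by_id[tid][3])
            for tid in new_order
        ])
    return variants


# Hypothetical kernel clause: "Ram ne kitaab padhii" (Ram read the book).
kernel = [
    (1, "Ram", 4, "nsubj"),
    (2, "ne", 1, "case"),
    (3, "kitaab", 4, "obj"),
    (4, "padhii", 0, "root"),
]
variants = scramble_clause(kernel)
```

Each variant keeps the same dependency relations as the kernel clause and changes only the linear order, so the output can be added to the training data alongside the original tree. The number of variants grows factorially with the number of arguments, which a real pipeline would need to account for.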
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
