Using Multiple Sources of Information for Constraint-Based Morphological Disambiguation
Gokhan Tur

TL;DR
This thesis introduces a constraint-based morphological disambiguation system for complex languages like Turkish, combining handcrafted rules, learned constraints, and statistical data to achieve high accuracy and low ambiguity.
Contribution
It presents a novel multi-source approach that integrates rule-based, learned, and statistical information for morphological disambiguation in agglutinative languages.
Findings
Achieved 96-97% recall and 93-94% precision in disambiguation
Reduced unknown words to below 1% with secondary processing
Attained low ambiguity of about 1.02 to 1.03 parses per token
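The three figures above are related, and a small sketch may make that concrete. The summary does not spell out how the metrics are defined, so the code below assumes the definitions commonly used when evaluating constraint-based disambiguators: recall is the share of tokens whose correct parse survives, precision is the share of surviving parses that are correct, and ambiguity is the average number of surviving parses per token. The `Token` class, the `evaluate` function, and the parse strings are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Token:
    """One token after disambiguation (hypothetical structure)."""
    remaining: Set[str]  # parses the disambiguator left in place
    gold: str            # the single correct parse from a hand-checked key


def evaluate(tokens: List[Token]) -> Dict[str, float]:
    """Compute recall, precision, and ambiguity under the assumed definitions."""
    if not tokens:
        raise ValueError("need at least one token")
    n_tokens = len(tokens)
    n_parses = sum(len(t.remaining) for t in tokens)
    # A token counts as correct if its gold parse was not discarded.
    n_correct = sum(1 for t in tokens if t.gold in t.remaining)
    if n_parses == 0:
        raise ValueError("no parses remain for any token")
    return {
        "recall": n_correct / n_tokens,
        "precision": n_correct / n_parses,
        "ambiguity": n_parses / n_tokens,
    }


# Toy data with made-up parse labels: four tokens, five surviving parses,
# three of the tokens still carry their correct parse.
example = [
    Token(remaining={"noun-reading"}, gold="noun-reading"),
    Token(remaining={"verb-reading", "noun-reading"}, gold="verb-reading"),
    Token(remaining={"adj-reading"}, gold="noun-reading"),
    Token(remaining={"verb-reading"}, gold="verb-reading"),
]
scores = evaluate(example)
```

Under these assumed definitions precision equals recall divided by ambiguity, which agrees with the reported ranges: roughly 96.5% recall at about 1.03 parses per token gives a precision near 93.7%.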
Abstract
This thesis presents a constraint-based morphological disambiguation approach that is applicable to languages with complex morphology, specifically agglutinative languages with productive inflectional and derivational morphological phenomena. For morphologically complex languages like Turkish, automatic morphological disambiguation involves selecting, for each token, the morphological parse(s) with the right set of inflectional and derivational markers. Our system combines corpus-independent hand-crafted constraint rules, constraint rules that are learned via unsupervised learning from a training corpus, and additional statistical information obtained from the corpus to be morphologically disambiguated. The hand-crafted rules are linguistically motivated and tuned to improve precision without sacrificing recall. In certain respects, our approach has been motivated by Brill's recent work, but…
Taxonomy
Topics: Natural Language Processing Techniques · Second Language Acquisition and Learning
