Building Tamil Treebanks
Kengatharaiyer Sarveswaran

TL;DR
This paper explores methods for creating Tamil treebanks, highlighting manual, grammar-based, and machine learning approaches, and discusses the challenges faced in developing these linguistic resources for NLP applications.
Contribution
It presents a comprehensive overview of three different approaches to building Tamil treebanks and discusses associated challenges and solutions.
Findings
Manual annotation yields high-quality, richly annotated linguistic data, but it is time-consuming and requires linguistic expertise.
Machine learning approaches enable large-scale automated annotation.
Challenges include data quality, linguistic complexity, and resource availability.
Abstract
Treebanks are structured corpora with rich linguistic annotations, and they are important linguistic resources. They are used in Natural Language Processing (NLP) applications, support linguistic analyses, and are essential for training and evaluating various computational models. This paper discusses the creation of Tamil treebanks using three distinct approaches: manual annotation, computational grammars, and machine learning techniques. Manual annotation, though time-consuming and requiring linguistic expertise, ensures high-quality and rich syntactic and semantic information. Computational deep grammars, such as Lexical Functional Grammar (LFG), offer deep linguistic analyses but necessitate significant knowledge of the formalism. Machine learning approaches, utilising off-the-shelf frameworks and tools like Stanza, UDPipe, and UUParser, facilitate the…
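To make the role of a treebank in evaluating computational models concrete, here is a minimal sketch, not the paper's method. It assumes annotations are stored in the CoNLL-U format used by Universal Dependencies tools. The helper names `read_conllu` and `attachment_scores` are hypothetical, and the toy English sentence is illustrative only and is not taken from any Tamil treebank. The sketch compares a predicted parse with a gold parse and reports unlabelled and labelled attachment scores (UAS and LAS).

```python
from typing import List, Tuple

# One token: (id, form, head, deprel)
Token = Tuple[int, str, int, str]

# Toy CoNLL-U data (illustrative only, not from a real treebank).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
GOLD = """\
# text = She reads books
1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_
2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_
3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\t_
"""

# Hypothetical parser output: every head is right, one label is wrong.
PREDICTED = """\
# text = She reads books
1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_
2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_
3\tbooks\tbook\tNOUN\t_\t_\t2\tnsubj\t_\t_
"""


def read_conllu(text: str) -> List[List[Token]]:
    """Parse CoNLL-U text into sentences of (id, form, head, deprel)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        current.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold: List[List[Token]],
                      pred: List[List[Token]]) -> Tuple[float, float]:
    """Return (UAS, LAS) over all tokens; both inputs must align exactly."""
    if len(gold) != len(pred):
        raise ValueError("sentence counts differ")
    total = head_ok = label_ok = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("token counts differ")
        for (_, _, g_head, g_rel), (_, _, p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:
                head_ok += 1          # correct head: counts towards UAS
                if g_rel == p_rel:
                    label_ok += 1     # correct head and label: counts towards LAS
    if total == 0:
        raise ValueError("no tokens to score")
    return head_ok / total, label_ok / total


uas, las = attachment_scores(read_conllu(GOLD), read_conllu(PREDICTED))
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In this toy case every predicted head matches the gold head, so UAS is perfect, while the single wrong relation label lowers LAS. Real evaluations would normally rely on an established evaluation script rather than a hand-rolled scorer like this one.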
Taxonomy
Topics: Language, Linguistics, Cultural Analysis · African history and culture analysis
