Creating and Managing a large annotated parallel corpora of Indian languages

Ritesh Kumar; Shiv Bhusan Kaushik; Pinkey Nainwani; Girish Nath Jha

arXiv:2112.01764 · cs.CL · December 6, 2021

TL;DR

This paper discusses the development of a web-based annotation tool for creating and managing large parallel corpora of Indian languages, addressing challenges of consistency, scalability, and distributed collaboration.

Contribution

It introduces ILCIANN, a novel annotation tool designed for POS tagging and corpus management across multiple Indian languages and institutions.

Findings

01

Successful creation of large parallel corpora for 12 Indian languages

02

Effective management of annotation process across diverse users and locations

03

Enhanced consistency and standards in corpus annotation

Abstract

This paper presents the challenges in creating and managing large parallel corpora of 12 major Indian languages (soon to be extended to 23 languages) as part of a major consortium project funded by the Department of Information Technology (DIT), Govt. of India, and running in parallel at 10 different universities in India. To manage the creation and dissemination of these huge corpora efficiently, the web-based annotation tool ILCIANN (Indian Languages Corpora Initiative Annotation Tool) has been developed, along with a reduced stand-alone version. It was developed primarily for POS annotation and for managing corpus annotation by people with differing levels of competence working at locations physically far apart. In order to maintain consistency and standards in the creation of the corpora, it was necessary that everyone works on a…

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification