Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation
Akhil Rajeev P, Annarao Kulkarni

TL;DR
Naamah is a large-scale, silver-standard Sanskrit NER dataset built by seeding entities from DBpedia and generating sentences with an LLM, intended for training and benchmarking NER models on classical Sanskrit.
Contribution
The paper presents a methodology that combines DBpedia entity extraction with a 24B-parameter hybrid reasoning LLM to generate a large silver-standard Sanskrit NER dataset, and benchmarks two transformer models on it.
Findings
The Naamah dataset contains 102,942 sentences.
Benchmark results are provided for XLM-RoBERTa and IndicBERTv2.
The approach is designed to produce grammatically natural and diverse synthetic Sanskrit training data.
Abstract
The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition (NER). While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high-quality, silver-standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B-parameter hybrid reasoning model to create grammatically natural and synthetically diverse training data. We utilise this dataset to benchmark two transformer architectures: the massive multilingual XLM-RoBERTa and the parameter-efficient IndicBERTv2.
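To make the idea of seeded, silver-standard labels concrete, the following is a minimal sketch of how token-level BIO tags could be derived when the entities placed in a generated sentence are already known. It is an illustration under stated assumptions, not the authors' pipeline: the function name `bio_tag`, the tag names `PER` and `LOC`, the whitespace tokenisation, and the transliterated example sentence are all hypothetical, and the abstract does not specify the tag set or the labelling procedure.

```python
from typing import List, Tuple


def bio_tag(tokens: List[str], entities: List[Tuple[str, str]]) -> List[str]:
    """Assign BIO labels by exact matching of known entity surface forms.

    `entities` holds (surface form, type) pairs. This is a hypothetical
    format; the paper does not describe its internal representation.
    """
    labels = ["O"] * len(tokens)
    for surface, ent_type in entities:
        ent_tokens = surface.split()
        n = len(ent_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_is_free = all(label == "O" for label in labels[i:i + n])
            # Label only untouched spans, so earlier entities win on overlap.
            if tokens[i:i + n] == ent_tokens and span_is_free:
                labels[i] = f"B-{ent_type}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{ent_type}"
    return labels


# Assumed example sentence (transliterated): "Rama goes to Ayodhya".
tokens = "rāmaḥ ayodhyāṃ gacchati".split()
entities = [("rāmaḥ", "PER"), ("ayodhyāṃ", "LOC")]
labels = bio_tag(tokens, entities)
```

Exact string matching is the simplest possible choice here. Sanskrit inflection and sandhi change the surface form of a name, so a real pipeline would need the generator to report the inflected form it used, or some other alignment step.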
