Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law
Shounak Paul, Arpan Mandal, Pawan Goyal, Saptarshi Ghosh

TL;DR
This paper explores pre-training and fine-tuning Transformer-based legal language models specifically on Indian legal texts, demonstrating improved performance across multiple legal NLP tasks and domains.
Contribution
It continues pre-training two existing legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data and trains a new model from scratch with a vocabulary built from Indian legal text, improving performance across legal domains.
Findings
Improved accuracy on Indian legal NLP tasks.
Enhanced performance on European and UK legal texts.
Effective explainability analysis of models.
Abstract
NLP in the legal domain has seen increasing success with the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. PLMs trained over European and US legal text are available publicly; however, legal text from other domains (countries), such as India, has a lot of distinguishing characteristics. With the rapidly increasing volume of Legal NLP applications in various countries, it has become necessary to pre-train such LMs over legal text of other countries as well. In this work, we attempt to investigate pre-training in the Indian legal domain. We re-train (continue pre-training) two popular legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data, as well as train a model from scratch with a vocabulary based on Indian legal text. We apply these PLMs over three benchmark legal NLP tasks -- Legal Statute Identification from facts, Semantic…
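To make "continue pre-training" more concrete, here is a minimal sketch of how training examples could be built from legal text. It assumes the standard BERT masked-language-modelling objective, which the abstract does not spell out. The function name `mask_tokens`, the toy sentence, and the whitespace tokenisation are illustrative assumptions rather than the authors' pipeline. Each token is also sampled independently, which only approximates BERT's selection of 15% of positions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) using BERT-style 80/10/10 masking.

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random vocabulary token
            else:
                masked.append(tok)                # 10%: leave unchanged
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Toy, made-up sentence in the style of Indian case law (hypothetical data).
tokens = "the appellant was convicted under section 302 of the indian penal code".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=0)
```

Under this assumption, continued pre-training would run the same objective over Indian legal text starting from an existing checkpoint. Training from scratch would additionally build the vocabulary from that text before any masking takes place.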
Taxonomy
Topics: Artificial Intelligence in Law
