A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets

Mariana Yukari Noguti, Edduardo Vellasques, Luiz Eduardo Soares Oliveira

arXiv:2409.05972·cs.CL·September 11, 2024

TL;DR

This study evaluates various NLP strategies for legal text classification with limited labeled data, finding that semi-supervised learning with BERT and data augmentation yields the best accuracy in a low-resource legal domain.

Contribution

It demonstrates the effectiveness of semi-supervised learning and data augmentation techniques, particularly UDA with BERT, for legal text classification with small datasets.
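To make the UDA idea concrete, the following is a minimal sketch of the kind of objective it optimizes: a supervised cross-entropy term on the labeled examples plus a consistency term that penalizes disagreement between predictions on an unlabeled text and on an augmented version of it. This is an illustration under stated assumptions, not the paper's implementation: the function names, the weight `lam`, and the toy logits are all hypothetical, and the logits would in practice come from a model such as BERT.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Supervised loss for one labeled example."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def uda_loss(labeled_logits, labels, unlabeled_logits, augmented_logits, lam=1.0):
    """Supervised loss plus a weighted consistency loss (hypothetical helper).

    In a real training loop the prediction on the original unlabeled text
    is usually treated as a fixed target (no gradient flows through it).
    """
    supervised = sum(
        cross_entropy(logits, y) for logits, y in zip(labeled_logits, labels)
    ) / len(labels)
    consistency = sum(
        kl_divergence(softmax(u), softmax(a))
        for u, a in zip(unlabeled_logits, augmented_logits)
    ) / len(unlabeled_logits)
    return supervised + lam * consistency

# Toy logits for a 3-class problem (assumed values, for illustration only).
labeled = [[2.0, 0.5, -1.0]]
labels = [0]
unlabeled = [[0.2, 0.1, 0.0]]
augmented = [[0.0, 0.3, 0.1]]
print(uda_loss(labeled, labels, unlabeled, augmented))
```

When the model predicts the same distribution for a text and its augmented copy, the consistency term is zero and only the supervised term remains, which is why large amounts of unlabeled data can shape the decision boundary without any extra annotation.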

Findings

01

Unsupervised Data Augmentation (UDA) achieved 80.7% accuracy.

02

Classical models like SVM and logistic regression performed well with word2vec embeddings.

03

BERT combined with semi-supervised strategies outperformed other models.
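As a rough illustration of finding 02, one common way to feed word embeddings to a classical classifier is to average the vectors of the words in a document into a single fixed-length feature vector. The sketch below assumes mean pooling, which this page does not confirm as the paper's aggregation method; the `mean_pool` helper and the toy three-dimensional embedding table are hypothetical stand-ins for real word2vec vectors.

```python
def mean_pool(tokens, embeddings, dim):
    """Average the embeddings of known tokens into one document vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy embedding table (assumed values, not real word2vec output).
toy_embeddings = {
    "contract": [0.9, 0.1, 0.0],
    "breach": [0.7, 0.3, 0.1],
    "pollution": [0.0, 0.2, 0.9],
}

# Out-of-vocabulary tokens such as "unknown" are simply skipped.
doc_vector = mean_pool(["contract", "breach", "unknown"], toy_embeddings, dim=3)
print(doc_vector)
```

The resulting vector can then be passed as a feature row to a classifier such as an SVM or logistic regression.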

Abstract

Recent advances in language modelling have significantly decreased the need for labeled data in text classification tasks. Transformer-based models, pre-trained on unlabeled data, can outmatch the performance of models trained from scratch for each task. However, the amount of labeled data needed to fine-tune such models is still considerably high for domains requiring expert-level annotators, like the legal domain. This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data to perform a classification task in the legal area with 50 predefined topics. More specifically, we use the records of demands to a Brazilian Public Prosecutor's Office, aiming to assign each description to one of the subjects, a task that currently demands deep legal knowledge when done manually. The task of optimizing the performance of…

Taxonomy

Methods: Softmax · Layer Normalization · Dropout · Attention Is All You Need · WordPiece · Residual Connection · Attention Dropout · Linear Layer · Multi-Head Attention