Text clustering applied to data augmentation in legal contexts

Lucas José Gonçalves Freitas; Thaís Rodrigues; Guilherme Rodrigues; Pamella Edokawa; Ariane Farias

arXiv:2404.08683·cs.CL·April 16, 2024·1 cites

TL;DR

This paper demonstrates how clustering-based data augmentation using NLP techniques significantly improves legal text classification accuracy, especially for unclassified texts, with notable performance gains in SDG-related datasets.

Contribution

It introduces a clustering-based data augmentation method tailored for legal texts, enhancing classification performance without extensive manual labeling.

Findings

01 Performance gains of over 15% in classification for certain SDGs

02 Example base expanded by a factor of 5 in some cases

03 Effective augmentation for unclassified legal texts

Abstract

Data analysis and machine learning are of preeminent importance in the legal domain, especially in tasks like clustering and text classification. In this study, we harnessed the power of natural language processing tools to enhance datasets meticulously curated by experts. This process significantly improved the classification workflow for legal texts using machine learning techniques. We considered the Sustainable Development Goals (SDGs) data from the United Nations 2030 Agenda as a practical case study. The clustering-based data augmentation strategy led to remarkable enhancements in the accuracy and sensitivity metrics of classification models. For certain SDGs within the 2030 Agenda, we observed performance gains of over 15%. In some cases, the example base expanded by a noteworthy factor of 5. When dealing with unclassified legal texts, data augmentation strategies centered around…
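
The abstract does not say which features or clustering algorithm the authors used, so the following is only a minimal sketch of the general idea of clustering-based augmentation, not the paper's pipeline. It assumes bag-of-words counts, a small cosine-similarity k-means seeded with one labeled example per class, and a hypothetical `min_purity` threshold: unlabeled texts inherit a label only when the expert-labeled texts in their cluster mostly agree. All function names and the toy texts are illustrative.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts
    return Counter({t: c / len(vectors) for t, c in total.items()})


def kmeans(vectors, seed_indices, iterations=10):
    """Cosine k-means; the seed documents serve as initial centroids."""
    centroids = [vectors[i] for i in seed_indices]
    assignment = []
    for _ in range(iterations):
        assignment = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = mean_vector(members)
    return assignment


def augment(labeled, unlabeled, min_purity=0.8):
    """Label unlabeled texts that share a cluster dominated by one label."""
    texts = [t for t, _ in labeled] + list(unlabeled)
    vectors = [vectorize(t) for t in texts]
    # Assumption: seed one cluster per class with its first labeled example.
    seeds, seen = [], set()
    for i, (_, label) in enumerate(labeled):
        if label not in seen:
            seen.add(label)
            seeds.append(i)
    assignment = kmeans(vectors, seeds)
    new_examples = []
    for k in sorted(set(assignment)):
        cluster_labels = [
            labeled[i][1] for i in range(len(labeled)) if assignment[i] == k
        ]
        if not cluster_labels:
            continue  # no expert label in this cluster, nothing to propagate
        label, count = Counter(cluster_labels).most_common(1)[0]
        if count / len(cluster_labels) >= min_purity:
            new_examples += [
                (texts[i], label)
                for i in range(len(labeled), len(texts))
                if assignment[i] == k
            ]
    return new_examples


# Toy data, invented for illustration only.
labeled = [
    ("clean water sanitation access", "SDG6"),
    ("quality education schools teachers", "SDG4"),
]
unlabeled = [
    "water sanitation services in rural areas",
    "teachers training for public schools",
    "education quality in schools",
]
new_examples = augment(labeled, unlabeled)
```

The pairs in `new_examples` can be appended to the expert-labeled set before training a classifier. Raising `min_purity` trades fewer new examples for less label noise; a real system would likely use richer text representations than raw term counts.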

Taxonomy

Topics: Data Quality and Management

Methods: Balanced Selection