A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court

Matteo Marulli; Glauco Panattoni; Marco Bertini

arXiv:2505.08439 · cs.CL · May 14, 2025

TL;DR

This paper presents a comprehensive document processing pipeline that creates an anonymized dataset from Italian Supreme Court judgments, enabling effective topic modeling and analysis of legal themes.

Contribution

The authors developed an integrated pipeline that combines document layout analysis (DLA), OCR, and text anonymization to produce a dataset optimized for legal topic modeling, addressing the lack of public datasets for Italian legal research.

Findings

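01. The document layout analysis (DLA) module achieved a mAP@50 of 0.964 and a mAP@50-95 of 0.800.

02. The OCR detector reached a mAP@50-95 of 0.9022; the TrOCR text recognizer obtained a character error rate of 0.0047 and a word error rate of 0.0248.

03. Compared to OCR-only methods, the dataset improved topic modeling, with a diversity score of 0.6198 and a coherence score of 0.6638.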
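To make the diversity score concrete, here is a minimal sketch of one common definition of topic diversity: the fraction of unique words among the top-k words of all topics. This page does not state which definition the paper uses, so the formula, the function name `topic_diversity`, and the example topics are assumptions for illustration only.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    Assumed definition (not confirmed by the paper): 1.0 means no word is
    shared between topics, values near 0 mean heavily overlapping topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words supplied")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, invented for illustration; not taken from the dataset.
example_topics = [
    ["contratto", "locazione", "canone", "conduttore"],
    ["reato", "pena", "imputato", "condanna"],
    ["contratto", "appalto", "lavori", "committente"],
]

score = topic_diversity(example_topics, top_k=4)
print(f"topic diversity: {score:.4f}")
```

In this toy example only "contratto" appears in more than one topic, so 11 of the 12 top words are unique.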

Abstract

Topic modeling in Italian legal research is hindered by the lack of public datasets, limiting the analysis of legal themes in Supreme Court judgments. To address this, we developed a document processing pipeline that produces an anonymized dataset optimized for topic modeling. The pipeline integrates document layout analysis (YOLOv8x), optical character recognition, and text anonymization. The DLA module achieved a mAP@50 of 0.964 and a mAP@50-95 of 0.800. The OCR detector reached a mAP@50-95 of 0.9022, and the text recognizer (TrOCR) obtained a character error rate of 0.0047 and a word error rate of 0.0248. Compared to OCR-only methods, our dataset improved topic modeling with a diversity score of 0.6198 and a coherence score of 0.6638. We applied BERTopic to extract topics and used large language models to generate labels and summaries. Outputs were evaluated against domain expert…
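The character and word error rates quoted in the abstract are conventionally computed as an edit distance divided by the reference length. The sketch below shows that standard calculation in plain Python. It is not the authors' evaluation code; the function names and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical OCR output with a single wrong character in the last word.
reference = "la corte rigetta il ricorso"
hypothesis = "la corte rigetta il ricorzo"

print(f"CER: {cer(reference, hypothesis):.4f}")
print(f"WER: {wer(reference, hypothesis):.4f}")
```

One wrong character costs a single edit against 27 reference characters but invalidates one of five words, which is why a word error rate is typically higher than the character error rate for the same output.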

Taxonomy

Methods: Document Layout Analysis