Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts

Thibault Clérice (ALMAnaCH; CJM)

arXiv:2309.14974·cs.CL·March 27, 2024

TL;DR

This paper explores deep learning techniques for sentence-level classification of sexual content in Latin texts, creating a new corpus and demonstrating effective model performance with insights into model interpretability.

Contribution

It introduces a novel Latin corpus for sexual semantics and evaluates deep learning models, highlighting their effectiveness and the impact of metadata and dataset size on performance.

Findings

01

Deep learning models outperform token-based searches.

02

Metadata embeddings can cause overfitting.

03

The HAN model achieves 70.60% precision and an 86.33% true positive rate (TPR).

Abstract

In this study, we propose to evaluate the use of deep learning methods for semantic classification at the sentence level to accelerate the process of corpus building in the field of humanities and linguistics, a traditional and time-consuming task. We introduce a novel corpus comprising around 2500 sentences spanning from 300 BCE to 900 CE including sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (centuries, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach, achieving high precision and true positive rates (TPR) of respectively 70.60% and 86.33% using HAN. We evaluate the impact of…
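The abstract compares classifiers against a simple token-based search and reports precision and TPR. As a rough illustration only, the sketch below shows what such a baseline and those two metrics might look like. The lexicon, sentences, and labels are toy placeholders invented for this example, not data from the paper's corpus, and the function names `token_search` and `precision_and_tpr` are hypothetical rather than taken from the paper's code.

```python
def token_search(sentence: str, lexicon: set[str]) -> bool:
    """Flag a sentence if any of its tokens exactly matches a lexicon entry."""
    tokens = (tok.strip(".,;:!?").lower() for tok in sentence.split())
    return any(tok in lexicon for tok in tokens)


def precision_and_tpr(gold: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Return (precision, TPR); TPR is the true positive rate, i.e. recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g and p for g, p in pairs)        # flagged and truly positive
    fp = sum((not g) and p for g, p in pairs)  # flagged but negative
    fn = sum(g and (not p) for g, p in pairs)  # positive but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return precision, tpr


# Toy placeholders (assumption): a two-word lexicon and four labelled sentences.
lexicon = {"basium", "amor"}
sentences = [
    "da mi basia mille",        # inflected form "basia" is not in the lexicon
    "amor vincit omnia",
    "arma virumque cano",
    "gallia est omnis divisa",
]
gold = [True, True, False, False]

predicted = [token_search(s, lexicon) for s in sentences]
precision, tpr = precision_and_tpr(gold, predicted)
```

On this toy data the exact-match search misses the inflected form in the first sentence, so its TPR drops even though its precision stays high. Inflection is one plausible reason a plain token search can miss positives in Latin.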

Code & Models

Repositories

lascivaroma/seligator (PyTorch, official)

Taxonomy

Topics: Authorship Attribution and Profiling · Computational and Text Analysis Methods