Automated Text Identification Using CNN and Training Dynamics

Claudiu Creanga; Liviu Petrisor Dinu

arXiv:2405.11212·cs.CL·May 21, 2024

TL;DR

This paper analyzes training dynamics of CNNs on text data, revealing sample difficulty regions and showing that selective training on ambiguous samples enhances out-of-distribution generalization.

Contribution

It applies Data Maps to characterize the training samples of the AuTexTification dataset and demonstrates that training on ambiguous examples improves out-of-distribution generalization.

Findings

01. Identification of three sample difficulty regions: easy-to-learn, ambiguous, and hard-to-learn

02. Training on ambiguous samples improves out-of-distribution performance

03. Insights into sample behavior during CNN training on text datasets

Abstract

We used Data Maps to model and characterize the AuTexTification dataset. This provides insights into the behaviour of individual samples during training across epochs (training dynamics). We characterized the samples along three dimensions: confidence, variability and correctness. This reveals three regions: easy-to-learn, ambiguous and hard-to-learn examples. We used a classic CNN architecture and found that training the model only on a subset of ambiguous examples improves the model's out-of-distribution generalization.
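
The three dimensions named in the abstract can be sketched as follows. This is a minimal illustration using the standard Data Maps definitions (confidence is the mean probability assigned to the gold label across epochs, variability is its standard deviation, correctness is the fraction of epochs with a correct prediction), not the authors' code. The function names, the toy probabilities, and the `fraction=0.33` cutoff for the ambiguous subset are assumptions made for illustration; the abstract does not state the subset size used.

```python
import numpy as np

def data_map_stats(true_probs, correct):
    """Per-sample training-dynamics statistics.

    true_probs: array (n_epochs, n_samples), probability the model gave
                to the gold label at the end of each epoch.
    correct:    array (n_epochs, n_samples), True where the prediction
                matched the gold label at that epoch.
    """
    true_probs = np.asarray(true_probs, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    confidence = true_probs.mean(axis=0)   # mean gold-label probability
    variability = true_probs.std(axis=0)   # spread of that probability
    correctness = correct.mean(axis=0)     # fraction of epochs predicted right
    return confidence, variability, correctness

def select_ambiguous(variability, fraction=0.33):
    """Indices of the most variable samples (assumed cutoff, hypothetical)."""
    k = max(1, int(round(fraction * len(variability))))
    return np.argsort(-np.asarray(variability), kind="stable")[:k]

# Toy example: 3 epochs, 3 samples (made-up numbers, binary task).
true_probs = np.array([
    [0.90, 0.5, 0.10],
    [0.95, 0.2, 0.15],
    [0.99, 0.8, 0.05],
])
correct = true_probs > 0.5  # in a binary task, correct iff gold prob > 0.5

confidence, variability, correctness = data_map_stats(true_probs, correct)
ambiguous_idx = select_ambiguous(variability)
print(confidence, variability, correctness, ambiguous_idx)
```

In this toy data the first sample has high confidence and low variability (easy-to-learn), the third has low confidence and low variability (hard-to-learn), and the second fluctuates across epochs, so it is the one selected as ambiguous.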


Taxonomy

Topics: Handwritten Text Recognition Techniques