An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them
Creston Brooks, Johannes Haubold, Charlie Cowen-Breen, Jay White, Desmond DeVaul, Frederick Riemenschneider, Karthik Narasimhan, Barbara Graziosi

TL;DR
This paper introduces the first dataset of real errors in premodern Greek texts, enabling evaluation of error detection methods on genuine historical errors, and proposes a new discriminator-based detection approach that outperforms existing methods.
Contribution
It provides a novel annotated dataset of authentic premodern Greek errors and develops a new error detection method that improves detection accuracy over previous techniques.
Findings
The discriminator-based detector outperforms the other evaluated methods by 5% in true positive rate.
Scribal errors are harder to detect than print or digitization errors.
The dataset serves as a benchmark for future error detection research in premodern texts.
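For readers unfamiliar with the metric named in the findings, the following is a minimal sketch of how a true positive rate could be computed from expert labels and a detector's flags. The function name and the toy labels are illustrative assumptions, not the paper's evaluation code or data.

```python
from typing import Sequence


def true_positive_rate(labels: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Fraction of expert-labeled errors that the detector also flagged.

    labels[i]  -- True if the expert marked word i as a real error
    flagged[i] -- True if the detector flagged word i
    """
    if len(labels) != len(flagged):
        raise ValueError("labels and flagged must have the same length")
    positives = sum(1 for y in labels if y)
    if positives == 0:
        raise ValueError("no real errors among the labels")
    hits = sum(1 for y, f in zip(labels, flagged) if y and f)
    return hits / positives


# Toy data, invented for illustration only.
toy_labels = [True, True, False, True, False]
toy_flagged = [True, False, True, True, False]
toy_tpr = true_positive_rate(toy_labels, toy_flagged)  # 2 of 3 real errors caught
```

A true positive rate is only comparable across detectors when they are held to the same false positive budget or threshold policy; how the paper fixes that operating point is not stated in this summary.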
Abstract
As premodern texts are passed down over centuries, errors inevitably accrue. These errors can be challenging to identify, as some have survived undetected for so long precisely because they are so elusive. While prior work has evaluated error detection methods on artificially-generated errors, we introduce the first dataset of real errors in premodern Greek, enabling the evaluation of error detection methods on errors that genuinely accumulated at some stage in the centuries-long copying process. To create this dataset, we use metrics derived from BERT conditionals to sample 1,000 words more likely to contain errors, which are then annotated and labeled by a domain expert as errors or not. We then propose and evaluate new error detection methods and find that our discriminator-based detector outperforms all other methods, improving the true positive rate for classifying real errors by…
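The sampling step described in the abstract can be illustrated with a small sketch. It assumes one plausible metric, the surprisal (negative log probability) of each observed word under a masked language model's conditional distribution; the paper's actual metrics may differ. The names `surprisal_scores`, `top_k_suspects`, and `toy_cond_prob` are hypothetical, and the toy lookup table stands in for a real BERT model.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical interface: probability of tokens[i] given the rest of the
# sentence with position i masked. A real system would query a BERT model.
CondProb = Callable[[List[str], int], float]


def surprisal_scores(tokens: List[str], cond_prob: CondProb) -> List[float]:
    """Score each word by -log P(word | masked context); higher is more suspect."""
    return [-math.log(max(cond_prob(tokens, i), 1e-12)) for i in range(len(tokens))]


def top_k_suspects(
    tokens: List[str], cond_prob: CondProb, k: int
) -> List[Tuple[int, str, float]]:
    """Return the k words with the highest surprisal as (index, word, score)."""
    scores = surprisal_scores(tokens, cond_prob)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [(i, tokens[i], scores[i]) for i in ranked[:k]]


# Toy stand-in for a model: known words get a high probability, anything
# else gets a very low one. The values are invented for illustration.
clean_words = ["μῆνιν", "ἄειδε", "θεά"]
toy_table = {word: 0.5 for word in clean_words}


def toy_cond_prob(tokens: List[str], i: int) -> float:
    return toy_table.get(tokens[i], 0.001)


# The last word is deliberately corrupted so it falls outside the toy table.
observed = clean_words[:2] + ["θεξ"]
suspects = top_k_suspects(observed, toy_cond_prob, k=1)
```

In this toy setup the corrupted word receives the highest score and would be the one forwarded to an annotator. Ranking by a model score and then asking an expert to label the top candidates concentrates annotation effort on words that are more likely to be errors, which is the motivation the abstract gives for sampling this way.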
Taxonomy
Topics: Classical Antiquity Studies
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Dense Connections · WordPiece · Residual Connection · Linear Warmup With Linear Decay · Dropout
