Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek
Giuseppe G. A. Celano

TL;DR
Opera Graeca Adnotata (OGA) is an open-access, multilayer annotated corpus of over 34 million tokens of Ancient Greek text, with seven annotation layers ranging from tokenization to dependency syntax and CTS citation.
Contribution
This work introduces the largest open-access multilayer corpus for Ancient Greek, with detailed annotations and scalable formats, enabling advanced linguistic and computational studies.
Findings
Largest open-access multilayer corpus for Ancient Greek, with 34M+ tokens across 1,687 literary works
Seven detailed annotation layers including syntax and morphology
Rule-based and parser-based annotation methods used
Abstract
In this article, the beta version 0.1.0 of Opera Graeca Adnotata (OGA), the largest open-access multilayer corpus for Ancient Greek (AG), is presented. OGA consists of 1,687 literary works and 34M+ tokens coming from the PerseusDL and OpenGreekAndLatin GitHub repositories, which host AG texts ranging from about 800 BCE to about 250 CE. The texts have been enriched with seven annotation layers: (i) tokenization layer; (ii) sentence segmentation layer; (iii) lemmatization layer; (iv) morphological layer; (v) dependency layer; (vi) dependency function layer; (vii) Canonical Text Services (CTS) citation layer. The creation of each layer is described by highlighting the main technical and annotation-related issues encountered. Tokenization, sentence segmentation, and CTS citation are performed by rule-based algorithms, while morphosyntactic annotation is the output of the COMBO parser trained…
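To make the seven-layer design concrete, the following is a minimal Python sketch of how one token might carry all seven layers, with the rule-based layers (tokenization, sentence segmentation, CTS citation) filled in by simple rules. It is an illustration under stated assumptions, not OGA's actual data format or algorithms: the `Token` fields, the punctuation sets, the `tokenize_and_segment` function, and the placeholder URN are all hypothetical. The lemma, morphology, and dependency fields are left empty because the abstract says they come from a trained parser.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed punctuation inventory (not taken from the paper): comma, full stop,
# semicolon / Greek question mark (U+037E), middle dot / ano teleia (U+0387).
PUNCT = ",.;\u00b7\u0387\u037e"
# Assumed sentence-final marks: full stop and the Greek question mark.
SENT_END = {".", ";", "\u037e"}

# A token is either a run of non-space, non-punctuation characters
# or a single punctuation mark.
TOKEN_RE = re.compile(
    "[^\\s" + re.escape(PUNCT) + "]+|[" + re.escape(PUNCT) + "]"
)


@dataclass
class Token:
    form: str                      # (i) tokenization layer
    sentence_id: int               # (ii) sentence segmentation layer
    lemma: Optional[str] = None    # (iii) lemmatization layer (parser output)
    morph: Optional[str] = None    # (iv) morphological layer (parser output)
    head: Optional[int] = None     # (v) dependency layer (parser output)
    deprel: Optional[str] = None   # (vi) dependency function layer (parser output)
    cts_urn: Optional[str] = None  # (vii) CTS citation layer


def tokenize_and_segment(text: str, cts_urn: str) -> List[Token]:
    """Fill the rule-based layers; leave the parser-based layers as None."""
    tokens: List[Token] = []
    sentence_id = 1
    for match in TOKEN_RE.finditer(text):
        form = match.group()
        tokens.append(Token(form=form, sentence_id=sentence_id, cts_urn=cts_urn))
        # A sentence-final mark closes the current sentence.
        if form in SENT_END:
            sentence_id += 1
    return tokens


if __name__ == "__main__":
    # The URN below is a made-up placeholder, not a real CTS identifier.
    for tok in tokenize_and_segment(
        "λόγος ἐστίν. τί ἐστιν;", "urn:cts:greekLit:example.work:1.1"
    ):
        print(tok.sentence_id, tok.form, tok.cts_urn)
```

Keeping every layer on one token record makes it easy to see which layers are rule-based and which are parser-based. A real multilayer corpus may instead store each layer separately and link the layers by token identifiers.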
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing
