Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts
Paulo J. N. Pinto, Armando J. Pinho, Diogo Pratas

TL;DR
This paper develops interpretable, tree-based machine learning models that combine five categories of linguistic features to date English texts spanning five centuries. It also examines what those features reveal about linguistic evolution and where domain adaptation remains difficult.
Contribution
It introduces a multi-feature, tree-based approach for temporal text classification that outperforms baseline models and offers explainability through SHAP analysis.
Findings
Achieves 76.7% century-level accuracy and 26.1% decade-level accuracy, against random baselines of 20% and 2.3%.
Feature importance analysis highlights distance and lexical structure features as the most informative.
Cross-dataset evaluation shows domain adaptation challenges with accuracy dropping by 26.4%.
Abstract
Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines (20% and 2.3%). Under relaxed temporal precision, performance increases to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy…
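The relaxed-precision figures in the abstract are top-k accuracies: a prediction counts as correct when the true period is among the k highest-scoring classes. The sketch below illustrates that metric and the uniform random baseline it is compared against. It is a minimal illustration, not the authors' evaluation code; the function names `top_k_accuracy` and `uniform_baseline` and the toy scores are assumptions made up for this example.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, true_label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def uniform_baseline(num_classes: int, k: int = 1) -> float:
    """Expected top-k accuracy when classes are guessed uniformly at random."""
    return min(k, num_classes) / num_classes


# Hypothetical per-class scores for four texts over five century classes.
toy_scores = [
    [0.60, 0.25, 0.10, 0.03, 0.02],  # true class 0: top-1 hit
    [0.30, 0.40, 0.20, 0.05, 0.05],  # true class 0: top-2 hit only
    [0.10, 0.10, 0.50, 0.20, 0.10],  # true class 4: miss at k=1 and k=2
    [0.05, 0.05, 0.10, 0.70, 0.10],  # true class 3: top-1 hit
]
toy_labels = [0, 0, 4, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
century_chance = uniform_baseline(5)  # five century classes
```

With five century classes, `uniform_baseline(5)` gives 0.2, which matches the 20% random baseline quoted for century-scale prediction. The same helper shows why top-k results need their own reference point: guessing two of five classes at random already succeeds 40% of the time.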
Taxonomy
Topics: Language and cultural evolution · Computational and Text Analysis Methods · Authorship Attribution and Profiling
