Probing the statistical properties of unknown texts: application to the Voynich Manuscript
Diego R. Amancio, Eduardo G. Altmann, Diego Rybski, Osvaldo N. Oliveira Jr., Luciano da F. Costa

TL;DR
This paper introduces a statistical framework for analyzing unknown texts such as the Voynich Manuscript: it tests whether a text is compatible with a natural language and identifies candidate key-words, all without knowing what the words mean.
Contribution
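It proposes a statistical approach that combines three kinds of measurements (first-order word statistics, complex-network topology, and intermittency) to assess whether a text behaves like natural language, applicable even to undeciphered manuscripts.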
Findings
The Voynich Manuscript is statistically compatible with natural languages
Statistical measurements can distinguish real texts from shuffled versions
Identified candidate key-words for the Voynich Manuscript
Abstract
While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has examined the properties of statistical measurements across different languages and texts. In this study we propose a framework that aims at determining whether a text is compatible with a natural language and which languages are closest to it, without any knowledge of the meaning of the words. The approach is based on three types of statistical measurements: those obtained from first-order statistics of word properties in a text, from the topology of complex networks representing the text, and from intermittency concepts where the text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the…
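To make the "text as a time series" idea concrete, the sketch below measures how bursty a word is and compares that value with shuffled copies of the same text. It is an illustration under stated assumptions, not the authors' implementation: intermittency is taken here to be the coefficient of variation of the gaps between successive occurrences of a word, and the function names (`intermittency`, `shuffled_baseline`, `z_score`) and the number of shuffles are hypothetical choices made for this example.

```python
import random
from statistics import mean, pstdev


def intermittency(tokens, word):
    """Coefficient of variation of the gaps between occurrences of `word`.

    Assumed definition for this sketch: std(gaps) / mean(gaps).
    Returns None when the word occurs too rarely to form two gaps.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    return pstdev(gaps) / mean(gaps)


def shuffled_baseline(tokens, word, trials=200, seed=0):
    """Mean and std of the same statistic over randomly shuffled copies."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        perm = list(tokens)
        rng.shuffle(perm)  # destroys word order, keeps word frequencies
        value = intermittency(perm, word)
        if value is not None:
            values.append(value)
    return mean(values), pstdev(values)


def z_score(tokens, word, trials=200, seed=0):
    """How many baseline standard deviations the real text sits from shuffled text."""
    observed = intermittency(tokens, word)
    if observed is None:
        return None
    mu, sigma = shuffled_baseline(tokens, word, trials, seed)
    return (observed - mu) / sigma
```

A word that clusters in a few passages yields a large positive score, while a word spread evenly through the text scores near zero. Ranking words by such a score is one plausible way to surface candidate key-words without knowing their meaning.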
