Natural Language Statistical Features of LSTM-generated Texts
Marco Lippi, Marcelo A Montemurro, Mirko Degli Esposti, Giampaolo Cristadoro

TL;DR
This paper analyzes the statistical properties of texts generated by LSTM networks, comparing them with natural-language texts and with texts generated by Markov models. It finds that LSTM-generated texts come closest to the long-range correlations of natural language at an optimal temperature setting.
Contribution
It provides a comprehensive quantitative analysis of LSTM-generated language, highlighting its ability to replicate long-range correlations similar to natural language, and identifies an optimal generation parameter.
Findings
LSTM texts reproduce long-range correlations similar to natural language.
Word-frequency and entropy measures of LSTM texts are comparable to real language.
An optimal temperature parameter exists that makes LSTM texts most similar to natural language.
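To make the temperature finding concrete, here is a minimal sketch of temperature-scaled sampling. It assumes the temperature is the usual softmax temperature applied to a network's output scores; this page does not spell out the paper's exact formulation. The function names and the example scores are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a temperature.

    Assumption: 'temperature' divides the scores before the softmax,
    which is the common convention for sampling from neural text models.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng=random):
    """Draw one symbol index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0]  # hypothetical output scores for three symbols
    for t in (0.5, 1.0, 2.0):
        print(t, softmax_with_temperature(scores, t))
```

Low temperatures concentrate probability on the highest-scoring symbol and high temperatures flatten the distribution, so an intermediate value that best matches natural text is plausible.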
Abstract
Long Short-Term Memory (LSTM) networks have recently shown remarkable performance in several tasks dealing with natural language generation, such as image captioning or poetry composition. Yet, only few works have analyzed text generated by LSTMs in order to quantitatively evaluate to which extent such artificial texts resemble those generated by humans. We compared the statistical structure of LSTM-generated language to that of written natural language, and to those produced by Markov models of various orders. In particular, we characterized the statistical structure of language by assessing word-frequency statistics, long-range correlations, and entropy measures. Our main finding is that while both LSTM and Markov-generated texts can exhibit features similar to real ones in their word-frequency statistics and entropy measures, LSTM-texts are shown to reproduce long-range correlations…
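As a rough illustration of the word-frequency and entropy measures named in the abstract, the sketch below computes a rank-frequency list and a unigram Shannon entropy from a token list. Both are simplified stand-ins chosen for illustration, not the estimators used in the paper, and the function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent.

    Plotting these counts against their rank on log-log axes is the
    usual way to inspect Zipf-like word-frequency behavior.
    """
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def unigram_entropy(tokens):
    """Shannon entropy (bits per token) of the unigram distribution.

    Assumption: a plain plug-in estimate that ignores word order, so it
    cannot capture the long-range correlations discussed above.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat too".split()
    print(rank_frequency(sample))
    print(unigram_entropy(sample))
```

Running the same two functions on a natural text and on a generated text of similar length gives a first, order-insensitive comparison.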
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
