The Amount of Data Required to Recognize a Writer’s Style Is Consistent Across Different Languages of the World
Boris Ryabko, Nadezhda Savina, Yeshewas Getachew Lulu, Yunfei Han

TL;DR
This study shows that the amount of text needed to identify a writer's style is similar across different languages, including Russian, Amharic, Chinese, and English.
Contribution
The study demonstrates that the data required for author recognition is consistent across diverse language groups.
Findings
The amount of data needed to recognize an author's style is nearly the same across four languages.
The RS-method was successfully applied to fiction texts in Russian, Amharic, Chinese, and English.
The results are relevant to computer science, literary studies, and computational linguistics.
Abstract
In this paper, we apply an information-theoretic method proposed by Ryabko and Savina (therefore called the RS-method), based on the use of data compression, to recognize a writer’s individual style across four languages from different language groups and families. The method was used to study fiction texts in Russian (East Slavic group of languages of the Indo-European language family), Amharic (South Ethiosemitic group of the Semitic language family), Chinese (Sinitic group of the Sino-Tibetan language family) and English (West Germanic language group of the Indo-European language family). It was found that the amount of data necessary for recognizing an author’s style is almost the same for all four languages, i.e., the amount of data is invariant across different language groups. The results obtained are of interest to computer science,…
Taxonomy
Topics: Authorship Attribution and Profiling · Names, Identity, and Discrimination Research · Translation Studies and Practices
1. Introduction
The Concept of the Individual Author’s Style of a Writer
An author’s style is a unique set of features characteristic of a particular writer’s work, which makes their novels recognizable and different from those of other writers [1,2,3]. An author’s style is formed during the creative process and reflects the individuality and worldview of the author [3,4,5,6]. In fiction, the styles of different writers are extremely diverse. For example, one can recall the businesslike and laconic style of Ernest Hemingway, the fussy James Joyce, the sardonically abrupt Kurt Vonnegut [2] or the heavy and cumbersome style of Tolstoy. An author’s style is formed gradually over the course of their life and reflects the evolution of the author [4].
In our work, we used the classification of sources of text variability proposed by the leading mathematician A.N. Kolmogorov, who is also known for his results in the field of information theory [7,8]. He identified the following sources of text variability: content, form and unconscious individual author’s style. Many researchers argue that individual author’s style is a reflection of the writer’s personality [1,2,3,4]. The elements of author’s style are well known. These include vocabulary (use of certain words and expressions), syntax (features of sentence construction), tropes and figures of speech (metaphors, epithets, comparisons, etc.) and composition (arrangement of parts of the work), as well as the general tone and mood of the work [3].
Studying the elements of author’s style requires multi-tasking and is a rather difficult problem. But recognizing the author’s style of a writer without pursuing analysis of style elements is a completely different task. The information-theoretic method proposed by Ryabko & Savina [9,10,11], which we call the RS-method, helps to solve this problem reliably, i.e., with the help of the apparatus of mathematical statistics. It is based on the use of so-called archivers or data compression methods, which, in turn, can be attributed to information theory. The fact is that modern archivers are aimed at finding a variety of patterns in compressed texts, including through using methods such as describing the text with the shortest formal grammar, building dictionaries of minimal volume describing the text and other methods related to artificial intelligence.
An important application of data compression for classification was proposed by P. Vitányi and developed by him and his co-authors in several papers (see [12,13] and the references therein). They used the length of a compressed message as an estimate of its Kolmogorov complexity and, based on this, proposed the so-called normalized compression distance between two different texts. This approach made it possible to classify different human languages, animal species (based on their genomes), computer and biological viruses and some other objects. The main difference between the normalized compression distance and our approach is that the latter is integrated with methods from mathematical statistics, which makes it possible to apply the well-developed tools of that field, including hypothesis testing and numerical measures such as Cramer’s coefficient.
In this study, we applied this method of recognizing author’s style to four languages: English, Russian, Amharic and Chinese. This choice was due to the fact that these languages belong to different language groups and families. An unexpected result was obtained: author’s style was reliably determined in texts across such different languages using the same amount of data, measured in kilobytes and not in the number of letters, symbols or similar units. Thus, we can assert that the amount of data necessary to determine the author’s style of a writer is, in a sense, invariant for all the languages we have considered.
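Because the amount of data is measured in kilobytes, one slice of fixed size holds a different number of written characters in each language. The short sketch below illustrates this under the assumption that the texts are stored as UTF-8 (the encoding is not stated here); the sample words and the helpers `bytes_per_char` and `chars_in_slice` are hypothetical and serve only as an illustration.

```python
# Hypothetical sample words; UTF-8 storage is an assumption, not a fact from the paper.
SAMPLES = {
    "English": "style",
    "Russian": "стиль",
    "Amharic": "አማርኛ",
    "Chinese": "风格",
}


def bytes_per_char(text: str, encoding: str = "utf-8") -> float:
    """Average number of bytes used to store one character of `text`."""
    return len(text.encode(encoding)) / len(text)


def chars_in_slice(text: str, slice_bytes: int = 4096) -> int:
    """Rough number of characters of this kind that fit in one slice."""
    return int(slice_bytes / bytes_per_char(text))


for language, word in SAMPLES.items():
    print(language, bytes_per_char(word), chars_in_slice(word))
```

In UTF-8, Latin letters occupy one byte, Cyrillic letters two, and Ethiopic and Chinese characters three, so real texts (which also contain one-byte spaces and punctuation) will deviate somewhat from these per-word figures.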
2. RS-Method for Recognizing the Author’s Style of a Writer
2.1. The Idea of the Method
The method of recognizing an author’s style is based on the use of algorithms for lossless compression, implemented in the form of so-called archivers. Their purpose is to encode texts in such a way that the length of the encoded message is shorter than the original (the text is compressed) and, if necessary, the encoded text can be decoded into the original. Text data are fed to the archiver as input, and the archiver encodes them into shorter files. Compression occurs because archivers find unevenness in the frequencies of occurrence of letters and words and use hidden patterns based on the theory of formal grammars and the laws of information transmission. Let us briefly describe the scheme of application of the developed method. Consider three texts, T_1_, T_2_ and T_3_, where it is known that T_1_ and T_2_ were generated by different sources of information, I_1_ and I_2_, and T_3_ was generated by either I_1_ or I_2_ (for example, T_1_ is a text in English, T_2_ is in German, and T_3_ is in English or German). Let d be some archiver; if it is applied to some file X, the length of the compressed file is denoted by d(X). First, the texts are concatenated into the pairs T_1_T_3_ and T_2_T_3_, and both pairs are compressed. Then, we separately compress files T_1_ and T_2_, after which we calculate the differences in the lengths of the compressed files: d(T_1_T_3_) − d(T_1_) and similarly d(T_2_T_3_) − d(T_2_). If d(T_1_T_3_) − d(T_1_) is less than d(T_2_T_3_) − d(T_2_), then we conclude that the text T_3_ was generated by the information source I_1_. If d(T_1_T_3_) − d(T_1_) > d(T_2_T_3_) − d(T_2_), then T_3_ was generated by the information source I_2_. This conclusion is due to the fact that the archiver, when compressing later texts, i.e., T_3_, uses the statistical features it found when compressing earlier texts, namely T_1_ or T_2_.
Therefore, the text T_3_ is compressed more effectively after text with the same source of information was compressed before it. The following simple example explains the essence of this method: Let T_1_ be a text in English, T_2_ in German, and (unknown) T_3_ also in English. Then d(T_1_T_3_) − d(T_1_) will be less than d(T_2_T_3_) − d(T_2_) because, in the first case, T_3_ in T_1_T_3_ was compressed after the archiver had been “tuned” to “its” statistics (for example, in the case of texts in English and German, the method works flawlessly with text lengths of several hundred letters for T_1_, T_2_ and T_3_).
This idea was proposed by Teahan [14,15] and was further developed by Ryabko and Savina (RS-method) [9,10,11]. In particular, in [9], this idea was applied to construct a statistical method for classifying texts, allowing one to determine the reliability of the obtained conclusions using mathematical statistics methods. The described scheme was also successfully applied by the authors of this paper to problems of text attribution in [10], where it was experimentally shown that the individual style of an author can be determined quite accurately based on 4 KB of their text (approximately two pages of text in Russian or English). Based on this fact, we will apply the same scheme to solve the problem of recognizing the author’s style of writers of different language groups.
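The comparison scheme of this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses Python's standard `lzma` module in place of the 7-Zip archiver, the function names `compressed_len`, `extra_cost` and `attribute` are hypothetical, and the handling of ties (assigned to the second source) is an assumption, since the text does not discuss that case.

```python
import lzma


def compressed_len(data: bytes) -> int:
    """d(X): length in bytes of the compressed form of X."""
    return len(lzma.compress(data))


def extra_cost(earlier: bytes, later: bytes) -> int:
    """d(T_i T_3) - d(T_i): extra bytes needed to encode `later` after `earlier`."""
    return compressed_len(earlier + later) - compressed_len(earlier)


def attribute(t1: bytes, t2: bytes, t3: bytes) -> int:
    """Return 1 if T_3 is attributed to source I_1, otherwise 2.

    Ties go to source 2; this choice is an assumption of the sketch.
    """
    return 1 if extra_cost(t1, t3) < extra_cost(t2, t3) else 2
```

Calling `attribute(english_text, german_text, unknown_text)` with byte strings returns the index of the source whose text made the unknown text cheaper to encode. Note that the XZ container pads its output to a multiple of four bytes, so very small differences in cost cannot be resolved.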
2.2. Description of the RS-Method for Recognizing the Author’s Style of a Writer
In order to make the description more understandable, we will illustrate it with an example of constructing a method for determining the author’s style of English-language writers. Let N writers and their works T_1_, T_2_, …, T_N_ be given.
Each text T_i_ is represented as two samples, called training (X_i_, i = 1, …, N) and experimental, which, in turn, consist of M parts (slices), which we will denote by Y_ij_, i = 1, …, N; j = 1, …, M.
For the experimental work, we compiled a sample of texts from Beresford, Jerome, Defoe and Locke, N = 4, M = 16. From the works of these authors, we made 4 training samples X_1_, …, X_4_, each 64 KB in size. Then we made test samples: 16 files Y_1j_, j = 1, …, 16, each 4 KB in size, from the works of Beresford, 16 files Y_2j_, j = 1, …, 16, from the works of Defoe, and likewise files Y_3j_ and Y_4j_, j = 1, …, 16, from the works of Jerome and Locke. Then the file Y_1,1_ was successively “compressed” with each of the training samples X_1_, …, X_4_, and it was determined with which of them it was compressed “better” (i.e., d(X_1_ Y_1,1_) − d(X_1_), …, d(X_4_ Y_1,1_) − d(X_4_) were calculated and the i for which d(X_i_ Y_1,1_) − d(X_i_) is minimal was found). All Y_ij_, i = 1, …, 4; j = 1, …, 16, were processed similarly.
Table 1 presents the obtained data for the LZMA archiver, with a training sample (X_i_) of 64 KB and a slice (Y_ij_) of 4 KB.
Let us explain the meaning of these numbers. The 16 in the upper left corner means that all 16 files Y_1j_, j = 1, …, 16, were compressed better with X_1_; in other words, all 16 slices from Defoe’s works were compressed better with the training set of his works. This result shows that D. Defoe’s author’s style is uniquely recognized from a 4 KB slice with a training set of 64 KB. The numbers in the second row mean that out of 16 files Y_2j_, j = 1, …, 16, 14 slices were compressed better with X_2_; that is, 14 slices from Beresford’s works were compressed better with his training set, while 1 slice was more similar to Jerome’s works and 1 slice to Locke’s works. Here, the recognition of the writer’s style is 14 out of 16.
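The split of one author's text into a training sample X_i_ and slices Y_ij_ can be sketched as follows. Several details are assumptions of this sketch rather than facts from the text: 1 KB is taken as 1024 bytes, the text is encoded as UTF-8, the training sample is taken from the start of the text with the slices immediately after it, and cuts are made at byte positions (so a multi-byte character may be split at a boundary). The function name `split_samples` is hypothetical.

```python
def split_samples(text: str, train_kb: int = 64, slice_kb: int = 4,
                  n_slices: int = 16, encoding: str = "utf-8"):
    """Return (training, slices): one training sample and n_slices test slices.

    Assumes 1 KB = 1024 bytes and cuts at byte positions (both are assumptions).
    """
    data = text.encode(encoding)
    train_len = train_kb * 1024
    slice_len = slice_kb * 1024
    needed = train_len + n_slices * slice_len
    if len(data) < needed:
        raise ValueError(f"text has {len(data)} bytes, but {needed} are needed")
    training = data[:train_len]
    test = data[train_len:needed]
    slices = [test[k * slice_len:(k + 1) * slice_len] for k in range(n_slices)]
    return training, slices
```

With the default parameters, a text must contain at least 128 KB: 64 KB for training plus 16 slices of 4 KB.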
We will call the entire process of transition from the source texts T_1_, T_2_, …, T_N_ to the table (of size N × N) the construction of a contingency table, and we will denote the contingency table itself as W(T_1_, T_2_, …, T_N_) or W (depending on the context) and represent this table as follows:

$$W(T_1, T_2, \ldots, T_N) = \begin{pmatrix} t_{1,1} & t_{1,2} & \cdots & t_{1,N} \\ t_{2,1} & t_{2,2} & \cdots & t_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ t_{N,1} & t_{N,2} & \cdots & t_{N,N} \end{pmatrix}$$
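The construction of the contingency table W can be sketched as below. This is an illustration under stated assumptions, not the authors' code: Python's `lzma` module stands in for the 7-Zip archiver, the name `contingency_table` is hypothetical, and ties between equally good training samples are resolved in favor of the lowest index.

```python
import lzma


def compressed_len(data: bytes) -> int:
    """d(X): length in bytes of the compressed form of X."""
    return len(lzma.compress(data))


def contingency_table(training, slices):
    """Build the N x N table W.

    training: list of N byte strings X_i.
    slices:   list of N lists of byte strings Y_ij (row i holds author i's slices).
    W[i][k] counts the slices of author i that were attributed to author k.
    """
    n = len(training)
    base = [compressed_len(x) for x in training]  # d(X_i), computed once
    table = [[0] * n for _ in range(n)]
    for i, author_slices in enumerate(slices):
        for piece in author_slices:
            costs = [compressed_len(x + piece) - b for x, b in zip(training, base)]
            k = costs.index(min(costs))  # ties go to the lowest index (assumption)
            table[i][k] += 1
    return table
```

Each row of the result sums to the number of slices supplied for that author, and a method that recognizes every slice correctly yields a table with nonzero entries only on the main diagonal.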
In addition, for each W table, we calculated the value of Cramer’s coefficient V [16]. Here it should be noted that V is used to assess the relationship, or interdependence; it takes values from zero to one, and a higher value indicates a greater dependence or interrelationship.
We will explain its meaning in more detail together with the contingency table W. As we saw in the example, the numbers in the cells of the contingency table indicate the number of slices whose authorship was attributed to a specific writer. If the method works “correctly”, i.e., it correctly determines the author’s style by the slices, then the values in the table will be concentrated mainly on the main diagonal. Otherwise, when the slices do not reveal the author’s style of the writer, the values in the table will be evenly distributed among different cells related to different writers.
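This effect can be quantified using the Cramer V coefficient [14], which in its standard form is calculated as follows. First, calculate the total number of slices,

$$P = \sum_{i=1}^{N} \sum_{j=1}^{N} t_{i,j},$$

and then calculate the chi-squared statistic

$$\chi^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\left(t_{i,j} - r_i c_j / P\right)^2}{r_i c_j / P}, \qquad r_i = \sum_{j=1}^{N} t_{i,j}, \quad c_j = \sum_{i=1}^{N} t_{i,j},$$

and Cramer’s coefficient

$$V = \sqrt{\frac{\chi^2}{P\,(N - 1)}}.$$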
For Table 1, Cramer’s coefficient V = 0.9.
Note that the Cramer coefficient V = 1 if all nondiagonal elements are equal to 0, and V is equal to 0 if all t_i,j_ are equal.
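For a concrete check of these two boundary cases, Cramer's V can be computed directly from a contingency table. The sketch below uses the textbook chi-squared definition of the coefficient; that this is exactly the variant used here is an assumption, and the function name `cramers_v` is hypothetical.

```python
import math


def cramers_v(table):
    """Cramer's V for a table given as a list of equal-length rows of counts."""
    n_rows, n_cols = len(table), len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:  # cells in an empty row or column contribute nothing
                chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (total * (min(n_rows, n_cols) - 1)))


# Boundary cases for N = 4 authors with 16 slices each.
perfect = [[16 if i == j else 0 for j in range(4)] for i in range(4)]
uniform = [[4] * 4 for _ in range(4)]
print(cramers_v(perfect), cramers_v(uniform))
```

The first table has all slices on the main diagonal and the second spreads them evenly over all cells, matching the two cases described above.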
Now let us turn to the choice of archiver. Many archivers are currently available. We compared the BZIP2, DEFLATE and LZMA archivers on the same sample. LZMA gave the highest Cramer coefficient, so we used this archiver in all subsequent experiments. In our experiments, compression was performed using the 7-Zip archiver; the reference implementation of LZMA was developed in [17]. (We will not describe this in detail, since similar calculations were performed in [11], see 2.3. “Selection of method parameters”.)
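A comparison of this kind can be sketched with the three compressors available in Python's standard library. These are stand-ins for the archivers named above and may differ from the 7-Zip builds in settings and container overhead; the dictionary `ARCHIVERS` and the function `attribute_with` are hypothetical names introduced for the illustration.

```python
import bz2
import lzma
import zlib

# Compressed length d(X) under each compressor (standard-library stand-ins).
ARCHIVERS = {
    "BZIP2": lambda data: len(bz2.compress(data)),
    "DEFLATE": lambda data: len(zlib.compress(data, 9)),
    "LZMA": lambda data: len(lzma.compress(data)),
}


def attribute_with(name: str, training, piece: bytes) -> int:
    """Index of the training sample after which `piece` is cheapest to encode."""
    d = ARCHIVERS[name]
    costs = [d(x + piece) - d(x) for x in training]
    return costs.index(min(costs))  # ties go to the lowest index (assumption)
```

Running `attribute_with` for every slice and every archiver gives one contingency table per archiver, whose Cramer coefficients can then be compared. One property worth keeping in mind is that DEFLATE searches for matches within a 32 KB window, so with a 64 KB training sample only its most recent half is visible while a slice is encoded; this may contribute to the difference between archivers, although the text does not analyze the cause.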
3. Recognizing the Author’s Style of Writers in Different Language Groups
For our study, we selected 4 languages from different language groups belonging to different language families:
English (West Germanic language group of the Indo-European language family);
Amharic (Southern Ethiosemitic group of the Semitic language family);
Russian (East Slavic group of the Indo-European language family);
Chinese (Sinitic group of the Sino-Tibetan language family).
We note that we had already worked with texts in Russian and English in previous studies on determining the quality of translations and attribution of literary texts [10,11]. Therefore, we started with English. For this study, we selected the texts of the following works in English (see Table 2).
From each literary work, text pieces of 64 KB were taken for the training sample and text pieces of 64 KB for the test sample. Each sample was divided into 16 fragments (4 KB slices). Each fragment was added to the training sample in turn; the number of recognized fragments was recorded in the table. The results are presented in Table 3. The writers are presented by numbers.
The table shows that only two writers, Humphry Ward and Olive Schreiner, each had one slice attributed to the style of another writer. In George Eliot’s texts, two fragments out of 16 were attributed to Kipling. All other writers had their author styles recognized absolutely correctly: 16 slices out of 16. The Cramer coefficient is close to 1: V = 0.992.
To study the author style of Russian-speaking writers, 16 literary works by Russian writers of the late 19th–early 20th centuries were selected (see Table 4).
The preprocessing work with the Russian texts was exactly the same as with the English novels: a 64 KB test sample was divided into 16 fragments (slices) of 4 KB. These 16 fragments were added to the training sample one by one for compression. The number of recognized slices was recorded in a table. The results are in Table 5.
The table shows a well-built diagonal consisting of recognized fragments of the author’s styles. However, Valery Bryusov’s texts were recognized in 10 fragments out of 16. This phenomenon has its own explanation. Bryusov is an outstanding Russian poet and the founder of Russian symbolism. His historical novel The Altar of Victory was the first prose work of the outstanding poet. The novel was dedicated to the Roman Empire during the era of its collapse. Apparently, his author’s style had not yet been formed; it contained many imitations and quotations. Bryusov used citations from 34 ancient poetic sources of various lyrical genres, and also accompanied the novel with notes occupying more than 100 of the 400 pages of text.
At the next stage of our research, we turned to the Chinese language. Chinese is a unique language with a rich history. It has features that are not found in other languages. As Chinese language experts note, the Chinese language contains many idioms [18,19]. An idiom is a stable figure of speech used as a single whole, forming a phraseological fusion [19]. An idiom can consist of 1–4 characters. Each of the characters carries its own semantic load, and together they form one figure of speech [18]. An idiom is one indivisible lexical unit. Literary texts contain a large number of idioms. Chinese writing is a logographic writing system in which symbols (logograms) [18,19] represent whole words or morphemes, but not individual sounds and letters [19]. Unlike phonetic writing, each character is assigned not only a sound but also a meaning, so the number of signs in Chinese writing is very large [20]. For our study, we selected literary works written in the official language, Putonghua (Mandarin) (see Table 6).
The texts were processed using the method already tested in English and Russian. We prepared a training sample of 64 KB and a test sample of 64 KB. The test sample was divided into 16 fragments of 4 KB, and each fragment was added in turn to the training sample for compression. The results are presented in Table 7.
As can be seen from the table, the results are very similar to the results of the analyses of texts in Russian and English. The perfectly constructed diagonal shows the recognition of the author’s style of all writers.
The next language chosen for our study was Amharic. Amharic (አማርኛ) is the language of the Amhara people; it belongs to the Semitic family of languages [21]. For many years, Amharic was the official language of Ethiopia; now it has the status of being the working language of the government. About 25 million people in Ethiopia speak Amharic. The language is also widespread among some of the peoples of neighboring states: in Eritrea, Somalia, and Sudan [21]. It should be noted that more than 3 million emigrants speak Amharic outside of Ethiopia in the USA, Canada, Sweden and Israel. Amharic is used in business communications, in government agencies and in education. Newspapers, magazines and books are published in it. The list of literary works selected for the study is presented in Table 8.
Preliminary work with Amharic texts was the same as with other languages presented in the study. Two samples of 64 KB texts were formed: a training sample and a test experimental sample. Both samples were divided into 16 fragments of 4 KB. Then, 4 KB slices from the test sample were added one-by-one to the training sample for compression. After text compression, the results were entered into a table. The results are presented in Table 9.
The results of the study show that the RS-method of recognizing author’s style also works in the Amharic language. The Amharic language is a unique language that has a number of specific features. For example, the Amharic alphabet consists of 28 consonants and 7 vowels, but the writing system has special signs and combinations that bring the number of sounds to 200 [21]. The Amharic alphabet, also known as the Ethiopian script, is a syllabic script in which each sign represents a combination of a consonant and a vowel [22]. Despite the uniqueness and complexity of the Amharic language, Cramer’s coefficient is almost the same as that of the other languages we have considered.
4. Conclusions
The conducted study on the corpora of texts in different languages from four language groups showed that it is quite possible to determine author’s style using the RS-method. The main finding of the study (which was not known before) is the discovery of a new scientific fact: the same amount of data is required to recognize the author’s style of a writer in different languages that are culturally, historically and grammatically distant from each other. A completely natural question arises about the stability of these conclusions given different volumes of the training sample and sizes of the “slice”. It is natural to assume that with an increase in each of these parameters, and with their joint increase, the Cramer coefficient should increase. We deliberately conducted experiments on different sample sizes, similar to the one described above, and the results confirmed this assumption (see Table 10).
As can be seen from the table, the degree of change in the values of the Cramer coefficient remains approximately the same for all the languages considered, which confirms the conclusion that the amount of data required to recognize the author’s style in different languages from different language groups is almost the same or invariant.
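An experiment of this kind, repeating the recognition for several training sample sizes and slice sizes and recording the Cramer coefficient each time, can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: Python's `lzma` module replaces 7-Zip, sizes are given in bytes, slices are taken immediately after the training sample, the textbook chi-squared form of Cramer's V is used, and the function names are hypothetical.

```python
import lzma
import math


def d(data: bytes) -> int:
    """Length in bytes of the compressed data."""
    return len(lzma.compress(data))


def contingency(training, slices):
    """table[i][k]: number of slices of author i attributed to author k."""
    base = [d(x) for x in training]
    n = len(training)
    table = [[0] * n for _ in range(n)]
    for i, group in enumerate(slices):
        for piece in group:
            costs = [d(x + piece) - b for x, b in zip(training, base)]
            table[i][costs.index(min(costs))] += 1
    return table


def cramers_v(table):
    """Cramer's V of a square table of counts (textbook definition)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    chi2 = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (total * (len(table) - 1)))


def sweep(corpora, train_sizes, slice_sizes, n_slices=16):
    """Map (training size, slice size) in bytes to the resulting Cramer's V.

    corpora: list of byte strings, one per author.
    """
    results = {}
    for t in train_sizes:
        for s in slice_sizes:
            if any(len(c) < t + n_slices * s for c in corpora):
                raise ValueError("a corpus is too short for these sizes")
            training = [c[:t] for c in corpora]
            slices = [[c[t + k * s:t + (k + 1) * s] for k in range(n_slices)]
                      for c in corpora]
            results[(t, s)] = cramers_v(contingency(training, slices))
    return results
```

Running `sweep` separately on the corpora of each language, with the same lists of sizes, gives values that can be laid out side by side for comparison across languages.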
Let us now discuss the possible applications of the developed method. Some of them are practical, while others are theoretical and even philosophical in nature.
Among the practical tasks, we will mention the detection of plagiarism and the determination of authorship. Among the theoretical tasks are issues related to artificial intelligence systems being capable of maintaining dialogue with people and/or creating texts on specific topics. An interesting question is whether different artificial intelligence systems have their own authorial style. And if so, is it possible to build an artificial intelligence system without an authorial style (or with a hidden authorial style)? Another related question is whether there is a certain level of complexity needed for a system to be capable of maintaining a dialogue with a human being, above which the system must have its own unique style. Perhaps the approach proposed here could become a tool for investigating such problems.
References
- 1. Holmes D.I. The Analysis of Literary Style—A Review. J. R. Stat. Soc. Ser. A (Gen.) 1985, 148, 328–341. doi:10.2307/2981893
- 2. Ray B. Style: An Introduction to History, Theory, Research, and Pedagogy. University Press of Colorado: Fort Collins, CO, USA; WAC Clearinghouse: Fort Collins, CO, USA, 2015. ISBN 978-1-60235-614-6
- 3. Aquilina M. The Event of Style in Literature. Palgrave Macmillan: London, UK, 2014
- 4. Can F., Patton J.M. Change of writing style with time. Comput. Humanit. 2004, 38, 61–82. doi:10.1023/B:CHUM.0000009225.28847.77
- 5. Zheng R., Li J., Chen H., Huang Z. A framework for authorship identification of online messages: Writing-style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 2006, 57, 378–393. doi:10.1002/asi.20316
- 6. Nguyen T., Dinh D. An Empirical Investigation of Authorial Writing Styles Based on a Vietnamese Corpus. Open J. Mod. Linguist. 2021, 11, 967–982. doi:10.4236/ojml.2021.116075
- 7. Kolmogorov A.N. Three approaches to quantitative definition of information. Probl. Inf. Transm. 1965, 1, 3–11. doi:10.1080/00207166808803030
- 8. Ryabko B., Astola J., Malyutov M. Compression-Based Methods of Statistical Analysis and Prediction of Time Series. Springer: New York, NY, USA, 2016; pp. 122–130
