Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution

Eitan Adam Pechenick; Christopher M. Danforth; Peter Sheridan Dodds

arXiv:1501.00960·physics.soc-ph·May 28, 2020

TL;DR

This paper critically examines the limitations of the Google Books corpus for studying cultural and linguistic evolution, highlighting biases introduced by prolific authors and scientific texts that distort frequency-based inferences.

Contribution

It reveals how scientific texts and prolific authors skew Google Books data, emphasizing the need for careful characterization before using it to study cultural trends.

Findings

01. Scientific texts make up a growing share of the corpus over time.

02. Of the data sets examined, only the English Fiction data set from the second version is not heavily affected by professional texts.

03. Caution is needed when interpreting frequency trends as cultural indicators.

Abstract

It is tempting to treat frequency trends from the Google Books data sets as indicators of the "true" popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an…
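
The "one of each book" effect described in the abstract can be illustrated with a toy sketch. Everything here is hypothetical: the book list, the made-up word `zorble`, and the `copies_read` weights are invented for illustration, and readership weighting is only an assumed stand-in for "true" popularity, not a method from the paper.

```python
from collections import Counter

# Hypothetical toy library: (author, copies_read, text).
# None of these values come from the paper.
books = [
    ("prolific_author", 10, "the zorble device hums"),
    ("prolific_author", 10, "a zorble device again"),
    ("prolific_author", 10, "zorble theory explained"),
    ("popular_author", 100000, "the river runs home"),
]

def relative_frequency(word, weighted):
    """Relative frequency of `word` across all books.

    weighted=False counts each book once (library-like counting).
    weighted=True scales each book by its assumed readership.
    """
    counts = Counter()
    for _author, copies_read, text in books:
        weight = copies_read if weighted else 1
        for token in text.split():
            counts[token] += weight
    total = sum(counts.values())
    return counts[word] / total

library_freq = relative_frequency("zorble", weighted=False)
readership_freq = relative_frequency("zorble", weighted=True)

print(f"library-like frequency:        {library_freq:.6f}")
print(f"readership-weighted frequency: {readership_freq:.6f}")
```

With each book counted once, the prolific but little-read author's coinage accounts for a large share of all tokens; under the assumed readership weights, it nearly vanishes. This gap is the sense in which the corpus behaves more like a lexicon than like text as it is actually read.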
