Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary?
Adam Nohejl, Frederikus Hudi, Eunike Andriani Kardinata, Shintaro Ozaki, Maria Angelica Riera Machin, Hongyu Sun, Justin Vasselli, Taro Watanabe

TL;DR
This study shows that YouTube subtitles can effectively approximate spoken vocabulary frequencies across multiple languages, often outperforming traditional film subtitle resources and even large language models in predicting lexical complexity.
Contribution
It introduces a method to derive frequency norms from YouTube subtitles for diverse languages, providing a new resource for psycholinguistic research, especially where traditional corpora are unavailable.
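To make the idea of a frequency norm concrete, here is a minimal sketch of turning subtitle lines into per-word counts, frequency per million words, and Zipf-scale values (log10 of frequency per million, plus 3). It is not the authors' pipeline: the regex tokenizer, the function names `tokenize` and `frequency_norms`, and the absence of any cleaning or deduplication are all assumptions made for illustration. Languages such as Chinese and Japanese would need a proper word segmenter instead of the regex used here.

```python
import math
import re
from collections import Counter


def tokenize(line):
    # Hypothetical tokenizer (an assumption, not the paper's method):
    # lowercase the line and keep runs of letters, allowing one internal
    # apostrophe so that "don't" stays a single token.
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", line.lower())


def frequency_norms(subtitle_lines):
    # Count every token across all subtitle lines.
    counts = Counter()
    for line in subtitle_lines:
        counts.update(tokenize(line))

    total = sum(counts.values())
    norms = {}
    for word, count in counts.items():
        per_million = count / total * 1_000_000
        norms[word] = {
            "count": count,
            "per_million": per_million,
            # Zipf scale: log10 of frequency per billion words,
            # i.e. log10(frequency per million) + 3.
            "zipf": math.log10(per_million) + 3,
        }
    return norms


# Toy input standing in for processed subtitle text.
norms = frequency_norms(["The cat sat.", "the cat ran!"])
print(norms["the"]["count"], round(norms["the"]["zipf"], 2))
```

On a real corpus the logarithmic Zipf values are usually what gets correlated with behavioural measures, since raw counts are heavily skewed.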
Findings
YouTube-derived frequencies correlate strongly with lexical decision times and word familiarity.
Linear regression models using YouTube frequencies outperform both models based on film subtitle frequencies and GPT-4 in predicting lexical complexity.
The approach is applicable to languages lacking high-quality speech or subtitle corpora.
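The two kinds of evaluation named in the findings, correlation with a behavioural measure and a linear regression that predicts lexical complexity from frequency, can be sketched as follows. This is an illustrative sketch only: the helper names `pearson` and `fit_line` are hypothetical, the numbers are made-up toy values rather than results from the paper, and the summary does not say which correlation coefficient or regression setup the authors actually used.

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (invented for illustration): log frequency per word and a
# complexity rating in [0, 1], where rarer words are rated as harder.
log_freq = [1.0, 2.0, 3.0, 4.0]
complexity = [0.8, 0.6, 0.4, 0.2]

r = pearson(log_freq, complexity)
slope, intercept = fit_line(log_freq, complexity)

# Predict the complexity of an unseen word from its log frequency.
predicted = slope * 2.5 + intercept
print(round(r, 3), round(slope, 3), round(intercept, 3), round(predicted, 3))
```

A negative correlation is the expected direction here: more frequent words tend to be recognised faster and rated as less complex.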
Abstract
Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that frequencies extracted from carefully processed YouTube subtitles provide an approximation comparable to, and often better than, the best currently available resources. Moreover, they are available for languages for which a high-quality subtitle or speech corpus does not exist. We use YouTube subtitles to construct frequency norms for five diverse languages, Chinese, English, Indonesian, Japanese, and Spanish, and evaluate their correlation with lexical decision time, word familiarity, and…
Taxonomy
Topics: Subtitles and Audiovisual Media · Second Language Acquisition and Learning
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
