How much is said in a microblog? A multilingual inquiry based on Weibo and Twitter
Han-Teng Liao, King-wa Fu, Scott A. Hale

TL;DR
This study compares the amount of information and text length in microblog posts across English, Chinese, and Japanese on Weibo and Twitter, revealing language-specific differences and implications for platform design.
Contribution
It introduces a multilingual framework for quantifying information content in microblogs using parallel corpora and analyzes cross-linguistic differences in microblog text.
Findings
Languages with larger character sets contain more information per character.
Information content varies by organization type and language.
Chinese and Japanese microblogs carry more information per character than English.
Abstract
This paper presents a multilingual study of three questions about a single microblog post: (a) how much can be said, (b) how much is written in terms of characters and bytes, and (c) how much is said in terms of information content in posts by different organizations in different languages. Focusing on three languages (English, Chinese, and Japanese), this research analyses the Weibo and Twitter accounts of major embassies and news agencies. We first establish our criterion for quantifying "how much can be said" in a digital text based on the openly available Universal Declaration of Human Rights and the translated subtitles from TED talks. These parallel corpora allow us to determine the number of characters and bits needed to represent the same content in different languages and character encodings. We then derive the amount of information that is actually contained in microblog posts…
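The character and byte comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `size_profile` and `relative_length`, the choice of encodings, and the three sample strings (assumed here to be the first sentence of Article 1 of the Universal Declaration of Human Rights in each language) are illustrative assumptions.

```python
from typing import Dict, Iterable

# Illustrative parallel text (an assumption for this sketch): the first
# sentence of UDHR Article 1 in English, Chinese, and Japanese.
PARALLEL: Dict[str, str] = {
    "en": "All human beings are born free and equal in dignity and rights.",
    "zh": "人人生而自由，在尊严和权利上一律平等。",
    "ja": "すべての人間は、生れながらにして自由であり、かつ、尊厳と権利とについて平等である。",
}


def size_profile(text: str,
                 encodings: Iterable[str] = ("utf-8", "utf-16-le")) -> Dict[str, int]:
    """Return the character count and the byte count under each encoding.

    len(text) counts Unicode code points, not grapheme clusters, which is
    only an approximation of "characters" as a user perceives them.
    """
    profile = {"chars": len(text)}
    for enc in encodings:
        # "utf-16-le" is used instead of "utf-16" to avoid counting a BOM.
        profile[enc] = len(text.encode(enc))
    return profile


def relative_length(parallel: Dict[str, str], base: str = "en") -> Dict[str, float]:
    """Character count of each translation relative to the base language."""
    base_chars = len(parallel[base])
    return {lang: len(text) / base_chars for lang, text in parallel.items()}


if __name__ == "__main__":
    for lang, text in PARALLEL.items():
        print(lang, size_profile(text))
    print(relative_length(PARALLEL))
```

On a single sentence the ratios are only indicative; a comparison of the kind the abstract describes would aggregate over a whole parallel corpus.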
Taxonomy
Topics: Social Media and Politics · Digital Communication and Language · Hate Speech and Cyberbullying Detection
