Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results
Matteo Di Cristofaro

TL;DR
This paper investigates how emojis and homoglyphs affect tokenisation in corpora, emphasising the importance of preprocessing for accurate linguistic analysis and data fidelity.
Contribution
It introduces methods to improve tokenisation accuracy by addressing emojis and homoglyphs, ensuring more reliable corpus representations.
Findings
Preprocessing emojis and homoglyphs enhances data fidelity.
Accurate tokenisation is crucial for valid linguistic analysis.
The study supports better reproducibility in corpus research.
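To make the emoji finding concrete, here is a minimal sketch of how an emoji attached to a word can distort a whitespace-based token list, and how a padding step changes the result. It is an illustration under stated assumptions, not the method described in the paper: the names `naive_tokenise` and `pad_emojis` are hypothetical, and the `EMOJI` pattern covers only a rough subset of emoji code points (it ignores ZWJ sequences, skin-tone modifiers, and variation selectors).

```python
import re

# Assumption: a rough, incomplete range of emoji code points, for illustration only.
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def naive_tokenise(text):
    """Split on whitespace only, as a simple tokeniser might."""
    return text.split()


def pad_emojis(text):
    """Surround each matched emoji with spaces so it becomes its own token."""
    return EMOJI.sub(lambda m: f" {m.group(0)} ", text)


sample = "great\U0001F44D job"  # "great👍 job"

without_preprocessing = naive_tokenise(sample)
with_preprocessing = naive_tokenise(pad_emojis(sample))

print(without_preprocessing)  # the emoji stays glued to "great"
print(with_preprocessing)     # the emoji is a separate token
```

Without the padding step, the word and the emoji are counted as one type, so frequency and collocation figures for "great" would miss this occurrence.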
Abstract
Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The findings emphasise the necessity of a detailed understanding of both linguistic and…
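The effect of homoglyphs on quantitative results can be sketched in a few lines. The example below is an assumption-laden illustration, not the paper's procedure: the `HOMOGLYPHS` table is a tiny hand-made mapping of three Cyrillic look-alikes to Latin letters, and `fold_homoglyphs` is a hypothetical helper. It shows that two visually identical strings are counted as different types until the look-alike characters are mapped to a common form.

```python
from collections import Counter

# Assumption: a tiny hand-made table mapping Cyrillic look-alikes to Latin letters.
# A real project would need a much fuller confusables list.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
})


def fold_homoglyphs(text):
    """Replace the mapped look-alike characters with their Latin counterparts."""
    return text.translate(HOMOGLYPHS)


# The second token contains a Cyrillic "а" but looks the same on screen.
tokens = ["paypal", "p\u0430ypal", "paypal"]

raw_counts = Counter(tokens)
folded_counts = Counter(fold_homoglyphs(t) for t in tokens)

print(raw_counts)     # two distinct types
print(folded_counts)  # one type with frequency 3
```

Blanket folding of this kind would corrupt genuinely Cyrillic text, so any such step would need to be restricted to contexts where the look-alike is known to be an intrusion, and documented so that the corpus remains traceable to its source.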
Taxonomy
Topics: Digital Communication and Language · Second Language Acquisition and Learning · Linguistics, Language Diversity, and Identity
