Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results

Matteo Di Cristofaro

arXiv:2507.01764·cs.CL·July 3, 2025

Open Access

TL;DR

This paper investigates how emojis and homoglyphs affect tokenisation in corpora, emphasising the need to preprocess them for accurate linguistic analysis and data fidelity.

Contribution

It presents methods for handling emojis and homoglyphs during tokenisation, so that corpora represent their source texts more faithfully.

Findings

01. Preprocessing emojis and homoglyphs enhances data fidelity.

02. Accurate tokenisation is crucial for valid linguistic analysis.

03. The study supports better reproducibility in corpus research.

Abstract

Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The findings emphasise the necessity of a detailed understanding of both linguistic and…
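The kind of interference the abstract describes can be illustrated with a minimal Python sketch. It is not the paper's method: the function names (`naive_tokenise`, `symbol_aware_tokenise`, `scripts_in`, `mixed_script_tokens`) and the sample strings are hypothetical, and treating the first word of a character's Unicode name as its script is a rough approximation. The sketch only shows that a homoglyph yields a separate word type that looks identical on screen, and that an emoji is either dropped or split into several code points depending on the tokenisation rule.

```python
import re
import unicodedata
from collections import Counter


def naive_tokenise(text):
    """Keep only runs of word characters; emojis and other symbols are silently dropped."""
    return re.findall(r"\w+", text)


def symbol_aware_tokenise(text):
    """Keep word runs and emit every other non-space character as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def scripts_in(token):
    """Approximate the scripts in a token from the first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in token if ch.isalpha()}


def mixed_script_tokens(tokens):
    """Return tokens whose letters come from more than one script (possible homoglyphs)."""
    return [tok for tok in tokens if len(scripts_in(tok)) > 1]


if __name__ == "__main__":
    latin = "paypal"
    spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A (illustrative sample)
    # The two strings look alike but are counted as two distinct types.
    print(Counter(naive_tokenise(f"{latin} {spoofed}")))
    print(mixed_script_tokens([latin, spoofed]))

    # One visible emoji made of five code points: man + ZWJ + woman + ZWJ + girl.
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
    print(naive_tokenise(f"great {family} day"))  # the emoji disappears
    print(symbol_aware_tokenise(family))          # the emoji is split into pieces
```

Neither tokeniser keeps the emoji as a single unit, so frequency and collocation counts derived from either would diverge from the source text. A corpus pipeline would need an explicit preprocessing decision for such sequences and for mixed-script tokens.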

Taxonomy

Topics: Digital Communication and Language · Second Language Acquisition and Learning · Linguistics, Language Diversity, and Identity