Thesis proposal: Are We Losing Textual Diversity to Natural Language Processing?
Josef Jon

TL;DR
This thesis investigates whether current NLP algorithms, especially Neural Machine Translation, diminish textual diversity, potentially impacting language richness, by analyzing statistical properties of texts and exploring alternative methods to preserve diversity.
Contribution
It introduces measures to quantify text diversity and examines the limitations of NMT systems in maintaining this diversity, proposing new approaches to improve global planning in translation.
Findings
NMT systems tend to reduce textual diversity compared to human translation.
Training objectives and decoding algorithms may contribute to loss of diversity.
Proposed alternative methods aim to preserve the intrinsic variability of language.
Abstract
This thesis argues that the Natural Language Processing algorithms in wide use today may have limitations related to the properties of the texts they handle and produce. With these tools being adopted rapidly, we must ask what these limitations are and what the implications might be of integrating such tools even more deeply into our daily lives. As a testbed, we have chosen the task of Neural Machine Translation (NMT). Nevertheless, we aim for general insights and outcomes, applicable even to current Large Language Models (LLMs). We ask whether the algorithms used in NMT have inherent inductive biases that are beneficial for most types of inputs but might harm the processing of atypical texts. To explore this hypothesis, we define a set of measures to quantify text diversity based on its statistical properties, like uniformity or rhythmicity of…
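To make the idea of a statistical diversity measure concrete, here is a minimal sketch of two common lexical statistics: type-token ratio and unigram Shannon entropy. These are illustrative stand-ins and are not claimed to be the measures defined in the thesis. The function names, the whitespace tokenizer, and the two sample sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def type_token_ratio(tokens):
    """Fraction of distinct tokens; higher values suggest a more varied vocabulary."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unigram_entropy(tokens):
    """Shannon entropy (in bits) of the empirical unigram distribution."""
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical sentences, invented purely for illustration.
    samples = {
        "varied": "the cat dozed on the mat while a dog slept nearby",
        "repetitive": "the cat sat on the mat and the dog sat on the mat",
    }
    for name, text in samples.items():
        toks = tokenize(text)
        print(name, type_token_ratio(toks), unigram_entropy(toks))
```

Both statistics depend on text length, so a comparison between human and machine translations would need samples of comparable size, or a length-normalized variant.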
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
