An Inclusive Notion of Text
Ilia Kuznetsov, Iryna Gurevych

TL;DR
This paper proposes a comprehensive framework and terminology to better understand and categorize the diverse notions of text in NLP, aiming to improve reproducibility and generalizability.
Contribution
It introduces a two-tier taxonomy of linguistic and non-linguistic elements in text and surveys existing work extending the concept of text beyond traditional views.
Findings
The taxonomy clarifies the different types of textual data used in NLP.
The survey highlights diverse approaches that extend the notion of text beyond the language-centered view.
Community reporting is essential for consolidating the inclusive notion of text.
Abstract
Natural language processing (NLP) researchers develop models of grammar, meaning and communication based on written text. Due to task and data differences, what is considered text can vary substantially across studies. A conceptual framework for systematically capturing these differences is lacking. We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. Towards that goal, we propose common terminology to discuss the production and transformation of textual data, and introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling. We apply this taxonomy to survey existing work that extends the notion of text beyond the conservative language-centered view. We outline key desiderata and challenges of the emerging inclusive approach to text in NLP, and suggest…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Translation Studies and Practices
