Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages
Felix Schneider, Maria Gogolev, Sven Sickert, Joachim Denzler

TL;DR
This paper introduces Rich Character Embeddings (RCE), a transformer-based character-level word representation method that captures morphological and orthographic features, improving NLP performance in low-resource and morphologically complex languages.
Contribution
The paper proposes a novel hybrid transformer-convolutional model for character embeddings that outperforms traditional token-based methods on tasks in low-resource and morphologically rich languages.
Findings
RCE outperforms token-based embeddings on limited-data tasks.
The hybrid model improves performance on tasks in highly inflected languages.
The approach is effective in both large and small language models.
Abstract
Tokenization and sub-tokenization based models like word2vec, BERT and the GPTs are the state-of-the-art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and under-resourced languages. To mitigate this problem, we propose to compute word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as a drop-in replacement for dictionary- and subtoken-based word embeddings in existing model architectures. It has the potential to improve performance for both large…
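Illustration
The abstract describes computing a word vector directly from a character string, and a hybrid variant that mixes transformer and convolutional mechanisms. The toy sketch below shows the general idea only and is not the authors' architecture: every name in it (DIM, char_vector, self_attention, conv1d_mean, word_embedding) is hypothetical, the character vectors are seeded pseudo-random stand-ins for learned parameters, the attention uses identity projections with no positional encoding, and the "convolution" is a plain sliding-window average.

```python
import math
import random

DIM = 8  # hypothetical per-branch embedding width, chosen for illustration


def char_vector(ch, dim=DIM):
    """Fixed pseudo-random vector per character (stand-in for a learned table)."""
    rng = random.Random(ord(ch))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, vectors))
                    for i in range(dim)])
    return out


def conv1d_mean(vectors, width=3):
    """Average each sliding window of `width` characters (untrained stand-in
    for a convolutional filter); short words use one window over all characters."""
    dim = len(vectors[0])
    if len(vectors) < width:
        windows = [vectors]
    else:
        windows = [vectors[i:i + width] for i in range(len(vectors) - width + 1)]
    return [[sum(v[i] for v in w) / len(w) for i in range(dim)] for w in windows]


def mean_pool(vectors):
    """Collapse a sequence of vectors into one vector by averaging."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def word_embedding(word):
    """Concatenate an attention-based summary and a local-window summary."""
    if not word:
        raise ValueError("word must contain at least one character")
    chars = [char_vector(c) for c in word]
    global_part = mean_pool(self_attention(chars))
    local_part = mean_pool(conv1d_mean(chars))
    return global_part + local_part  # list concatenation, length 2 * DIM


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Because the vector is built from characters, any string receives an embedding without a vocabulary lookup, and inflected forms that share most of their characters (for example "Haus" and "Hauses") can be compared with cosine(word_embedding("Haus"), word_embedding("Hauses")). In a trained model the character table, attention projections, and convolution filters would be learned rather than fixed.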
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
