Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages

Felix Schneider; Maria Gogolev; Sven Sickert; Joachim Denzler

arXiv:2602.21377·cs.CL·February 26, 2026

Open Access

TL;DR

This paper introduces Rich Character Embeddings (RCE), a transformer-based character-level word representation method that captures morphological and orthographic features, improving NLP performance in low-resource and morphologically complex languages.

Contribution

The paper proposes Rich Character Embeddings together with a novel hybrid transformer-convolutional variant, both reported to outperform traditional token-based methods on low-resource and morphologically rich language tasks.

Findings

01 RCE outperforms token-based embeddings on limited-data tasks.

02 The hybrid model improves performance on tasks in inflected languages.

03 The approach is effective in both large and small language models.

Abstract

Tokenization- and sub-tokenization-based models like word2vec, BERT and the GPTs are the state of the art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and under-resourced languages. To mitigate this problem, we propose to compute word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as a drop-in replacement for dictionary- and subtoken-based word embeddings in existing model architectures. They have the potential to improve performance for both large…
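
The limitation described in the abstract can be illustrated with a toy example. The sketch below is not the RCE model and uses no transformer or convolution. It only contrasts a dictionary lookup, which has no entry for an unseen inflected form, with a vector built directly from a word's characters. The functions `char_ngrams` and `cosine`, the trigram size, the boundary markers, and the sample German words are all assumptions chosen for illustration.

```python
from collections import Counter
import math


def char_ngrams(word: str, n: int = 3) -> Counter:
    """Sparse character n-gram count vector for a word (hypothetical helper)."""
    padded = f"<{word}>"  # "<" and ">" mark the word boundaries
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# A dictionary-based vocabulary has no entry for the inflected form.
vocab = {"Haus": 0, "Baum": 1}
unseen_id = vocab.get("Hauses")  # None

# A character-derived vector still relates the inflected form to its stem.
sim_inflected = cosine(char_ngrams("Haus"), char_ngrams("Hauses"))
sim_unrelated = cosine(char_ngrams("Haus"), char_ngrams("Baum"))
print(unseen_id, round(sim_inflected, 3), sim_unrelated)
```

In this sketch "Haus" and "Hauses" share three of their trigrams, so their similarity is well above that of "Haus" and "Baum", which share none. A learned character model such as the one the abstract describes would replace these fixed counts with trained representations.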

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification