LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation

Hailay Teklehaymanot; Dren Fazlija; Wolfgang Nejdl

arXiv:2603.22629 · cs.CL · March 25, 2026

TL;DR

This paper introduces LGSE, a method for initializing subword embeddings using morphological information to improve low-resource language adaptation in NLP tasks.

Contribution

LGSE leverages morphologically informed segmentation and regularization during pretraining to enhance embedding quality for low-resource, morphologically rich languages.

Findings

01

LGSE outperforms baseline methods on multiple NLP tasks.

02

Morphologically grounded initialization improves embedding quality.

03

The method is effective for languages that have morphological segmentation resources available.

Abstract

Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture…
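
The initialization rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `init_embedding`, `segment`, `morpheme_vecs`, and `ngram_vecs` are hypothetical, the n-gram range (2 to 3 characters), the requirement that every morpheme have a known vector, and the zero-vector last resort are all assumptions, and the toy vectors stand in for pretrained subword or FastText representations.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty sequence of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def char_ngrams(token: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Character n-grams of the token padded with boundary markers (assumed range)."""
    padded = f"<{token}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def init_embedding(
    token: str,
    segment: Callable[[str], List[str]],   # hypothetical morphological segmenter
    morpheme_vecs: Dict[str, Vector],      # hypothetical pretrained morpheme vectors
    ngram_vecs: Dict[str, Vector],         # hypothetical character n-gram vectors
    dim: int,
) -> Vector:
    """Initialize a new token's embedding from morphemes, else from n-grams."""
    morphemes = segment(token)
    known = [morpheme_vecs[m] for m in morphemes if m in morpheme_vecs]
    # Assumption: use the morpheme average only if every morpheme has a vector.
    if morphemes and len(known) == len(morphemes):
        return mean_vector(known)
    # Fallback: average the vectors of the character n-grams that are known.
    grams = [ngram_vecs[g] for g in char_ngrams(token) if g in ngram_vecs]
    if grams:
        return mean_vector(grams)
    # Assumption: zero vector when nothing is known (not stated in the source).
    return [0.0] * dim


# Toy data for illustration only.
toy_morphemes = {"un": [1.0, 0.0], "lock": [0.0, 1.0]}
toy_ngrams = {"<x": [2.0, 2.0], "xy": [4.0, 0.0]}


def toy_segment(token: str) -> List[str]:
    """Stand-in segmenter: returns morphemes for known words, else nothing."""
    return {"unlock": ["un", "lock"]}.get(token, [])


segmented = init_embedding("unlock", toy_segment, toy_morphemes, toy_ngrams, dim=2)
fallback = init_embedding("xy", toy_segment, toy_morphemes, toy_ngrams, dim=2)
unknown = init_embedding("q", toy_segment, toy_morphemes, toy_ngrams, dim=2)
```

In this toy setup, "unlock" is averaged from its two morpheme vectors, "xy" has no segmentation and is built from its known character n-grams, and "q" falls through to the assumed zero vector. In practice the segmenter and vector tables would come from whatever morphological resources and pretrained embeddings are available for the target language.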

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification