LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation
Hailay Teklehaymanot, Dren Fazlija, Wolfgang Nejdl

TL;DR
This paper introduces LGSE, a method for initializing subword embeddings using morphological information to improve low-resource language adaptation in NLP tasks.
Contribution
LGSE leverages morphologically informed segmentation and regularization during pretraining to enhance embedding quality for low-resource, morphologically rich languages.
Findings
LGSE outperforms baseline methods on multiple NLP tasks.
Morphologically grounded initialization improves embedding quality.
The method is effective for languages with available morphological segmentation resources.
Abstract
Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture…
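The initialization rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The names `init_embedding`, `segmentations`, `morpheme_vecs`, and `ngram_vecs` are hypothetical, the segmenter is reduced to a dictionary lookup, and plain Python lists stand in for the pretrained subword or FastText vectors. The sketch also assumes that a token falls back to character n-grams whenever any of its morphemes lacks a vector, a detail the abstract does not specify.

```python
from typing import Dict, List, Optional, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def char_ngrams(token: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Character n-grams of the token padded with boundary markers."""
    padded = f"<{token}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def init_embedding(
    token: str,
    segmentations: Dict[str, List[str]],  # hypothetical morphological segmenter
    morpheme_vecs: Dict[str, Vector],     # stand-in for pretrained morpheme vectors
    ngram_vecs: Dict[str, Vector],        # stand-in for character n-gram vectors
) -> Optional[Vector]:
    """Average morpheme vectors; otherwise fall back to character n-grams."""
    morphemes = segmentations.get(token, [])
    known = [morpheme_vecs[m] for m in morphemes if m in morpheme_vecs]
    # Assumption: use the morpheme average only when every morpheme has a vector.
    if morphemes and len(known) == len(morphemes):
        return mean_vector(known)
    grams = [ngram_vecs[g] for g in char_ngrams(token) if g in ngram_vecs]
    # None signals that the caller must choose another initialization.
    return mean_vector(grams) if grams else None
```

With toy data such as `segmentations = {"unlocked": ["un", "lock", "ed"]}`, the new token's embedding is the mean of its three morpheme vectors. A token missing from `segmentations` is instead built from whichever of its character n-grams have vectors.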
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
