Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning
Stanley Ngugi

TL;DR
This paper introduces Targeted Lexical Injection (TLI), a fine-tuning method that significantly enhances cross-lingual lexical alignment in low-resource language models by leveraging early-layer representations.
Contribution
It presents a novel, efficient fine-tuning approach using LoRA and contrastive learning to improve lexical alignment in LLMs for low-resource languages, focusing on early-layer embeddings.
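A contrastive objective of the kind mentioned above can be sketched as follows. This is only an illustration: this summary does not say which contrastive loss the paper uses, so the sketch assumes an InfoNCE-style in-batch loss, and the function names, the temperature value, and the toy vectors are all hypothetical. The vectors stand in for embeddings that would be extracted from the model. No model or LoRA adapter is loaded here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(sw_vecs, en_vecs, temperature=0.1):
    """Assumed InfoNCE-style loss (not confirmed as the paper's objective).

    sw_vecs[i] and en_vecs[i] are a translation pair (the positive);
    every other en_vecs[j] in the batch acts as a negative.
    """
    n = len(sw_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(sw_vecs[i], en_vecs[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching index i as the target.
        total += log_denom - logits[i]
    return total / n

# Toy stand-ins for Swahili and English word embeddings.
swahili = [[1.0, 0.0], [0.0, 1.0]]
english_aligned = [[1.0, 0.0], [0.0, 1.0]]
english_shuffled = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce_loss(swahili, english_aligned)
shuffled_loss = info_nce_loss(swahili, english_shuffled)
print(aligned_loss, shuffled_loss)
```

The loss is small when each Swahili vector points the same way as its English translation and large when the pairs are mismatched, which is the pressure a contrastive objective applies during fine-tuning.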
Findings
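TLI raises the average cosine similarity of trained Swahili-English word pairs from 0.3211 to 0.4113, a relative gain of about 28%.
For unseen word pairs, average similarity rises from 0.3143 to 0.4033, also a relative gain of about 28%, which indicates that the improvement generalizes beyond the training vocabulary.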
Early-layer representations (Layer 2 in a pilot study, with ~0.99998 average cosine similarity) are already almost perfectly aligned, which motivates targeting them during fine-tuning.
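The similarity figures above are averages of cosine similarity over word pairs. A minimal sketch of that metric, and of the relative gain implied by the reported numbers, is given below. The helper names and toy vectors are hypothetical, and the sketch does not reproduce how the paper extracts embeddings from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pair_similarity(pairs):
    """Average cosine similarity over (swahili_vec, english_vec) pairs."""
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

def relative_gain(before, after):
    """Relative improvement of `after` over `before`."""
    return (after - before) / before

# Toy pairs: one perfectly aligned, one orthogonal.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
print(mean_pair_similarity(toy_pairs))

# Relative gains computed from the figures reported in Findings.
print(f"trained pairs: {relative_gain(0.3211, 0.4113):.1%}")
print(f"unseen pairs:  {relative_gain(0.3143, 0.4033):.1%}")
```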
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their performance in low-resource languages (LRLs), such as Swahili, often lags due to data scarcity and underrepresentation in pre-training. A key challenge is achieving robust cross-lingual lexical alignment, crucial for tasks like translation and cross-lingual information retrieval. This paper introduces Targeted Lexical Injection (TLI), a novel and efficient fine-tuning approach. We first demonstrate that Lugha-Llama-8B-wura, a Swahili-centric LLM, exhibits strong, near-perfect lexical alignment for Swahili-English word pairs in its early internal layers (specifically Layer 2, with ~0.99998 average cosine similarity based on a pilot study), a capability not fully reflected in its final output representations (baseline ~0.32 similarity on our evaluation set). TLI leverages this insight by using Low-Rank…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
