Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Rajan Agarwal, Aarush Gupta

TL;DR
This paper introduces LLINK, a method for improving the cross-lingual performance of instruction-tuned LLMs on low-resource languages. It aligns sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space, without retraining the decoder or changing the tokenizer.
Contribution
LLINK is a compute-efficient approach that conditions a frozen decoder using latent language injection, enhancing cross-lingual alignment and retrieval for low-resource languages.
Findings
Substantial improvement in bilingual retrieval performance.
Achieved 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations.
Reduced tokenization inflation and strengthened cross-lingual alignment.
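
To make "tokenization inflation" concrete, here is a minimal sketch that uses UTF-8 bytes per code point as a rough proxy for how much a byte-level tokenizer can fragment a script it has seen little of. This is an illustration only, not the paper's metric: the helper name `utf8_bytes_per_char` and the choice of Khmer as the example script are assumptions made here, and a real measurement would count tokens from the decoder's actual tokenizer.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per Unicode code point.

    Hypothetical helper: a crude proxy for fragmentation under a
    byte-level tokenizer, not a tokenizer itself.
    """
    return len(text.encode("utf-8")) / len(text)


english = "hello"
# A Khmer greeting written with escapes; every code point lies in the
# Khmer block (U+1780-U+17FF), which UTF-8 encodes with 3 bytes each.
khmer = "\u179f\u17bd\u179f\u17d2\u178f\u17b8"

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(khmer))    # 3.0
```

When a tokenizer has few merged units for a script, its token count drifts toward this byte count, so the same sentence costs several times more context than its English counterpart.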
Abstract
Instruction-tuned Large Language Models (LLMs) underperform on low-resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute-efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that improvements can be attributed…
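
The two stages described in the abstract can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the linear projector, the InfoNCE form of the contrastive loss, the per-slot linear expansion, the temperature, and all names and dimensions (`W_proj`, `slot_weights`, `reserved_pos`, and so on) are hypothetical, and random arrays stand in for the frozen encoder's sentence embeddings and the decoder's embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(projected, targets, temperature=0.07):
    """One-directional InfoNCE: row i of `projected` should match row i of `targets`.

    Assumption: the paper says only "contrastive"; InfoNCE is one common choice.
    """
    logits = l2_normalize(projected) @ l2_normalize(targets).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def expand_to_slots(vec, slot_weights):
    """Expand one projected vector (d_dec,) into K soft slots (K, d_dec).

    Assumption: one linear map per slot; the paper's expander may differ.
    """
    return np.einsum("kij,j->ki", slot_weights, vec)


# Hypothetical toy sizes.
d_enc, d_dec, K, batch = 16, 32, 4, 8

# Stage 1: project frozen-encoder embeddings into the decoder's space.
W_proj = rng.normal(scale=0.1, size=(d_enc, d_dec))   # trainable projector
enc_emb = rng.normal(size=(batch, d_enc))             # stand-in encoder outputs
dec_targets = rng.normal(size=(batch, d_dec))         # stand-in decoder targets
projected = enc_emb @ W_proj
loss = info_nce(projected, dec_targets)

# Stage 2: expand one projected vector into K slots and splice them into
# the prompt embeddings in place of a single reserved position.
slot_weights = rng.normal(scale=0.1, size=(K, d_dec, d_dec))
slots = expand_to_slots(projected[0], slot_weights)
prompt_embeds = rng.normal(size=(10, d_dec))          # stand-in prompt embeddings
reserved_pos = 3
decoder_inputs = np.concatenate(
    [prompt_embeds[:reserved_pos], slots, prompt_embeds[reserved_pos + 1:]],
    axis=0,
)

print(slots.shape, decoder_inputs.shape)
```

In training, only `W_proj`, `slot_weights`, and the adapters would receive gradients while the encoder and decoder weights stay fixed, which is what keeps the approach compute-efficient.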
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
