TL;DR
This paper introduces VocADT, an adapter-based method for vocabulary adaptation in pre-trained language models that requires no external resources and improves multilingual performance, especially for Latin-script and highly fragmented languages.
Contribution
We propose VocADT, a scalable adapter-based approach for vocabulary adaptation that outperforms baselines across diverse languages and tasks without relying on external embeddings.
Findings
Latin-script languages benefit the most from adaptation.
Highly fragmented languages see significant improvements.
Vocabulary adaptation remains beneficial after fine-tuning for translation.
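To make "fragmentation" concrete, the sketch below computes average tokens per word, one common proxy for over-fragmentation. It is an illustration only: the greedy longest-match tokenizer, the toy vocabularies, and the function names are all assumptions introduced here, not the paper's tokenizer or its fragmentation metric.

```python
# Illustrative only: a toy greedy tokenizer and a tokens-per-word measure.
# The vocabularies and names below are hypothetical, not from the paper.

def greedy_tokenize(word, vocab):
    """Split a word by greedy longest-match; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character is always accepted.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

def tokens_per_word(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)

old_vocab = {"an", "ha"}                       # stand-in for the original vocabulary
new_vocab = old_vocab | {"annyeong", "haseyo"}  # stand-in for the adapted vocabulary

text = "annyeong haseyo"
print(tokens_per_word(text, old_vocab))  # heavily fragmented
print(tokens_per_word(text, new_vocab))  # one token per word
```

A higher value under the original vocabulary than under the adapted one is the kind of gap that vocabulary adaptation is meant to close.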
Abstract
Vocabulary adaptation, which integrates new vocabulary into pre-trained language models, enables expansion to new languages and mitigates token over-fragmentation. However, existing approaches are limited by their reliance on heuristics or external embeddings. We propose VocADT, a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model's weights fixed. VocADT offers a flexible and scalable solution without depending on external resources or language constraints. Across 11 languages (with diverse scripts, resource availability, and fragmentation), we demonstrate that VocADT outperforms the original Mistral model and other baselines across various multilingual tasks including natural language understanding and machine translation. We find that Latin-script languages and highly…
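The core idea in the abstract, new-token embeddings expressed as a learned linear combination of frozen original embeddings, can be sketched in a few lines. This is a minimal toy under stated assumptions, not the authors' implementation: the matrix sizes, the names `E_OLD`, `TARGET`, `train`, and `loss`, and the squared-error objective are all hypothetical. In particular, `TARGET` is a stand-in for whatever training signal the real method uses; only the adapter weights are updated, and the original table is never modified.

```python
# Toy sketch: embeddings for new tokens = adapter @ frozen original embeddings.
# All names, sizes, and the squared-error objective are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Frozen original embedding table: 3 old tokens, embedding dimension 2.
E_OLD = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]

# Adapter: one row of mixing weights per new token (2 new tokens x 3 old tokens).
adapter = [[0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]

# Hypothetical target embeddings, standing in for the real training signal.
TARGET = [[0.5, 0.5],
          [2.0, 1.0]]

def loss(weights):
    """Squared error between the combined embeddings and the toy targets."""
    new_emb = matmul(weights, E_OLD)
    return sum((n - t) ** 2 for rn, rt in zip(new_emb, TARGET) for n, t in zip(rn, rt))

def train(weights, steps=500, lr=0.05):
    """Gradient descent on the adapter only; E_OLD stays fixed throughout."""
    weights = [row[:] for row in weights]
    e_old_t = [list(col) for col in zip(*E_OLD)]
    for _ in range(steps):
        resid = [[n - t for n, t in zip(rn, rt)]
                 for rn, rt in zip(matmul(weights, E_OLD), TARGET)]
        grad = matmul(resid, e_old_t)  # d(loss)/d(weights), up to the factor 2 below
        weights = [[w - lr * 2.0 * g for w, g in zip(rw, rg)]
                   for rw, rg in zip(weights, grad)]
    return weights

trained = train(adapter)
new_embeddings = matmul(trained, E_OLD)  # each row is a mix of original embeddings
```

Because every new embedding is produced by multiplying the adapter with the original table, the only trainable parameters are the mixing weights, which is what keeps the base model frozen in this sketch.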
Peer Reviews
Decision: ICLR 2025 Poster
1. The introduction of VocADT, an efficient, adapter-based approach to vocabulary adaptation without relying on external embeddings, is novel.
2. The comprehensive experimental evaluation covers a wide range of multilingual tasks and scripts across 11 languages. It compares the proposed method against three baselines to assess the benefits of vocabulary adaptation and the robustness of VocADT.
3. The paper is well-structured and clearly presented.
4. VocADT’s method for efficient adaptation with
1. Given that grouping Mixed-scripts and Cyrillic scripts had little impact on performance, it would have been more interesting to see if grouping all languages could have achieved identical results, demonstrating the language scalability of the method.
2. The auxiliary loss, intended to retain original embeddings for overlapping tokens, lacks analysis across alpha values; further exploration in non-Latin languages might have improved performance.
3. The paper evaluates the proposed method on Mi
1. The methodology is clearly explained with reference to the prior work.
2. The experiment design of this paper is well-crafted and supports the claims of the paper.
1. The motivation behind restricting the new embedding to a linear combination of original embeddings has not been explained.
2. Since the approach requires some amount of training, the authors should report the computational cost of their approach compared to the baselines.
3. The results in Appendix B show significant differences in performance gains across languages. The authors should perform analysis to determine why this happens. Does the amount of new pretraining data used for vo
1. The method proposed in this article allows for vocabulary adaptation by keeping the original model completely frozen and training only the adapter, which is highly efficient.
2. Experimental results demonstrate that the method presented in this paper can effectively improve performance.
3. The article provides a detailed analysis of the MT task, offering a more accurate assessment of the model on cross-lingual tasks.
1. The article does not clearly explain or analyze the advantages of VocADT compared to heuristic-based methods and those relying on external embeddings or networks. This causes a disconnect between the claims about existing methods' shortcomings in the introduction and the proposed method.
2. In VocADT, the initialization of embeddings and the use of auxiliary loss are very similar to existing work, raising concerns about the novelty of the paper.
3. The design of the adapter in this paper as
Code & Models
- 🤗 h-j-han/Mistral-7B-VocADT-50k-Latin · model · 2 dl
- 🤗 h-j-han/Mistral-7B-VocADT-50k-Mixed · model · 3 dl
- 🤗 h-j-han/Mistral-7B-VocADT-50k-Cyrillic · model · 2 dl
- 🤗 h-j-han/Llama2-7B-VocADT-50k-Latin · model · 2 dl
- 🤗 h-j-han/Llama2-7B-VocADT-50k-Mixed · model · 3 dl
- 🤗 h-j-han/Llama2-7B-VocADT-50k-Cyrillic · model · 3 dl
- 🤗 h-j-han/Mistral-7B-VocADT-50k-All · model · 2 dl
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification
Methods: Adapter
