DEPT: Decoupled Embeddings for Pre-training Language Models

Alex Iacob; Lorenzo Sani; Meghdad Kurmanji; William F. Shen; Xinchi Qiu; Dongqi Cai; Yan Gao; Nicholas D. Lane

arXiv:2410.05021 · cs.LG · April 8, 2025

TL;DR

DEPT introduces a decoupled embedding framework for pre-training language models that reduces communication costs, enhances robustness to data heterogeneity, and allows custom vocabularies, demonstrated at billion-scale federated settings.

Contribution

The paper presents a novel decoupled embedding approach enabling vocabulary-agnostic federated pre-training with significant efficiency gains and improved model performance.

Findings

1. Reduces communication costs by orders of magnitude.
2. Cuts embedding memory by 4-5 times.
3. Improves perplexity and downstream task performance.

Abstract

Language model pre-training uses broad data mixtures to enhance performance across domains and languages. However, training on such heterogeneous text corpora requires extensive and expensive efforts. Since these data sources vary significantly in lexical, syntactic, and semantic aspects, they cause negative interference or the "curse of multilinguality". To address these challenges, we propose a communication-efficient pre-training framework, DEPT. Our method decouples embeddings from the transformer body while simultaneously training the latter on multiple data sources without requiring a shared vocabulary. DEPT can: (1) train robustly and effectively under significant data heterogeneity, (2) minimize token embedding parameters to only what the data source vocabulary requires, while cutting communication costs in direct proportion to both the communication frequency and the reduction…
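The decoupling described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the "body" is a small matrix standing in for the transformer, `local_steps` replaces real optimizer steps with random perturbations, and the source names, vocabularies, and all function names are hypothetical. It shows one possible reading in which each data source keeps its own embedding table, sized to its own vocabulary, and only the body is averaged across sources.

```python
import random


def average_bodies(bodies):
    """Element-wise mean of equally shaped body matrices (one per data source)."""
    n = len(bodies)
    rows, cols = len(bodies[0]), len(bodies[0][0])
    return [[sum(b[r][c] for b in bodies) / n for c in range(cols)]
            for r in range(rows)]


def local_steps(body, embedding, steps, rng, scale=0.01):
    """Hypothetical stand-in for an inner training loop.

    Returns perturbed copies of the body and the source's embedding table.
    A real system would run optimizer steps on the source's data instead.
    """
    new_body = [row[:] for row in body]
    new_emb = {tok: vec[:] for tok, vec in embedding.items()}
    for _ in range(steps):
        for row in new_body:
            for c in range(len(row)):
                row[c] += rng.gauss(0.0, scale)
        for vec in new_emb.values():
            for c in range(len(vec)):
                vec[c] += rng.gauss(0.0, scale)
    return new_body, new_emb


def run_round(global_body, embeddings, steps, rng):
    """One outer round: every source trains locally, then only bodies are averaged."""
    local_bodies = []
    for name in list(embeddings):
        body, emb = local_steps(global_body, embeddings[name], steps, rng)
        embeddings[name] = emb  # stays with its source; never averaged or sent
        local_bodies.append(body)
    return average_bodies(local_bodies)


DIM = 4  # assumed hidden size, chosen only to keep the example tiny
rng = random.Random(0)

# Two hypothetical data sources with different, non-shared vocabularies.
vocabs = {
    "code": ["def", "return", "("],
    "news": ["the", "market", "rose", "today", "."],
}
embeddings = {name: {tok: [0.0] * DIM for tok in vocab}
              for name, vocab in vocabs.items()}
global_body = [[0.0] * DIM for _ in range(DIM)]

global_body = run_round(global_body, embeddings, steps=2, rng=rng)
```

In this sketch the embedding tables differ in size (3 and 5 rows) while the averaged body keeps a single shape, which is why no shared vocabulary is needed for the averaging step. Because only the body would be exchanged, the embedding parameters contribute nothing to the communicated volume here.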

Peer Reviews

Decision · ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The authors tackle an important problem. The usage of data mixtures during pre-training is not well understood but is an essential part of modern foundation models.
2. While the idea of using model averaging after an inner loop of training on dedicated subsets of data is not particularly novel, it might have a big impact on pre-training, given the encouraging results.

Weaknesses

1. Writing can be improved or misses important information. For example, for the experimental setup, I struggle to understand L205-311, and information on software/hardware, such as how many FLOPS or hours training took, is missing.
2. Some claims are overstated: M-T outperforms DEPT in 5/11 datasets in Table 2. I am not convinced that Trim and Glob perform identically (L377).
3. An important additional baseline would be models trained on individual data sets. This would give insights into the

Reviewer 02 · Rating 8 · Confidence 5

Strengths

- The setup proposed in this paper looks very satisfying, and it seems to solve several problems both in the industry and in research labs.
- The value proposition seems clear to me.
- The deployed methodology appears novel.
- The literature research looks satisfactory to me, given the scope of the paper.

Weaknesses

- [addressed] The paper's form is well below the required writing standards. To address this, I'd suggest specific improvements, such as:
  - Standardizing method names throughout the paper and tables (SPED vs SPEC, GlOB vs GLOB vs Glob, ...)
  - Clearly defining the performance metrics used and specifying explicitly whether lower or higher values are better
  - Adding a reference to Table 1 in the main text
  - Improving table readability by adding summary statistics (averages...), using bold o

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The paper is well-written and easy to follow.
- The idea of decoupling embedding matrix and transformer block in pre-training within the federated learning framework is novel.
- The authors answer the raised research questions with meaningful and extensive experiments.
- The results generally confirm that DEPT can improve the generalization and plasticity of the models.

Weaknesses

- The data sources are not always clear given a dataset. The proposed pipeline only works if the domains are known. Otherwise, some manual or automatic clustering has to be used to create different sets of data.
- The multi-domain data is almost only in English. But for the multilingual data, the data of each language should also contain various domains. Therefore there are confounding variables. A natural question would be whether the model can generalize to the same domains across different l

Videos

DEPT: Decoupled Embeddings for Pre-training Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare