Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

Tong Zhu; Daize Dong; Xiaoye Qu; Jiacheng Ruan; Wenliang Chen; Yu Cheng

arXiv:2406.11256·cs.CL·June 18, 2024

TL;DR

This paper introduces a dynamic data mixing strategy for MoE instruction tuning that adjusts dataset sampling weights based on inter-redundancies, improving model performance across various tasks.

Contribution

It proposes the first dynamic data mixture method for MoE instruction tuning, leveraging dataset representations to optimize sampling weights adaptively.

Findings

01 Enhanced performance on downstream tasks

02 Effective reduction of dataset redundancies

03 Improved open-ended query results

Abstract

Mixture-of-Experts (MoE) models have shown remarkable capability in instruction tuning, especially when the number of tasks scales. However, previous methods simply merge all training tasks (e.g. creative writing, coding, and mathematics) and apply fixed sampling weights, without considering the importance of different tasks as the model training state changes. In this way, the most helpful data cannot be effectively distinguished, leading to suboptimal model performance. To reduce the potential redundancies of datasets, we make the first attempt and propose a novel dynamic data mixture for MoE instruction tuning. Specifically, inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets. Finally, we propose to dynamically adjust the sampling weight of datasets by their inter-redundancies, thus maximizing…
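The reweighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes each dataset is summarized by a vector of expert routing frequencies (called `gate_loads` here), that redundancy is measured by L2 distance between those vectors, and that weights are updated multiplicatively and then renormalized. The function names, the `step_size` parameter, and the update rule itself are hypothetical choices made for illustration.

```python
import math


def gate_load_distances(gate_loads):
    """Pairwise L2 distances between per-dataset routing vectors (assumed representation)."""
    n = len(gate_loads)
    return [[math.dist(gate_loads[i], gate_loads[j]) for j in range(n)] for i in range(n)]


def update_sampling_weights(weights, gate_loads, step_size=1.0):
    """Raise the weight of datasets whose routing pattern differs most from the others.

    Hypothetical update rule: w_i <- w_i * exp(step_size * mean_distance_i), renormalized.
    Requires at least two datasets.
    """
    n = len(weights)
    dists = gate_load_distances(gate_loads)
    # Mean distance to the other datasets; a small value suggests redundancy.
    distinctiveness = [sum(row) / (n - 1) for row in dists]
    scores = [w * math.exp(step_size * d) for w, d in zip(weights, distinctiveness)]
    total = sum(scores)
    return [s / total for s in scores]


# Toy example: datasets 0 and 1 route tokens identically, dataset 2 does not.
toy_gate_loads = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
toy_weights = update_sampling_weights([1 / 3, 1 / 3, 1 / 3], toy_gate_loads)
```

In the toy example, the two datasets with identical routing vectors end up with equal weights, and the dataset with a distinct routing pattern receives a larger share. In a training loop, such an update would be recomputed periodically as the model's routing behavior changes.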

Code & Models

Repositories

spico197/moe-sft
PyTorch · Official

Videos

Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts · Underline

Taxonomy

Topics: Statistics Education and Methodologies · Gaussian Processes and Bayesian Inference · Data Stream Mining Techniques

Methods: Mixture of Experts