MoIN: Mixture of Introvert Experts to Upcycle an LLM

Ajinkya Tejankar; KL Navaneet; Ujjawal Panchal; Kossar Pourahmadi; Hamed Pirsiavash

arXiv:2410.09687·cs.LG·October 15, 2024

Open Access 3 Reviews

TL;DR

This paper introduces MoIN, a method to upcycle large language models by training lightweight, semantically specialized experts that are selectively loaded during inference, enabling efficient parallel training and inference without full-model retraining.

Contribution

The paper proposes a novel 'introvert' expert approach that isolates experts for specific data subsets, reducing training and inference complexity compared to traditional MoE models.

Findings

01

Experts can be trained in parallel without communication overhead.

02

Inference is highly parallelizable by distributing experts across GPUs.

03

The proof-of-concept demonstrates the approach's validity.
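Findings 01 and 02 rest on the experts being independent of one another. The sketch below illustrates one way routed queries could be batched per expert and spread across devices. The round-robin placement, the function names `assign_experts_to_devices` and `group_queries`, and the query identifiers are all hypothetical; the source does not describe how experts are placed on GPUs.

```python
from collections import defaultdict


def assign_experts_to_devices(num_experts, num_devices):
    """Round-robin placement (an assumption, not the paper's scheme)."""
    return {expert_id: expert_id % num_devices for expert_id in range(num_experts)}


def group_queries(routed, placement):
    """Group (query_id, expert_id) pairs as device -> expert -> [query_ids].

    Each query appears under exactly one expert, so devices never need to
    exchange activations while serving a batch.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for query_id, expert_id in routed:
        plan[placement[expert_id]][expert_id].append(query_id)
    # Convert nested defaultdicts to plain dicts for easy inspection.
    return {device: dict(per_expert) for device, per_expert in plan.items()}


# Toy usage: 5 experts on 2 devices, 4 already-routed queries.
placement = assign_experts_to_devices(num_experts=5, num_devices=2)
routed = [("q0", 3), ("q1", 0), ("q2", 3), ("q3", 4)]
plan = group_queries(routed, placement)
```

Because each query touches a single expert, every device can process its own groups without waiting on the others, which is the property the findings describe.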

Abstract

The goal of this paper is to improve (upcycle) an existing large language model without the prohibitive requirements of continued pre-training of the full model. The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset. An expert takes the form of a lightweight adapter added on top of a frozen base model. During inference, an incoming query is first routed to the most relevant expert, which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, the experts in our method do not work with other experts for a single query. Hence, we dub them "introvert" experts. Freezing the base model and keeping the experts as lightweight adapters allows extreme parallelism during training and inference. Training of all experts can be done in parallel without any communication channels between…
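The inference path described in the abstract (route the query to one expert, load that expert's adapter onto the frozen base, run the forward pass) can be sketched in a few lines. This is a toy illustration under stated assumptions: routing by cosine similarity to per-group centroids is assumed here and is not confirmed by the text above, and `route`, `FrozenBaseWithAdapter`, and the scalar "weights" are hypothetical stand-ins rather than the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query_embedding, centroids):
    """Pick the single most similar group (assumed routing rule)."""
    return max(range(len(centroids)),
               key=lambda i: cosine(query_embedding, centroids[i]))


class FrozenBaseWithAdapter:
    """Toy model: the base weight is never modified; one adapter is active."""

    def __init__(self, base_weight, adapters):
        self.base_weight = base_weight  # frozen
        self.adapters = adapters        # expert_id -> additive delta
        self.active = None

    def load_expert(self, expert_id):
        self.active = expert_id

    def forward(self, x):
        delta = self.adapters[self.active] if self.active is not None else 0.0
        return (self.base_weight + delta) * x


# Toy usage: three groups, one query embedding.
centroids = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
expert_id = route([0.1, 0.9], centroids)
model = FrozenBaseWithAdapter(2.0, {0: 0.5, 1: -0.5, 2: 0.25})
model.load_expert(expert_id)
output = model.forward(3.0)
```

Only one adapter is consulted per query, which is what distinguishes this from token-level MoE routing, where several experts may contribute to a single forward pass.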

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper introduces a novel way to upcycle language models. The new mixture-of-experts structure allows for efficient parallel training and inference, providing a framework that could streamline LLM deployments across multiple devices or settings.

Weaknesses

The improvement in language model performance is not obvious to me. In particular, Table 2 shows that MoIN-5k does not consistently outperform TinyLlama-2.5T across most downstream tasks, despite both being trained on the same number of tokens. This suggests that the main contribution of MoIN may currently lie more in reducing training and inference costs through parallelism than in enhancing the model's performance on language tasks.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. The writing in the paper is clear and accessible, allowing me to easily understand the implementation of the method and the execution of the experiments.
2. The authors highlight a critical issue within the current MoE framework: since routing varies by token, all experts must reside in GPU memory during both training and inference. This requirement poses challenges for scaling to a large number of experts. I believe this paper could inspire the community to explore alternative strategies f

Weaknesses

My main concern is that the experiments conducted are insufficient to support the main conclusions of the paper. The authors only evaluate a limited number of models and datasets. Including more diverse and realistic datasets and models would strengthen the findings and provide a more comprehensive evaluation of the proposed method.

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The authors trained and deployed up to ~5000 independent LoRA adapters, exceeding the number of LoRAs used in previous research. The inclusion of diagrams and tables with qualitative examples enhances the reader's understanding of the approach.

Weaknesses

Motivation and Method:
1. The paper's efficiency claims primarily stem from the use of LoRA, rather than from any novel contribution of this work.
2. The deployment of thousands of LoRA adapters potentially undermines these efficiency claims, especially at the inference stage. A more rigorous analysis comparing the computational requirements of this approach to traditional methods is necessary to substantiate these assertions.
Experimental Design and Results:
3. Building upon TinyLlama-2T instea

Taxonomy

Topics: Semantic Web and Ontologies · Data Mining Algorithms and Applications

Methods: Adapter · Balanced Selection