Pretrained Hybrids with MAD Skills

Nicholas Roberts; Samuel Guo; Zhiqi Gao; Satya Sai Srinath Namburi GNVV; Sonia Cromp; Chengjun Wu; Chengyu Duan; Frederic Sala

arXiv:2406.00894·cs.LG·October 1, 2025

Open Access · 3 Reviews

TL;DR

Manticore is a framework that automates the design of hybrid language models by combining pretrained models from different architectures, enabling efficient architecture search and improved performance without training from scratch.

Contribution

It introduces a method to automatically create and fine-tune hybrid language models using pretrained components, reducing manual effort and training costs.

Findings

01. Manticore hybrids match manually designed hybrids in performance.

02. They achieve strong results on Long Range Arena tasks.

03. Hybrids improve performance on various NLP benchmarks.

Abstract

While Transformers underpin modern large language models (LMs), there is a growing list of alternative architectures with new capabilities, promises, and tradeoffs. This makes choosing the right LM architecture challenging. Recently proposed hybrid architectures seek a best-of-all-worlds approach that reaps the benefits of all architectures. Hybrid design is difficult for two reasons: it requires manual expert-driven search, and new hybrids must be trained from scratch. We propose Manticore, a framework that addresses these challenges by automating the design of hybrid architectures while reusing pretrained models to create pretrained hybrids. Our approach augments ideas from differentiable Neural Architecture Search (NAS) by incorporating simple projectors that translate features between pretrained blocks from different architectures. We then fine-tune hybrids that combine pretrained…
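The abstract's central mechanism, projectors that translate features between pretrained blocks combined with NAS-style mixture weights, can be illustrated with a minimal forward-pass sketch. Everything named here is hypothetical and not taken from the paper: `ProjectedBlock`, `hybrid_forward`, the linear form of the projectors, the softmax over mixture logits (borrowed from standard differentiable NAS), and the toy `tanh`/ReLU functions standing in for pretrained blocks. Training of the projectors and mixture weights is omitted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weight, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ProjectedBlock:
    """Stand-in for a pretrained block wrapped with two linear projectors.

    Assumption: projectors are plain linear maps into and out of a
    shared feature space; the paper's exact form may differ.
    """

    def __init__(self, shared_dim, block_dim, block_fn, rng):
        self.proj_in = random_matrix(block_dim, shared_dim, rng)   # shared -> block
        self.proj_out = random_matrix(shared_dim, block_dim, rng)  # block -> shared
        self.block_fn = block_fn  # toy function in place of a real pretrained block

    def __call__(self, x):
        h = linear(self.proj_in, x)
        h = self.block_fn(h)
        return linear(self.proj_out, h)


def hybrid_forward(x, blocks, alpha_logits):
    """Softmax-weighted sum of block outputs in the shared feature space."""
    weights = softmax(alpha_logits)
    outputs = [block(x) for block in blocks]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x))
    ]


# Usage: two toy blocks with different internal widths, mixed equally.
rng = random.Random(0)
blocks = [
    ProjectedBlock(4, 6, lambda h: [math.tanh(v) for v in h], rng),
    ProjectedBlock(4, 8, lambda h: [max(0.0, v) for v in h], rng),
]
x = [1.0, -0.5, 0.25, 2.0]
y = hybrid_forward(x, blocks, alpha_logits=[0.0, 0.0])
```

With equal logits the hybrid output is the plain average of the two block outputs; pushing one logit far above the other recovers a single component block, which is how a differentiable search of this kind can, in principle, select an architecture.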

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is well written, and the motivation is clear and convincing. 2. The figures are well designed and helpful. 3. Interesting results that demonstrate the effectiveness of their framework on specific models.

Weaknesses

1. The experiments done are mostly on smaller models and it's not clear if MAD remains effective with larger models. The authors could do additional experiments with other open-source models to validate this.

Reviewer 02 · Rating 3 · Confidence 3

Strengths

+ Manticore’s approach to combining pretrained models from different architectures using projectors and mixture weights is innovative and extends beyond typical model merging methods. + Manticore’s design, which allows for fine-tuning and programming pretrained hybrids, offers a degree of flexibility, making it potentially beneficial for practitioners looking to leverage diverse model architectures. + Testing across LRA and MAD tasks provides an initial sense of the framework's potential, althou

Weaknesses

- The main claim, "Pretrained hybrids can outperform their component models on fine-tuning tasks," is not well-supported. A fair comparison would entail fine-tuning Manticore and its component models under the same budget to evaluate relative gains. Without this, it’s unclear whether the hybrid approach provides substantial benefits beyond those of individually optimized models. - Manticore requires a dedicated training process for projector layers and mixture weights, potentially adding overhea

Reviewer 03 · Rating 5 · Confidence 4

Strengths

This paper focuses on an important and relevant problem: using architectural components from different state-of-the-art model architectures to construct a hybrid model that provides the best of all worlds without incurring expensive pre-training and search space exploration overheads. It introduces a novel idea of projectors that enable different architectures to interact in each other's feature spaces by projecting into an intermediate shared feature space that acts as a translator for them.

Weaknesses

Although the idea of projectors is novel, using gating to combine the contributions of different architectures has already been explored in mixture of experts [1] and weighted ensemble averaging, and it finds a direct use in this paper. The evaluation compares the combined hybrid, which has 2x the number of parameters and >2x the FLOPs due to projectors and gating, against individual models of half the size. In Table 1, Mamba is already better than Pythia in all of the tasks; in Table 2, Mambaformer is also bett

Taxonomy

Topics: Simulation Techniques and Applications · Model-Driven Software Engineering Techniques

Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Softmax · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer · Linear Warmup With Cosine Annealing