MatMamba: A Matryoshka State Space Model

Abhinav Shukla; Sai Vemprala; Aditya Kusupati; Ashish Kapoor

arXiv:2410.06718·cs.LG·October 10, 2024


TL;DR

MatMamba combines Matryoshka-style nested learning with the Mamba2 state space model, so that a single trained model yields a family of smaller submodels for adaptive language and image inference.

Contribution

This work presents MatMamba, a nested state space model whose Mamba2 block contains nested dimensions, so that multiple model sizes are trained jointly and can be extracted from one set of weights.

Findings

01

Scales comparably to Transformers on ImageNet and FineWeb.

02

Language and image models are trained at sizes from 35M to 1.4B parameters.

03

Retains the more efficient inference of SSMs relative to Transformers.

Abstract

State Space Models (SSMs) like Mamba2 are a promising alternative to Transformers, with faster theoretical training and inference times -- especially for long context lengths. Recent work on Matryoshka Representation Learning -- and its application to Transformer backbones in works like MatFormer -- showed how to introduce nested granularities of smaller submodels in one universal elastic model. In this work, we present MatMamba: a state space model which combines Matryoshka-style learning with Mamba2, by modifying the block to contain nested dimensions to enable joint training and adaptive inference. MatMamba allows for efficient and adaptive deployment across various model sizes. We train a single large MatMamba model and are able to get a number of smaller nested models for free -- while maintaining or improving upon the performance of a baseline smaller model trained from scratch.…
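The nesting idea in the abstract can be illustrated with a minimal sketch. It is not the MatMamba implementation (see the official repository for that): it assumes a simplified block with one input projection, a stand-in nonlinearity in place of the SSM, and one output projection. The names `nested_forward`, `joint_loss`, and the widths `[2, 4, 8]` are hypothetical and chosen only for illustration. A submodel of width `m` reuses the first `m` inner channels of the shared weights, and joint training averages the loss over all widths.

```python
import random

def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]

def nested_forward(w_in, w_out, x, m):
    """Run the simplified block using only the first m inner channels.

    w_in:  d_inner x d_model (each row is one inner channel)
    w_out: d_model x d_inner (each column is one inner channel)
    """
    hidden = matvec(w_in[:m], x)                        # first m rows of w_in
    hidden = [max(0.0, h) for h in hidden]              # stand-in nonlinearity, not an SSM
    return matvec([row[:m] for row in w_out], hidden)   # first m columns of w_out

def joint_loss(w_in, w_out, x, target, widths):
    """Average squared error over all nested widths, sharing one weight set."""
    total = 0.0
    for m in widths:
        y = nested_forward(w_in, w_out, x, m)
        total += sum((a - b) ** 2 for a, b in zip(y, target))
    return total / len(widths)

random.seed(0)
d_model, d_inner = 4, 8
w_in = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_inner)]
w_out = [[random.uniform(-1, 1) for _ in range(d_inner)] for _ in range(d_model)]
x = [random.uniform(-1, 1) for _ in range(d_model)]
target = [0.0] * d_model

widths = [2, 4, 8]  # hypothetical nested granularities
outputs = {m: nested_forward(w_in, w_out, x, m) for m in widths}
loss = joint_loss(w_in, w_out, x, target, widths)
```

Because every width reads a prefix of the same matrices, extracting a smaller model after training amounts to slicing the weights; no separate parameters are stored per size.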

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- This work validates the scalability of Matryoshka Representation Learning by demonstrating its applicability to state-space models (SSMs). It considers both Mamba-vision and Mamba-language models, confirming effectiveness across both tasks.
- The methodology and problem setting are well-established and well-motivated.

Weaknesses

- For the MatMamba-LM model family, only evaluation loss is reported. The submission would be strengthened by including evaluations on downstream tasks as well.

Reviewer 02 · Rating 5 · Confidence 3

Strengths

(1) Adaptive Inference Resource: MatMamba allows the extraction of multiple nested submodels from one single model, enabling versatile deployment options, from edge devices to cloud settings, without the need for retraining. By using the Mix’n’Match approach, MatMamba optimizes for various performance-compute trade-offs, allowing resource allocation adjustments based on current needs.
(2) Scalability and Performance: The model exhibits scalability on par with transformers and baseline Mamba2 mod

Weaknesses

(1) Dependency on Explicitly Trained Granularities for Optimal Performance: While MatMamba’s Mix’n’Match approach offers flexibility, untrained granularities do not perform as effectively as explicitly optimized ones, showing degradation in performance and accuracy.
(2) Limited Exploration of Self-Distillation Techniques: The paper mentions that self-distillation or other techniques could further enhance interpolation accuracy in untrained granularities, suggesting room for improvement in the tr

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The paper is very well-written, with the main ideas and results presented clearly. It was a pleasure to read.
* Elastic networks have shown significant promise for Transformer networks, since they can be used to generate tailored networks for specific deployment scenarios at no additional training cost (once elastic training is complete). The paper makes a contribution in this relevant and promising direction.
* Evaluation and results across vision and language look strong overall.

Weaknesses

* There doesn’t seem to be a systematic way of selecting smaller sub-networks given a deployment constraint. For a given parameter or latency budget, should all dimensions be proportionally reduced? How is this proportion decided?
* Novelty: this work appears to be a straightforward application of Matryoshka principles to the SSM domain; apart from the obvious changes needed for SSMs, the training algorithm remains nearly identical to the MatFormer work.
* It’s not clear if the scaling trends co

Code & Models

Repositories

scaledfoundations/matmamba · PyTorch · Official

Taxonomy

Topics: Russia and Soviet political economy · Economic Development and Digital Transformation

Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings