ButterflyViT: 354$\times$ Expert Compression for Edge Vision Transformers

Aryan Karmore

arXiv:2603.06746 · cs.CV · March 10, 2026

TL;DR

ButterflyViT is an expert compression method for Mixture of Experts Vision Transformers that cuts expert memory by representing each expert as a geometric reorientation of shared parameters, enabling edge deployment with minimal accuracy loss.

Contribution

The paper proposes ButterflyViT, a geometric expert compression technique that achieves sub-linear memory scaling for Vision Transformers, addressing the linear memory bottleneck on edge devices.

Findings

01. Achieves 354× memory reduction on CIFAR-100 with 64 experts.

02. Incurs negligible accuracy loss despite the compression.

03. Allows multiple experts to run on memory-constrained edge devices.

Abstract

Deploying sparse Mixture of Experts (MoE) Vision Transformers remains a challenge because expert memory scales linearly with the number of experts. A conventional MoE stores $N_{E}$ independent expert weight matrices, requiring $O(N_{E} \cdot d^{2})$ memory, which exceeds the memory budget of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyViT, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified, shared quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned rotations to a shared ternary prototype, the experts together require $O(d_{model} \cdot d_{ff} + N_{E} \cdot n_{\ell} \cdot d)$ memory, which is sub-linear in the number…
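The idea of deriving many experts from one shared ternary matrix plus cheap per-expert rotations can be illustrated with a small sketch. This is not the paper's implementation: the butterfly layout (log2(d) layers of paired Givens rotations), the choice to rotate only the expert input, and every function name below are assumptions made for illustration. The two counting helpers simply evaluate the expressions from the abstract with constants dropped, reading $n_{\ell}$ as the number of rotation layers (also an assumption).

```python
import math
import random


def butterfly_rotate(x, angles):
    """Apply len(angles) butterfly layers of Givens rotations to x.

    Assumed layout: layer k pairs index i with i + 2**k, and angles[k]
    holds len(x) // 2 angles. len(x) must be a power of two and
    len(angles) at most log2(len(x)).
    """
    d = len(x)
    y = list(x)
    for k, layer in enumerate(angles):
        stride = 1 << k
        out = list(y)
        pair = 0
        for i in range(d):
            if (i // stride) % 2 == 0:  # i is the lower index of a pair
                j = i + stride
                c, s = math.cos(layer[pair]), math.sin(layer[pair])
                out[i] = c * y[i] - s * y[j]
                out[j] = s * y[i] + c * y[j]
                pair += 1
        y = out
    return y


def make_ternary(rows, cols, rng):
    """Random stand-in for a shared prototype with entries in {-1, 0, 1}."""
    return [[rng.choice((-1, 0, 1)) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def expert_forward(w_shared, angles, x):
    """Hypothetical expert: rotate the input, then apply the shared matrix."""
    return matvec(w_shared, butterfly_rotate(x, angles))


def dense_values(n_experts, d_model, d_ff):
    """Stored values for independent experts: N_E * d_model * d_ff."""
    return n_experts * d_model * d_ff


def shared_values(n_experts, d_model, d_ff, n_layers, d):
    """Stored values per the abstract: d_model * d_ff + N_E * n_l * d."""
    return d_model * d_ff + n_experts * n_layers * d


# Toy dimensions chosen only for the demo, not taken from the paper.
rng = random.Random(0)
d_model, d_ff, n_experts = 8, 16, 4
n_layers = d_model.bit_length() - 1  # log2(d_model) for a power of two

w_shared = make_ternary(d_ff, d_model, rng)
expert_angles = [
    [[rng.uniform(-math.pi, math.pi) for _ in range(d_model // 2)]
     for _ in range(n_layers)]
    for _ in range(n_experts)
]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
outputs = [expert_forward(w_shared, a, x) for a in expert_angles]

print("dense :", dense_values(n_experts, d_model, d_ff))
print("shared:", shared_values(n_experts, d_model, d_ff, n_layers, d_model))
```

In this sketch each extra expert adds only its rotation angles, while the shared matrix is stored once, so the per-expert cost grows with $n_{\ell} \cdot d$ rather than $d^{2}$. Because every layer is a product of disjoint plane rotations, the transform is orthogonal and preserves the input norm.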

Taxonomy

Topics: Neural Networks and Reservoir Computing · Stochastic Gradient Optimization Techniques · Ferroelectric and Negative Capacitance Devices