Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul

TL;DR
The paper introduces Mixture of Nested Experts (MoNE), a framework of nested experts that lowers the compute cost of visual processing by routing tokens in priority order, so that redundant tokens pass through cheaper experts. It reports similar accuracy with an over two-fold reduction in inference-time compute.
Contribution
MoNE presents a nested expert structure that, given a compute budget, adaptively decides which tokens receive more processing, improving efficiency without sacrificing performance.
Findings
Over two-fold reduction in inference-time compute.
Maintains strong performance across different compute budgets.
Validated on image and video datasets: ImageNet-21K, Kinetics400, and Something-Something-v2.
Abstract
The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, and thus redundant tokens are processed through cheaper nested experts. Using this framework, we achieve equivalent performance as the…
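The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each nested expert is taken to be the top-left d x d block of one shared weight matrix, the router is reduced to a single importance score per token, features beyond d are passed through unchanged, and cost is counted as (d/D)^2 of a full matrix multiply. The names nested_linear, route_by_priority, and relative_cost are hypothetical.

```python
def nested_linear(x, W, d):
    """Apply only the top-left d x d block of W to the first d features of x.

    Assumption of this sketch: features beyond d are passed through unchanged.
    """
    head = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]
    return head + x[d:]


def route_by_priority(scores, capacities):
    """Assign tokens to experts in priority order.

    scores: one router score per token (higher means more important).
    capacities: token slots per expert, listed from largest expert to smallest.
    Returns a list holding the expert index chosen for each token.
    """
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    assignment = [len(capacities) - 1] * len(scores)  # default: smallest expert
    pos = 0
    for expert, cap in enumerate(capacities):
        for t in order[pos:pos + cap]:
            assignment[t] = expert
        pos += cap
    return assignment


def relative_cost(assignment, dims, full_dim):
    """Average matmul cost per token relative to always using the full expert."""
    return sum((dims[e] / full_dim) ** 2 for e in assignment) / len(assignment)


# Hypothetical setup: 8 tokens, full width 8, three nested experts.
dims = [8, 4, 2]            # expert widths, largest first
capacities = [2, 2, 4]      # the compute budget expressed as slots per expert
scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.3, 0.8, 0.4]

assignment = route_by_priority(scores, capacities)
cost = relative_cost(assignment, dims, full_dim=8)

# A token sent to the width-2 expert only has its first 2 features transformed.
W = [[2 if i == j else 0 for j in range(4)] for i in range(4)]
small_out = nested_linear([1, 1, 1, 1], W, d=2)
```

In this toy setup the two highest-scoring tokens use the full expert and the four lowest-scoring tokens use the smallest one, so the average matmul cost falls to roughly a third of processing every token at full width. Changing capacities is how a different compute budget would be expressed in this sketch.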
Taxonomy
Topics: Data Visualization and Analytics · Neural Networks and Applications · Urban Planning and Valuation
Methods: Attention Is All You Need · Label Smoothing · Adam · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention
