Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?
Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman

TL;DR
This paper introduces MoE-ViT, a sparse mixture-of-experts approach for multi-channel vision transformers that cuts computational cost by modeling only selected channel interactions while maintaining or improving performance.
Contribution
It proposes a novel MoE-based architecture for multi-channel ViTs that efficiently models only essential channel interactions, addressing computational bottlenecks.
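To make the idea of sparse expert selection concrete, here is a minimal sketch of generic top-k gating as commonly used in sparse mixture-of-experts layers. It is not taken from the paper: the function name `top_k_gate`, the example logits, and the choice of k are all hypothetical, and how MoE-ViT actually defines or routes to its experts is not described in this summary.

```python
import math


def top_k_gate(logits, k):
    """Generic sparse-MoE gating sketch (hypothetical helper, not the paper's code).

    Keeps the k experts with the highest router logits and applies a softmax
    over those k only, so all other experts receive zero weight and can be skipped.
    Returns a dict mapping expert index -> gate weight.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


if __name__ == "__main__":
    # Hypothetical router scores for four experts; only two are activated.
    print(top_k_gate([0.1, 2.0, -1.0, 2.0], k=2))
```

Because only k of the available experts are evaluated per input, compute scales with k rather than with the total number of experts, which is the general efficiency argument behind sparse MoE designs.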
Findings
Achieves significant efficiency improvements in multi-channel ViTs.
Maintains or improves performance on real-world datasets.
Reduces FLOPs and training costs without sacrificing accuracy.
Abstract
Vision Transformers (ViTs) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block: channel-wise comparisons lead to quadratic growth in attention, resulting in excessive FLOPs and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: "Is it necessary to model all channel interactions?" Inspired by the philosophy of Sparse Mixture-of-Experts (MoE),…
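The quadratic bottleneck described in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below assumes one token per (channel, patch) pair and full self-attention over all tokens; the function names and the patch count of 196 (a 224x224 image with 16x16 patches) are illustrative assumptions, not values from the paper.

```python
def attention_pair_count(num_channels: int, patches_per_channel: int) -> int:
    """Query-key pairs in one full self-attention layer, per head.

    Assumes channel-wise tokenization: one token per (channel, patch).
    """
    seq_len = num_channels * patches_per_channel
    return seq_len * seq_len


def cross_channel_fraction(num_channels: int) -> float:
    """Fraction of query-key pairs that connect tokens from different channels.

    Same-channel pairs number C * P^2 out of (C * P)^2 in total, i.e. 1/C.
    """
    return 1.0 - 1.0 / num_channels


if __name__ == "__main__":
    patches = 196  # hypothetical: 224x224 image split into 16x16 patches
    base = attention_pair_count(1, patches)
    for channels in (1, 3, 5, 8):
        pairs = attention_pair_count(channels, patches)
        print(
            f"channels={channels} pairs={pairs} "
            f"growth={pairs // base}x "
            f"cross-channel share={cross_channel_fraction(channels):.2f}"
        )
```

Under these assumptions, going from 1 to 5 channels multiplies the number of attention pairs by 25, and 80% of those pairs are cross-channel, which is the portion the paper's question targets.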
Taxonomy
Topics: Advanced Neural Network Applications · Cell Image Analysis Techniques · Domain Adaptation and Few-Shot Learning
