Mixture of Mini Experts: Overcoming the Linear Layer Bottleneck in Multiple Instance Learning
Daniel Shao, Joel Runevic, Richard J. Chen, Drew F.K. Williamson, Ahrong Kim, Andrew H. Song, Faisal Mahmood

TL;DR
This paper introduces MAMMOTH, a multi-head mixture of experts module that enhances multiple instance learning models by replacing the linear layer with a task-specific, low-rank transformation, significantly improving classification performance across various tasks.
Contribution
The paper proposes MAMMOTH, a parameter-efficient mixture of experts module that overcomes the linear layer bottleneck in MIL, leading to substantial performance gains across multiple models and tasks.
Findings
MAMMOTH improves performance in 130 of 152 configurations.
Task-specific transformation has a larger impact than aggregation method.
Simple pooling methods outperform complex ones with MAMMOTH.
Abstract
Multiple Instance Learning (MIL) is the predominant framework for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch's phenotype, yielding synergistic effects with any of the existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multi-head…
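The three-step MIL sequence described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch of the conventional pipeline with a single shared linear layer, not of MAMMOTH itself, whose architecture is not detailed here. The function names, the dimensions, the random stand-in for patch features, and the choice of mean pooling as the aggregator are all assumptions made for the example. The two helper functions at the end compare the weight count of a dense layer with that of a generic rank-r factorization, to show why a low-rank transformation can be parameter-efficient; the paper's actual parameterization may differ.

```python
import random

def linear(x, W, b):
    """Apply y = W x + b to one feature vector (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mean_pool(vectors):
    """Average a bag of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mil_forward(patch_feats, W_task, b_task, W_cls, b_cls):
    # Step 2: one shared linear layer maps every patch feature
    # to a task-specific feature.
    task_feats = [linear(x, W_task, b_task) for x in patch_feats]
    # Step 3: aggregate the bag into one slide-level feature
    # (mean pooling is an assumed, simplest-case aggregator).
    slide_feat = mean_pool(task_feats)
    # Slide-level classifier producing one logit per class.
    return linear(slide_feat, W_cls, b_cls)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def full_params(d_in, d_out):
    """Weights in a dense d_in -> d_out layer (bias ignored)."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Weights in a generic rank-`rank` factorization W = A @ B (bias ignored)."""
    return rank * (d_in + d_out)

# Hypothetical sizes chosen only for illustration.
D_IN, D_TASK, N_CLASSES, N_PATCHES = 16, 8, 3, 50
rng = random.Random(0)

# Step 1 stand-in: random vectors in place of real patch-encoder outputs.
patch_feats = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(N_PATCHES)]

W_task, b_task = rand_matrix(D_TASK, D_IN, rng), [0.0] * D_TASK
W_cls, b_cls = rand_matrix(N_CLASSES, D_TASK, rng), [0.0] * N_CLASSES

logits = mil_forward(patch_feats, W_task, b_task, W_cls, b_cls)
print(len(logits))                        # one logit per class
print(full_params(1024, 512))             # 524288
print(low_rank_params(1024, 512, 16))     # 24576
```

In this sketch every patch passes through the same `W_task`, which is the layer the abstract identifies as a possible bottleneck; a per-patch, phenotype-dependent transformation would replace that single shared matrix.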
Peer Reviews
Decision · ICLR 2026 Poster
- Broad, careful empirical validation across 8 MIL methods and 19 tasks (morphology and biomarkers), with consistent improvements in 130/152 configurations and larger gains on morphological tasks.
- Ablations isolate the contributions of heads, slots, and low-rank experts; interpretability analyses show morphologically coherent routing/specialization; runtime/data-efficiency comparisons suggest favorable trade-offs versus sparse MoE variants.
- Solid training details (optimizer/schedule/regulari
The paper has limitations in its positioning and comparative baselines, which affect the rigor and credibility of its conclusions. The main issues are as follows: 1. Lack of direct comparison with strong, relevant baselines: The proposed module, characterized as “feature re-embedding” or task-layer replacement prior to aggregation, does not include direct comparisons with recent, closely related works in WSI MIL, weakening the claims of its relative importance. Key baselines include: - Feature
1. Originality: The paper highlights a neglected component in the MIL pipeline: the task-specific linear layer. By replacing it with MAMMOTH, a parameter-efficient multihead mixture-of-experts module, the authors introduce a novel perspective. This reframes the performance bottleneck in WSI classification, making the contribution both original and conceptually impactful. 2. Quality: The work is technically rigorous. The authors design MAMMOTH with innovations like soft expert assignment, low-ra
1. Fixed hyperparameter configuration. MAMMOTH is evaluated with a fixed number of experts, heads, and slots across tasks. While effective, this may not reflect task-specific optimal configurations. Exploring adaptive mechanisms that dynamically adjust expert/head/slot counts according to task complexity or data size would be better. 2. Limited scope of tasks. Although MAMMOTH was validated on 19 tasks spanning morphology and biomarker prediction, the paper does not extend to other clinically criti
1. Novelty: There is good methodological novelty in the context of WSI classification / MIL, as the specifics of the MoE mechanism are original. 2. Clarity: The method is clearly described and the paper is well positioned w.r.t. the literature. 3. Significance: Slot-based pooling could be beneficial to interpretability, by reducing bag size and allowing each patch embedding in the initial bag to be linked to its most similar morphological concept. 4. Quality: There is a thorough ablation study +
One concern with this paper is that the proposed MoE layer adds complexity to the MIL strategy, despite careful control of parameter efficiency. This needs to be justified by strong and robust performance gains. For some datasets where no mention is made of cross-validation, e.g. BRACS and PANDA, the exact experimental setup is still unclear to me: is it a single train/test split, 1 run, 1000 bootstraps of the test set on this single run? As very large differences can occur across multiple run
Taxonomy
Topics: AI in cancer detection · Digital Imaging for Blood Diseases · Cell Image Analysis Techniques
