Dynamic sparsity in tree-structured feed-forward layers at scale
Reza Sedghi, Robin Schiewer, Anand Subramoney, David Kappel

TL;DR
This paper studies sparse, tree-structured feed-forward layers as drop-in replacements for transformer MLP blocks, enabling conditional computation that activates fewer than 5% of the feed-forward units per token while matching dense baselines.
Contribution
The paper demonstrates for the first time that tree-structured sparsity can be applied to autoregressive language models beyond 1B parameters while matching dense baselines.
Findings
Models activate fewer than 5% of the feed-forward block's units per token.
Tree-structured sparse models match dense baselines under controlled training and fine-tuning protocols.
Emergent auto-pruning reduces unused paths over training.
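To give a sense of scale for the "fewer than 5%" figure, here is a small arithmetic sketch. It rests on assumptions not stated in this summary: a complete binary tree with one unit per node, where a token activates exactly the units on a single root-to-leaf path. The function names `active_fraction` and `min_depth_below` are hypothetical, and the paper's actual tree shape may differ.

```python
def active_fraction(depth: int) -> float:
    """Fraction of units touched by one root-to-leaf path.

    Assumption (not from the paper): a complete binary tree with one unit
    per node and depth counted in edges, so the tree holds
    2**(depth + 1) - 1 units and a path visits depth + 1 of them.
    """
    total_units = 2 ** (depth + 1) - 1
    return (depth + 1) / total_units


def min_depth_below(threshold: float) -> int:
    """Smallest depth whose active fraction falls strictly below threshold."""
    depth = 0
    while active_fraction(depth) >= threshold:
        depth += 1
    return depth


if __name__ == "__main__":
    d = min_depth_below(0.05)
    print(d, active_fraction(d))
```

Under these assumptions, a depth-6 tree activates 7 of 127 units (about 5.5%) and a depth-7 tree activates 8 of 255 (about 3.1%), so the fraction drops quickly as the tree deepens.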
Abstract
At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect:…
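The phrase "hard hierarchical routing without a separate router network" can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a complete binary tree in which each node is one hidden unit, the sign of that unit's own pre-activation picks the child to visit next, and a GELU is applied to the visited units. The class `TreeFFN`, its parameters, and the random initialization are all hypothetical.

```python
import math
import random


def gelu(z: float) -> float:
    # Exact GELU via the error function; the activation choice is an assumption.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


class TreeFFN:
    """Illustrative tree-structured feed-forward layer (hypothetical design)."""

    def __init__(self, d_model: int, depth: int, seed: int = 0):
        rng = random.Random(seed)
        self.depth = depth
        self.n_nodes = 2 ** (depth + 1) - 1  # complete binary tree, heap layout
        # One input and one output weight vector per node (i.e. per hidden unit).
        self.w_in = [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
                     for _ in range(self.n_nodes)]
        self.w_out = [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
                      for _ in range(self.n_nodes)]

    def forward(self, x):
        y = [0.0] * len(x)
        path = []
        node = 0  # start at the root
        for _ in range(self.depth + 1):
            path.append(node)
            pre = sum(w * xi for w, xi in zip(self.w_in[node], x))
            act = gelu(pre)
            for j, w in enumerate(self.w_out[node]):
                y[j] += act * w
            # Hard routing: the unit's own pre-activation decides the branch,
            # so no separate router network is evaluated.
            node = 2 * node + (2 if pre > 0 else 1)
        return y, path


if __name__ == "__main__":
    layer = TreeFFN(d_model=8, depth=7)
    out, visited = layer.forward([0.1] * 8)
    print(len(visited), "of", layer.n_nodes, "units evaluated")
```

In this sketch only the `depth + 1` units on the chosen path are ever evaluated, which is where the per-token savings would come from. A practical implementation would batch tokens and handle the non-differentiable branch decision during training, both of which are omitted here.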
