Dynamic sparsity in tree-structured feed-forward layers at scale

Reza Sedghi; Robin Schiewer; Anand Subramoney; David Kappel

arXiv:2604.08565·cs.CL·April 13, 2026

TL;DR

This paper studies tree-structured sparse feed-forward layers as drop-in replacements for transformer MLP blocks. The layers perform conditional computation that activates fewer than 5% of the feed-forward units per token while matching dense baselines.

Contribution

The paper reports the first demonstration that tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering at scales beyond 1B parameters while matching dense baselines.

Findings

01

Models activate fewer than 5% of the feed-forward block's units per token.

02

Tree-structured sparse models match dense baselines under controlled training and fine-tuning protocols.

03

Emergent auto-pruning reduces unused paths over training.

Abstract

At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect:…
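
The hard hierarchical routing described in the abstract can be illustrated with a small sketch. The abstract does not specify the layer's exact form, so everything below is an assumption made for illustration: a complete binary tree in which each node holds one hidden unit, the sign of that unit's pre-activation selects which child to visit next, and ReLU is the nonlinearity. The names `make_tree` and `tree_ffn` are hypothetical. Because the routing decision reuses the unit's own pre-activation, no separate router network is involved, and only one root-to-leaf path is evaluated per token.

```python
import random


def make_tree(depth, d_in, d_out, seed=0):
    """Hypothetical parameters: one input and one output weight vector per node.

    Nodes of a complete binary tree are stored in heap order, so the
    children of node n sit at indices 2n + 1 and 2n + 2.
    """
    rng = random.Random(seed)
    n_nodes = 2 ** (depth + 1) - 1
    w_in = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_nodes)]
    w_out = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(n_nodes)]
    return w_in, w_out


def tree_ffn(x, w_in, w_out, depth):
    """Evaluate a single root-to-leaf path for one token vector x.

    Returns the layer output and the list of visited node indices.
    """
    y = [0.0] * len(w_out[0])
    node, path = 0, []
    for _ in range(depth + 1):
        path.append(node)
        # Pre-activation of this node's single hidden unit.
        a = sum(wi * xi for wi, xi in zip(w_in[node], x))
        h = max(a, 0.0)  # ReLU (assumed; the source does not name the activation)
        for j, wo in enumerate(w_out[node]):
            y[j] += h * wo
        # Hard routing: the sign of the pre-activation picks the child.
        node = 2 * node + (2 if a > 0 else 1)
    return y, path


depth, d_model = 7, 16
w_in, w_out = make_tree(depth, d_model, d_model)
token_rng = random.Random(1)
x = [token_rng.gauss(0, 1) for _ in range(d_model)]
y, path = tree_ffn(x, w_in, w_out, depth)

# Fraction of the tree's units that were evaluated for this token.
active_fraction = len(path) / len(w_in)
print(f"evaluated {len(path)} of {len(w_in)} units ({active_fraction:.1%})")
```

With depth 7, this toy tree evaluates 8 of its 255 units for a token, about 3.1%, which shows how a path-based layer can stay under a 5% activation budget. The paper's actual tree shape, width, and activation function may differ.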
