Fast Feedforward Networks
Peter Belcak, Roger Wattenhofer

TL;DR
The paper introduces fast feedforward (FFF) networks, a log-time alternative to feedforward layers that cuts inference cost, running faster than both standard feedforward and mixture-of-experts networks while preserving most of their predictive performance.
Contribution
The paper presents FFF, a new architecture that breaks the linear link between a layer's size and its inference cost, giving log-time inference, and evaluates it in vision transformers.
Findings
FFFs are up to 220x faster than feedforward networks.
FFFs are up to 6x faster than mixture-of-experts networks.
Using as little as 1% of layer neurons for inference in vision transformers, FFFs preserve 94.2% of predictive performance.
Abstract
We break the linear link between the layer size and its inference cost by introducing the fast feedforward (FFF) architecture, a log-time alternative to feedforward networks. We demonstrate that FFFs are up to 220x faster than feedforward networks, up to 6x faster than mixture-of-experts networks, and exhibit better training properties than mixtures of experts thanks to noiseless conditional execution. Pushing FFFs to the limit, we show that they can use as little as 1% of layer neurons for inference in vision transformers while preserving 94.2% of predictive performance.
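To make the "log-time" claim concrete, here is a minimal sketch of tree-routed inference in plain Python. It assumes a balanced binary tree of single-neuron routing nodes with a small feedforward leaf at each endpoint, and hard left/right decisions at inference time. The names `make_fff` and `fff_forward`, the heap-style node layout, and the random (untrained) weights are illustrative assumptions, not the authors' implementation.

```python
import random


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def make_fff(depth, in_dim, leaf_width, out_dim, seed=0):
    """Build a toy tree with random weights (hypothetical helper, untrained)."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    n_nodes = 2 ** depth - 1   # internal routing neurons, stored in heap order
    n_leaves = 2 ** depth      # one small feedforward block per leaf
    nodes = [(vec(in_dim), rng.uniform(-1.0, 1.0)) for _ in range(n_nodes)]
    leaves = [
        {
            "w1": [vec(in_dim) for _ in range(leaf_width)],
            "w2": [vec(leaf_width) for _ in range(out_dim)],
        }
        for _ in range(n_leaves)
    ]
    return {"depth": depth, "nodes": nodes, "leaves": leaves}


def fff_forward(model, x):
    """Route x down the tree, then evaluate only the leaf that was reached."""
    idx = 0        # start at the root
    visited = 0    # number of routing neurons evaluated
    for _ in range(model["depth"]):
        w, b = model["nodes"][idx]
        go_right = dot(w, x) + b > 0.0        # hard decision, no sampling noise
        idx = 2 * idx + (2 if go_right else 1)  # heap-order child index
        visited += 1
    leaf = model["leaves"][idx - len(model["nodes"])]
    hidden = [max(0.0, dot(row, x)) for row in leaf["w1"]]  # ReLU hidden layer
    out = [dot(row, hidden) for row in leaf["w2"]]
    return out, visited


model = make_fff(depth=3, in_dim=5, leaf_width=4, out_dim=2)
out, visited = fff_forward(model, [0.1, -0.2, 0.3, 0.0, 0.5])
print(len(model["leaves"]), visited, len(out))
```

In this sketch the layer holds 2^depth leaves, but a single input touches only `depth` routing neurons plus one leaf, so the per-input cost grows with the logarithm of the number of leaves rather than with the total layer width.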
Taxonomy
Topics: Neural Networks and Applications · Brain Tumor Detection and Classification · CCD and CMOS Imaging Sensors
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · Fast Feedforward Networks
