TL;DR
LoopTree systematically explores an expanded fused-layer dataflow design space for DNN accelerators, enabling more efficient trade-offs between buffer capacity, energy, and latency, leading to improved accelerator designs.
Contribution
It introduces a comprehensive design space, a taxonomy, and a model for evaluating fused-layer dataflow accelerators, covering more of the space than prior, more limited explorations.
Findings
Achieves up to a 10× reduction in buffer capacity for the same off-chip transfers.
The model is validated against prior architectures with a worst-case error of 4%.
Exploring the larger design space yields more efficient accelerator designs.
Abstract
Latency and energy consumption are key metrics in the performance of deep neural network (DNN) accelerators. A significant factor contributing to latency and energy is data transfers. One method to reduce transfers of data is reusing data when multiple operations use the same data. Fused-layer accelerators reuse data across operations in different layers by retaining intermediate data in on-chip buffers, which has been shown to reduce energy consumption and latency. Moreover, the intermediate data is often tiled (i.e., broken into chunks) to reduce the on-chip buffer capacity required to reuse the data. Because on-chip buffer capacity is frequently more limited than computation units, fused-layer dataflow accelerators may also recompute certain parts of the intermediate data instead of retaining them in a buffer. Achieving efficient trade-offs between on-chip buffer capacity, off-chip…
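The tiling and retain-versus-recompute trade-off described in the abstract can be illustrated with a rough first-order estimate. The sketch below is not LoopTree's model. It assumes two fused layers where the second is a K×K convolution with stride 1 and no padding, output tiles processed in row-major order, and tile sizes that divide the output evenly. The function name `fused_tile_costs` and all of its parameters are hypothetical, and the retained-overlap buffer size is a simple upper-bound estimate.

```python
def fused_tile_costs(out_h, out_w, k, tile_h, tile_w, retain):
    """Estimate (buffer_elements, intermediate_computes) for the intermediate
    feature map between two fused layers. Hypothetical first-order model.

    out_h, out_w   : output size of the second layer
    k              : kernel size of the second layer (stride 1, no padding)
    tile_h, tile_w : output tile size (assumed to divide out_h and out_w)
    retain         : True  -> keep overlapping intermediate rows on chip
                     False -> recompute overlapping data for every tile
    """
    assert out_h % tile_h == 0 and out_w % tile_w == 0, "tiles must divide output"

    # The intermediate map feeding a k x k convolution is (k - 1) larger per side.
    mid_h, mid_w = out_h + k - 1, out_w + k - 1

    # Intermediate data needed to produce one output tile.
    tile_footprint = (tile_h + k - 1) * (tile_w + k - 1)
    n_tiles = (out_h // tile_h) * (out_w // tile_w)

    if retain:
        # Keep the last (k - 1) intermediate rows across a full row of tiles,
        # so every intermediate value is computed exactly once.
        line_buffer = (k - 1) * mid_w
        return tile_footprint + line_buffer, mid_h * mid_w

    # Recompute: the buffer holds one tile only, but overlaps are computed again.
    return tile_footprint, n_tiles * tile_footprint


# Untiled baseline: a single tile covering the whole output.
untiled = fused_tile_costs(56, 56, 3, 56, 56, retain=False)
retained = fused_tile_costs(56, 56, 3, 8, 8, retain=True)
recomputed = fused_tile_costs(56, 56, 3, 8, 8, retain=False)

print("untiled   (buffer, computes):", untiled)
print("retained  (buffer, computes):", retained)
print("recompute (buffer, computes):", recomputed)
```

Under these assumptions, tiling shrinks the intermediate buffer by more than an order of magnitude in this toy case, and choosing recomputation over retention shrinks it further at the cost of extra computation. A full design-space exploration would also account for off-chip transfers, energy, and latency, which this sketch leaves out.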
