Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
Mohammed Elbtity, Peyton Chandarana, Ramtin Zand

TL;DR
The paper introduces Flex-TPU, a TPU architecture whose dataflow can be reconfigured at runtime for each layer, yielding up to 2.75x higher performance than conventional fixed-dataflow TPUs with minimal area and power overhead.
Contribution
Develops the first runtime-reconfigurable dataflow TPU, which can switch dataflow on a per-layer basis to optimize performance.
Findings
Achieves up to 2.75x performance improvement over a conventional TPU.
Maintains minimal area and power overheads.
Validates effectiveness across multiple ML workloads.
Abstract
Tensor processing units (TPUs) are among the best-known machine learning (ML) accelerators, used at large scale in data centers as well as in tiny ML applications. TPUs offer several advantages over conventional ML accelerators such as graphics processing units (GPUs), because they are designed specifically to perform the multiply-accumulate (MAC) operations required by the matrix-matrix and matrix-vector multiplications that dominate the execution of deep neural networks (DNNs). These advantages include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow: input-stationary, output-stationary, or weight-stationary. This can limit the…
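The idea of choosing a dataflow per layer can be illustrated with a minimal sketch. The cycle model below is a deliberately simplified assumption and is not taken from the paper: `toy_cycles`, `pick_dataflow`, the 8x8 array size, and the two layer shapes are all hypothetical names and values chosen for illustration. Each layer is treated as an (m x k) by (k x n) matrix multiply, and each dataflow pins a different pair of dimensions onto the array while streaming the third.

```python
from math import ceil

def toy_cycles(m, k, n, rows, cols):
    """Rough cycle estimates for an (m x k) @ (k x n) matmul on a rows x cols
    systolic array. Toy model (an assumption, not the paper's): cycles =
    number of tiles * (streamed dimension + fill/drain skew [+ preload])."""
    skew = rows + cols - 2  # pipeline fill and drain
    return {
        # Output stationary: m and n are mapped onto the array, k is streamed.
        "OS": ceil(m / rows) * ceil(n / cols) * (k + skew),
        # Weight stationary: k and n are mapped, weights preloaded, m streamed.
        "WS": ceil(k / rows) * ceil(n / cols) * (rows + m + skew),
        # Input stationary: k and m are mapped, inputs preloaded, n streamed.
        "IS": ceil(k / rows) * ceil(m / cols) * (rows + n + skew),
    }

def pick_dataflow(m, k, n, rows=8, cols=8):
    """Return the dataflow with the lowest estimated cycle count."""
    costs = toy_cycles(m, k, n, rows, cols)
    return min(costs, key=costs.get)

# Hypothetical layer shapes (m, k, n): a convolution-like layer with many
# output pixels and a fully connected layer with a single input vector.
layers = {"conv_like": (1024, 27, 16), "fc_like": (1, 512, 512)}
plan = {name: pick_dataflow(*dims) for name, dims in layers.items()}
print(plan)
```

Under this toy model the two layers prefer different dataflows, which is the situation a fixed-dataflow array cannot exploit. A real selection would rely on measured or simulated cycle counts rather than this closed-form estimate.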
Taxonomy
Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
