Efficient Accelerator for Dilated and Transposed Convolution with Decomposition
Kuo-Wei Chang and Tian-Sheuan Chang

TL;DR
This paper introduces a decomposition-based accelerator design for dilated and transposed convolutions that skips redundant computations, so both convolution types run efficiently on existing dense CNN hardware.
Contribution
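It presents a decomposition method that splits the input (for dilated convolution) or the weights (for transposed convolution) so that both run efficiently on dense CNN hardware, without the convolution-specific designs or complex reconfiguration control of earlier accelerators.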
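As an illustration of input decomposition for dilated convolution, the following minimal sketch works in 1-D with plain Python lists. It is not the paper's hardware design: the function names (`dense_conv1d`, `dilated_naive`, `dilated_decomposed`) and the 1-D, no-padding setup are assumptions made for this example. The naive path inserts zeros into the kernel and runs it densely, while the decomposed path splits the input into `d` interleaved phases and runs the original compact kernel on each phase.

```python
def dense_conv1d(x, w):
    """Dense 'valid' 1-D correlation: y[n] = sum_k w[k] * x[n + k]."""
    n_out = len(x) - len(w) + 1
    return [sum(w[k] * x[n + k] for k in range(len(w))) for n in range(n_out)]


def dilated_naive(x, w, d):
    """Naive execution: insert d-1 zeros between weights, then run densely.

    Most multiplications hit an inserted zero and are wasted.
    """
    w_dilated = [0] * (d * (len(w) - 1) + 1)
    for k, wk in enumerate(w):
        w_dilated[d * k] = wk
    return dense_conv1d(x, w_dilated)


def dilated_decomposed(x, w, d):
    """Decomposed execution: split the input into d phases, convolve each
    phase with the compact kernel, and interleave the results."""
    n_out = len(x) - d * (len(w) - 1)
    y = [0] * max(n_out, 0)
    for p in range(d):
        phase = x[p::d]                  # samples p, p+d, p+2d, ...
        y_phase = dense_conv1d(phase, w)  # ordinary dense convolution
        for m, value in enumerate(y_phase):
            y[p + d * m] = value          # output n = p + d*m
    return y


x = list(range(1, 11))   # 1..10
w = [1, 2, 3]
print(dilated_naive(x, w, 2))
print(dilated_decomposed(x, w, 2))
```

Both calls print the same output. In this sketch the naive path performs `d * (K - 1) + 1` multiplications per output for a kernel of length `K`, while the decomposed path performs only `K`. The 2-D case applies the same split along both spatial axes.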
Findings
Cuts cycle counts by 87.8% for the ENet case
Achieves an 8.2x speedup over naive execution for the ENet case
Compatible with existing dense CNN hardware
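The cycle reduction and the speedup are two views of the same measurement: removing a fraction r of the cycles gives a speedup of 1 / (1 - r). A quick check, where `speedup_from_cycle_reduction` is a hypothetical helper name introduced only for this example:

```python
def speedup_from_cycle_reduction(reduction):
    """Speedup obtained when a fraction `reduction` of the cycles is removed."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# 87.8% fewer cycles leaves 12.2% of the original cycle count.
print(round(speedup_from_cycle_reduction(0.878), 1))  # 8.2
```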
Abstract
Hardware acceleration for dilated and transposed convolution enables real-time execution of related tasks like segmentation, but current designs are specific to these convolutional types or suffer from complex control for reconfigurable designs. This paper presents a design that decomposes input or weight for dilated and transposed convolutions respectively to skip redundant computations and thus executes efficiently on existing dense CNN hardware as well. The proposed architecture can cut down 87.8% of the cycle counts to achieve 8.2X speedup over a naive execution for the ENet case.
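The weight decomposition for transposed convolution can be sketched in the same spirit. This is a minimal 1-D illustration with plain Python lists, not the paper's architecture: the function names and the no-padding setup are assumptions made for the example. The naive path inserts zeros between input samples before a dense convolution, and the decomposed path splits the kernel into `s` sub-kernels (one per output phase) and convolves the original input with each.

```python
def dense_full_conv1d(x, w):
    """Dense 'full' 1-D convolution: y[i + k] += x[i] * w[k]."""
    y = [0] * (len(x) + len(w) - 1)
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            y[i + k] += xi * wk
    return y


def transposed_naive(x, w, s):
    """Naive execution: insert s-1 zeros between inputs, then run densely.

    Most multiplications hit an inserted zero and are wasted.
    """
    x_up = [0] * (s * (len(x) - 1) + 1)
    for i, xi in enumerate(x):
        x_up[s * i] = xi
    return dense_full_conv1d(x_up, w)


def transposed_decomposed(x, w, s):
    """Decomposed execution: split the kernel into s sub-kernels, convolve
    the original input with each, and interleave the results."""
    y = [0] * (s * (len(x) - 1) + len(w))
    for r in range(s):
        sub_kernel = w[r::s]              # weights r, r+s, r+2s, ...
        if not sub_kernel:
            continue                      # this output phase stays zero
        y_phase = dense_full_conv1d(x, sub_kernel)
        for j, value in enumerate(y_phase):
            y[s * j + r] = value          # output n = s*j + r
    return y


x = [1, 2, 3]
w = [1, 2, 3]
print(transposed_naive(x, w, 2))
print(transposed_decomposed(x, w, 2))
```

Both calls print the same output. Every multiplication in the decomposed path involves a real input sample and a real weight, which is the sense in which redundant computations are skipped.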
Taxonomy
Methods: 1x1 Convolution · Dilated Convolution · Batch Normalization · ENet Initial Block · ENet Bottleneck · Max Pooling · ENet Dilated Bottleneck · SpatialDropout · Transposed Convolution · Convolution
