TL;DR
This paper introduces an analytical modeling approach for choosing loop-level optimization configurations (loop tiling and loop permutation) for CNNs on multi-core CPUs. The approach reduces data movement and matches or exceeds the performance of existing libraries and auto-tuning methods.
Contribution
It presents a novel analytical method for efficiently exploring the loop-level optimization space of CNNs, matching or outperforming state-of-the-art auto-tuning techniques.
Findings
Achieves comparable or better performance than state-of-the-art libraries and auto-tuning based optimizers.
Reduces data movement through optimized loop transformations.
Offers an analytical alternative to search-based auto-tuning for navigating the explosively large loop-configuration space of CNNs on multi-core CPUs.
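To make "loop transformations" concrete, the sketch below applies loop tiling to a direct convolution. It is a generic illustration, not the paper's code: the function names `conv_naive` and `conv_tiled`, the tensor layout (`inp[c][y][x]`, `wgt[k][c][r][s]`), and the tile parameters `Tk`, `Tc`, `Ty`, `Tx` are hypothetical choices made for this example. In pure Python the tiled version is not faster; the point is only that tiling reorders the iterations into cache-sized blocks while computing exactly the same result.

```python
import random

def conv_naive(inp, wgt, K, C, H, W, R, S):
    """Direct convolution: out[k][y][x] = sum over c, r, s of inp[c][y+r][x+s] * wgt[k][c][r][s]."""
    out = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    for k in range(K):
        for c in range(C):
            for y in range(H):
                for x in range(W):
                    for r in range(R):
                        for s in range(S):
                            out[k][y][x] += inp[c][y + r][x + s] * wgt[k][c][r][s]
    return out

def conv_tiled(inp, wgt, K, C, H, W, R, S, Tk, Tc, Ty, Tx):
    """Same computation with the k, c, y, x loops tiled.

    The tile loops (kk, cc, yy, xx) step over blocks; the point loops stay inside one
    block, so on real hardware that block's data can remain in cache while it is reused.
    """
    out = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    for kk in range(0, K, Tk):
        for cc in range(0, C, Tc):
            for yy in range(0, H, Ty):
                for xx in range(0, W, Tx):
                    # Point loops; min() handles partial tiles at the edges.
                    for k in range(kk, min(kk + Tk, K)):
                        for c in range(cc, min(cc + Tc, C)):
                            for y in range(yy, min(yy + Ty, H)):
                                for x in range(xx, min(xx + Tx, W)):
                                    acc = out[k][y][x]
                                    for r in range(R):
                                        for s in range(S):
                                            acc += inp[c][y + r][x + s] * wgt[k][c][r][s]
                                    out[k][y][x] = acc
    return out

if __name__ == "__main__":
    random.seed(0)
    K, C, H, W, R, S = 4, 3, 6, 5, 3, 3
    # Input has a halo of R-1 rows and S-1 columns so every output point is valid.
    inp = [[[random.randint(-3, 3) for _ in range(W + S - 1)]
            for _ in range(H + R - 1)] for _ in range(C)]
    wgt = [[[[random.randint(-3, 3) for _ in range(S)] for _ in range(R)]
            for _ in range(C)] for _ in range(K)]
    # Ty=4 and Tx=2 do not divide H=6 and W=5, exercising the partial-tile edges.
    same = conv_tiled(inp, wgt, K, C, H, W, R, S, 2, 2, 4, 2) == conv_naive(inp, wgt, K, C, H, W, R, S)
    print("tiled matches naive:", same)
```

The order of the four tile loops (here `kk, cc, yy, xx`) is itself a choice: permuting them changes which tensor's block is reused across consecutive tiles, which is the permutation half of the configuration space.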
Abstract
Moving data through the memory hierarchy is a fundamental bottleneck that can limit the performance of core algorithms of machine learning, such as convolutional neural networks (CNNs). Loop-level optimizations, including loop tiling and loop permutation, are fundamental transformations to reduce data movement. However, the search space for finding the best loop-level optimization configuration is explosively large. This paper develops an analytical modeling approach for finding the best loop-level optimization configuration for CNNs on multi-core CPUs. Experimental evaluation shows that this approach achieves comparable or better performance than state-of-the-art libraries and auto-tuning based optimizers for CNNs.
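As a rough illustration of what analytically choosing a loop configuration can look like, the sketch below scores every tile-loop order and tile size for one convolution layer with a closed-form data-movement estimate, then picks the cheapest configuration that fits in cache. This is a toy model, not the one developed in the paper. The layer shape, the single cache level, the rule that a tensor's tile is reloaded once per iteration of the innermost loop it depends on (and of every loop outside it), and the requirement that all three tiles fit in cache together are simplifying assumptions. The names `DIMS`, `data_movement`, and `best_config` are hypothetical.

```python
from itertools import permutations

# Hypothetical layer: 64 output channels (k), 32 input channels (c),
# a 28 x 28 output (y, x), and a 3 x 3 filter (R x S, left untiled).
DIMS = {"k": 64, "c": 32, "y": 28, "x": 28}
R, S = 3, 3

# Tiled loop indices that each tensor's tile depends on.
DEPENDS = {"out": {"k", "y", "x"}, "inp": {"c", "y", "x"}, "wgt": {"k", "c"}}

def footprints(t):
    """Elements of each tensor touched by one tile (t maps loop index -> tile size)."""
    return {
        "out": t["k"] * t["y"] * t["x"],
        "inp": t["c"] * (t["y"] + R - 1) * (t["x"] + S - 1),  # includes the filter halo
        "wgt": t["k"] * t["c"] * R * S,
    }

def data_movement(order, t):
    """Estimated elements moved into cache for a tile-loop order (outermost first).

    A tensor's tile is reloaded on every iteration of the innermost loop it depends on
    and of all loops outside it; loops further in that it does not depend on reuse it.
    """
    trips = [DIMS[d] // t[d] for d in order]
    fp = footprints(t)
    total = 0
    for name, deps in DEPENDS.items():
        innermost = max(i for i, d in enumerate(order) if d in deps)
        loads = 1
        for n in trips[: innermost + 1]:
            loads *= n
        total += loads * fp[name]
    return total

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def best_config(cache_elems):
    """Exhaustively search loop orders and divisor tile sizes; return (cost, order, tiles)."""
    best = None
    for order in permutations(DIMS):
        for tk in divisors(DIMS["k"]):
            for tc in divisors(DIMS["c"]):
                for ty in divisors(DIMS["y"]):
                    for tx in divisors(DIMS["x"]):
                        t = {"k": tk, "c": tc, "y": ty, "x": tx}
                        if sum(footprints(t).values()) > cache_elems:
                            continue  # all three tiles must be cache-resident at once
                        cost = data_movement(order, t)
                        if best is None or cost < best[0]:
                            best = (cost, order, t)
    return best

if __name__ == "__main__":
    # E.g. a 32 KiB cache holding 4-byte floats: 8192 elements.
    cost, order, tiles = best_config(8192)
    print("order:", order, "tiles:", tiles, "estimated elements moved:", cost)
```

Exhaustive enumeration is feasible here only because the toy space is tiny (24 loop orders times 1,512 tile-size combinations). With multiple cache levels, non-divisor tile sizes, and multi-core parallelization the space grows far faster, which is why a model that prunes or solves for the configuration, rather than enumerating or auto-tuning it, is attractive.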
