Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs
Abhinav Jangda, Arjun Guha

TL;DR
This paper introduces a GPU execution approach for image processing pipelines that fuses loops into warp-sized overlapped tiles, uses hybrid tiling, and chooses which loops to fuse automatically, producing faster code than Halide on the GPUs tested.
Contribution
It presents two tiling techniques, warp-sized overlapped tiling and hybrid tiling, along with an automatic loop fusion algorithm, to improve the GPU performance of image processing pipelines.
Findings
Achieves 1.65x speedup over Halide on GTX 1080Ti
Achieves 1.33x speedup over Halide on Tesla V100
Reduces shared memory usage and synchronization overhead
Abstract
Domain-specific languages that execute image processing pipelines on GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have limitations: 1) they require intra-thread-block synchronization, which has a non-trivial cost, 2) they must choose between small tiles that require more overlapped computations or large tiles that increase shared memory access (and lower occupancy), and 3) their autoscheduling algorithms use simplified GPU models that can result in inefficient global memory accesses. We present a new approach for executing image processing pipelines on GPUs that addresses these limitations as follows. 1) We fuse loops to form overlapped tiles that fit in a single warp, which allows us to use lightweight warp synchronization. 2) We introduce hybrid tiling, which stores…
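To make the tile-size trade-off mentioned in the abstract concrete, here is a minimal sketch of overlapped tiling in general. It is not the paper's warp-level GPU implementation: it runs sequentially in Python, uses a 1D two-stage 3-point blur as a stand-in pipeline, and all function names are hypothetical. Each tile recomputes a one-element halo of the first stage so that it can produce its slice of the second stage without reading from neighbouring tiles; the counter shows how much first-stage work is repeated.

```python
def blur3_at(src, i):
    """3-point box blur of src at index i, clamping indices at the borders."""
    n = len(src)
    return (src[max(i - 1, 0)] + src[i] + src[min(i + 1, n - 1)]) / 3.0


def pipeline_reference(image):
    """Unfused baseline: compute all of stage 1, then all of stage 2."""
    stage1 = [blur3_at(image, i) for i in range(len(image))]
    return [blur3_at(stage1, i) for i in range(len(stage1))]


def pipeline_overlapped(image, tile_width):
    """Fused pipeline with overlapped tiles (illustrative, sequential).

    Returns the output and the number of stage-1 evaluations, which
    exceeds len(image) because halo elements are computed by two tiles.
    """
    n = len(image)
    out = [0.0] * n
    stage1_evals = 0
    for lo in range(0, n, tile_width):
        hi = min(lo + tile_width, n)
        # Stage 2 on [lo, hi) needs stage 1 on [lo - 1, hi + 1), clipped.
        halo_lo = max(lo - 1, 0)
        halo_hi = min(hi + 1, n)
        # Tile-local scratch buffer (a stand-in for fast on-chip storage).
        local = {i: blur3_at(image, i) for i in range(halo_lo, halo_hi)}
        stage1_evals += len(local)
        for i in range(lo, hi):
            left = local[max(i - 1, 0)]
            right = local[min(i + 1, n - 1)]
            out[i] = (left + local[i] + right) / 3.0
    return out, stage1_evals


image = [float((7 * i) % 13) for i in range(64)]
reference = pipeline_reference(image)
out_wide, evals_wide = pipeline_overlapped(image, 32)
out_narrow, evals_narrow = pipeline_overlapped(image, 8)
```

In this sketch both tile widths reproduce the unfused result, but the narrower tiles evaluate the first stage more often. That is the redundant-computation side of the trade-off; the shared-memory and occupancy side has no analogue in sequential Python and is not modelled here.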
