TL;DR
This paper presents a novel FPGA-based stencil accelerator using combined spatial and temporal blocking with OpenCL, achieving GPU-competitive performance without input size restrictions and projecting high performance on future FPGA devices.
Contribution
It introduces a new FPGA stencil acceleration method that combines spatial and temporal blocking to overcome the input size limitations of previous work, with accelerator parameters tuned using a performance model.
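To make the idea of combining spatial and temporal blocking concrete, here is a minimal software sketch in Python. It is not the paper's OpenCL design: it assumes a 1D 3-point averaging stencil with fixed end cells, and it assumes overlapped tiling, where each block is loaded with a halo wide enough to run several fused time steps locally. The function names and the `block` and `fused` parameters are hypothetical.

```python
def step(grid):
    """One time step of a 3-point averaging stencil; the two end cells stay fixed."""
    out = list(grid)
    for i in range(1, len(grid) - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def naive(grid, steps):
    """Reference: sweep the whole grid once per time step."""
    for _ in range(steps):
        grid = step(grid)
    return grid


def blocked(grid, steps, block, fused):
    """Spatial blocking (tiles of `block` cells) plus temporal blocking
    (`fused` time steps per pass over the grid).

    Assumption: overlapped tiling. Each tile is read with a halo of
    radius * fused cells per side, so the tile's interior stays correct
    after `fused` local steps without talking to neighbouring tiles.
    """
    n = len(grid)
    radius = 1  # stencil radius of the 3-point stencil
    done = 0
    while done < steps:
        t = min(fused, steps - done)   # last pass may fuse fewer steps
        halo = radius * t
        out = list(grid)
        for start in range(0, n, block):
            end = min(start + block, n)
            lo = max(start - halo, 0)  # clamp the halo at the grid edges
            hi = min(end + halo, n)
            tile = grid[lo:hi]         # "on-chip" working set
            for _ in range(t):
                tile = step(tile)      # halo cells go stale, interior does not
            out[start:end] = tile[start - lo:end - lo]
        grid = out
        done += t
    return grid


if __name__ == "__main__":
    data = [float((i * 7) % 13) for i in range(50)]
    ref = naive(data, 10)
    res = blocked(data, 10, block=8, fused=3)
    print(max(abs(a - b) for a, b in zip(ref, res)))
```

In this sketch the grid length is independent of the tile size, which is the point of adding spatial blocking: only one tile plus its halo has to fit in fast memory. The cost is redundant computation in the halo, which grows with the number of fused time steps, so `block` and `fused` trade off against each other.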
Findings
Achieves up to 760 GFLOP/s on Arria 10 for 2D stencils.
Attains 375 GFLOP/s on Arria 10 for 3D stencils.
Projects up to 3.5 TFLOP/s on upcoming Stratix 10 devices.
Abstract
Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arising from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device,…
