Thread Batching for High-performance Energy-efficient GPU Memory Design

Bing Li; Mengjie Mao; Xiaoxiao Liu; Tao Liu; Zihao Liu; Wujie Wen; Yiran Chen; Hai (Helen) Li

arXiv:1906.05922·cs.AR·June 17, 2019




Bing Li (ORCID 0000-0003-0732-2267), Duke University, Department of Electrical and Computer Engineering, Durham, NC 27701, USA; Army Research Office, Research Triangle Park, USA

Mengjie Mao, MathWorks Inc., USA

Xiaoxiao Liu, AMD, USA

Tao Liu, Florida International University, Department of Electrical and Computer Engineering, Miami, FL 33174, USA

Zihao Liu, Florida International University, Department of Electrical and Computer Engineering, Miami, FL 33174, USA

Wujie Wen, Florida International University, Department of Electrical and Computer Engineering, Miami, FL 33174, USA

Yiran Chen, Duke University, Department of Electrical and Computer Engineering, Durham, NC 27701, USA

Hai (Helen) Li, Duke University, Department of Electrical and Computer Engineering, Durham, NC 27701, USA

Abstract.

Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to the rapid growth in thread-level parallelism of GPU and the slow improvement of peak memory bandwidth, memory becomes a bottleneck of GPU's performance and energy efficiency. In this work, we propose an integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU. Firstly, we propose a thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each stream multiprocessor (SM) to dedicated memory banks. After that, TEMP dispatches the thread batch to an SM to ensure highly parallel memory-access streaming from the different thread blocks. Secondly, a thread batch-aware scheduling (TBAS) scheme is introduced to improve the GPU memory access locality and to reduce the contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of mixed CPU+GPU workloads when they are run on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.

Keywords: GPU, Memory partitioning, Thread batch, Warp scheduler

This work was supported by the US Department of Energy (DOE) under Grant SC0017030. Bing Li is supported by the NRC Associate Fellowship Award. This work is an extended version of a paper published in the Proceedings of the Design Automation Conference (DAC) 2016, entitled "TEMP: Thread Batch Enabled Memory Partitioning for GPU".

Journal: JETC, 2019. CCS concepts: Computer systems organization, Architectures; Computer systems organization, Multicore architectures; Hardware, Power estimation and optimization.

1. Introduction

The use of Graphics Processing Units (GPUs) has been extended from fixed graphics acceleration to general purpose computing, including image processing, computer vision, machine learning, and scientific computing. GPU is widely employed in various platforms ranging from embedded systems to high-performance computing systems (Shimpi et al., 2012).

GPU heavily relies on massive threading to achieve high throughput. However, massive threading commonly incurs intensive memory accesses, which may limit the performance and energy efficiency of GPU (Jog et al., 2013b) because of the high overhead of device memory access (in this work, we use device memory and memory interchangeably). Though large-capacity and low-overhead caches have been adopted by GPU to alleviate the impact of inefficient memory accesses (Abdel-Majeed and Annavaram, 2013; Mao et al., 2014), the available cache per thread is far below the demand of most GPU applications (Jia et al., 2012). The pressure on device memory, i.e., DRAM, in GPU is still severe.

Memory scheduling is one of the primary architectural techniques to improve memory efficiency as it is able to optimize the memory access parallelism and locality in multi-core systems (Mutlu and Moscibroda, 2007, 2008; Kim et al., 2010; Ebrahimi et al., 2011; Usui et al., 2016). However, the existing memory scheduling algorithms are usually associated with expensive implementation (Liu et al., 2012) and also insufficient to handle the intensive memory accesses in GPU (Yuan et al., 2009; Ausavarungnirun et al., 2012).

The memory partitioning (MP) based on operating system (OS) memory management is another viable approach to improve memory efficiency and reduce inter-thread memory interference. Memory partitioning generally divides memory resources and assigns them to threads, and every thread accesses its exclusive memory space (Mi et al., 2010; Liu et al., 2012; Jeong et al., [n. d.]; Xie et al., 2014; Suzuki et al., 2013). Memory partitioning is promising to improve memory efficiency in the GPU system because of the following reasons: 1) The memory address space in the heterogeneous system is pageable. The memory pages can be allocated to the GPU threads by OS; and 2) The threads in GPU are nearly homogeneous. When they are evenly dispatched to stream multiprocessors (SMs), the fairness and parallelism of their access to memory can be guaranteed. This assertion, however, may be invalid in other multi-core systems due to the disparity of memory bandwidth required by their threads (Xie et al., 2014).

Unfortunately, the existing memory partitioning mechanisms for multi-core systems cannot be directly applied to the GPU. One example is memory bank partitioning (MBP) (Mi et al., 2010; Jeong et al., [n. d.]; Liu et al., 2012; Xie et al., 2014), which allows each thread to access an exclusive memory bank. MBP targets multi-program systems that have few parallel threads. In contrast, a GPU always runs massive numbers of threads, orders of magnitude more than the available banks, so it is impossible to give every thread an exclusive memory bank. Moreover, all threads in a GPU application share a unified address space (NVIDIA, 2009). Their memory accesses interweave and are difficult to separate using memory partitioning techniques.

To address the above problems, we propose an integrated solution to improve the performance and energy efficiency of GPU applications. The integrated solution is composed of the thread batch enabled memory partitioning (TEMP) to enhance the memory access parallelism and the thread batch-aware scheduling (TBAS) to improve the memory access locality. Specifically, TEMP assigns the majority of memory requests from the same SM to dedicated banks to ensure the parallelism of memory accesses of threads. The thread blocks that share the same set of pages are grouped into a thread batch and then dispatched to an SM as a whole. Meanwhile, by applying the page coloring mechanism, the accessed pages are mapped to the dedicated banks that are associated with the same SM (Lin et al., 2008). In this way, TEMP minimizes the interference of memory accesses from different SMs and improves the parallelism of memory accesses. Moreover, TBAS prioritizes the execution of thread batches to preserve the locality of memory accesses. The thread batches that access the same row in one bank are clustered and scheduled together. Accordingly, TBAS effectively alleviates the contention on the memory controllers and the congestion on the reply network connecting the memory partitions to SMs.

We compare TEMP and TBAS with some representative thread scheduling techniques, including the cache-conscious wavefront scheduler (CCWS) (Rogers et al., 2012), OWL (Jog et al., 2013a) and the bandwidth-aware policy (BW-AWARE) (Agarwal et al., 2015). We set CCWS as our baseline and integrate OWL, BW-AWARE and our techniques on top of CCWS. The benchmarks consist of not only GPU applications but also combined CPU-GPU applications. Experimental results show that after applying TEMP and TBAS, the GPU system achieves 10.3% performance improvement and 11.3% reduction of the DRAM energy consumption for the evaluated GPU applications compared to the baseline. The results of the combined CPU-GPU workloads demonstrate that a simple yet effective solution is capable of addressing the interference incurred by the CPU executions and ensuring high execution efficiency in GPU applications using TEMP and TBAS, with negligible performance degradation on the CPU side.

The rest of this paper is organized as follows: Section 2 introduces the background of the heterogeneous CPU-GPU system and memory system; Section 3 and Section 4 describe the details of TEMP and TBAS, respectively; Section 5 summarizes our experimental setup; Section 6 presents the experimental results and related analyses; Section 7 discusses the related works; and Section 8 concludes our work.

2. Background

2.1. Heterogeneous CC-NUMA

The heterogeneous CPU-GPU integrated systems are evolving towards a unified memory address space (Chu, 2013). Because of discrepant bandwidth requirements, it is anticipated that GPU will still be physically attached to bandwidth-optimized DRAM, while CPU is attached to capacity- and cost-optimized DRAM. DRAMs of GPU and CPU share a unified memory address space (Agarwal et al., 2015). In such a heterogeneous cache coherent non-uniform memory access (CC-NUMA) system, a computing unit has different access delays to local and remote memories even though it sees a unified address space. Fig. 1 shows a heterogeneous CC-NUMA system including several CPUs and a GPU. The system interconnection networks bridge the two memories and maintain the coherence between caches of the CPUs and the GPU.

Heterogeneous CC-NUMA allows better programmability and finer-grained memory management of the GPU. The OS can allocate the GPU pages in all memories. In this work, we use the default NUMA page placement policy in Linux, i.e., local, which places as many pages as possible in the local memory. By using the local policy we can avoid most bandwidth contention between the CPUs and the GPU in heterogeneous CC-NUMA.

GPU programming models such as CUDA (NVIDIA, [n. d.]a) and OpenCL (Inc., [n. d.]b) define the workload offloaded to a GPU as a kernel. A kernel is highly multi-threaded, and all of its threads are encapsulated in a grid. Within a grid, the threads are partitioned into three-dimensional thread blocks, each of which contains up to thousands of threads. During execution, each thread block is dispatched as a whole to an SM. Every SM holds a complete single instruction multiple data (SIMD) pipeline. Each thread block in the SM is further partitioned into many fixed-size warps that are atomically scheduled by a warp scheduler and executed in the SIMD fashion. The L2 caches of CPUs and of GPU are separated and placed in different memory partitions, each of which has its own memory channel. The on-chip caches, including L1 data and instruction caches in the CPUs and the GPU, are connected to the L2 caches via a mesh network. In such a design, GPU can use page-fault memory rather than being restricted to page-locked memory (Branover et al., 2012) and non-pageable memory (NVIDIA, 2009).

2.2. DRAM Basics

A modern JEDEC compliant DDRx DRAM system consists of one or more channels, each of which has its own data buses, command buses, and address transferring. Fig. 2(a) depicts the basic organization of a DRAM channel, which also has a memory controller (MC) to control the operations on the channel. A channel may include multiple DIMMs. Within each DIMM, there are several ranks, each of which consists of multiple DRAM devices. In DDR3, a DRAM device contains eight banks. The data of each bank are always pre-loaded to its private row buffer before being accessed.

DRAM address mapping complies with the DRAM organization. The address mapping scheme in Fig. 2(b) (Bakhoda et al., 2009) is the baseline DRAM address mapping used in our heterogeneous architecture. The address mapping scheme in Fig. 2(c) is used for the page coloring mechanism in our work. If the number of page offset bits is not greater than the sum of the column and byte offset bits, then by using page coloring a GPU page can be mapped to an arbitrary channel, rank, bank or row in a bank.
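The page coloring condition above can be illustrated with a minimal sketch. The bit layout below is hypothetical and is not taken from Fig. 2; it only assumes that the channel and bank bits sit above the 4KB page offset, so the OS can choose them by choosing the physical frame. The names `page_color` and `frame_with_color` and the bit widths are assumptions for illustration.

```python
# Hypothetical bit layout (NOT the paper's Fig. 2(c)):
#   | row | bank | channel | page offset (column + byte offset) |
PAGE_OFFSET_BITS = 12   # 4KB pages
CHANNEL_BITS = 1        # assumed: 2 channels
BANK_BITS = 3           # assumed: 8 banks per channel


def page_color(phys_addr):
    """Return the (channel, bank) pair selected by a physical address."""
    frame = phys_addr >> PAGE_OFFSET_BITS
    channel = frame & ((1 << CHANNEL_BITS) - 1)
    bank = (frame >> CHANNEL_BITS) & ((1 << BANK_BITS) - 1)
    return channel, bank


def frame_with_color(row, channel, bank):
    """Build a physical frame number that lands in the given channel and bank."""
    return (row << (CHANNEL_BITS + BANK_BITS)) | (bank << CHANNEL_BITS) | channel


# Every byte of a page shares one color, because the color bits
# lie entirely above the page offset.
addr = frame_with_color(row=5, channel=1, bank=3) << PAGE_OFFSET_BITS
print(page_color(addr), page_color(addr + 100))
```

Under this assumed layout, binding an SM to a bank amounts to allocating its pages only from frames whose color bits match that bank.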

The memory usage efficiency is mainly determined by bank-level parallelism (BLP) (Mutlu and Moscibroda, 2008) and row locality measured by row buffer hit rate (RBHR). All the banks in a DRAM can be accessed concurrently as each bank has its own address decoder and sensing logic. However, only one bank can put/receive the data on/from the shared bus at a time. All memory requests (reads and writes) need to go through the row buffer. Memory access latency and energy can be reduced when the access hits in the row buffer as no row activation is needed. In multi-core systems, a variety of memory schedulers (Mutlu and Moscibroda, 2007, 2008) have been proposed to improve the BLP and row locality as well as maximize the access fairness. However, these designs are generally insufficient to handle the massive parallel memory requests of GPU (Yuan et al., 2009). In this work, we propose TEMP and TBAS to improve the DRAM efficiency in GPU by minimizing the inter-SM interference of memory accesses, which is the root cause of low BLP and low row locality of DRAM accesses (Jeong et al., [n. d.]).
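As a rough illustration of why inter-SM interference lowers row locality, the sketch below computes the row buffer hit rate of an access trace. It assumes an open-row policy with one row buffer per bank and ignores timing; the function name and the two toy traces are made up for this example and do not come from the evaluation.

```python
def row_buffer_hit_rate(accesses):
    """Compute RBHR for (bank, row) accesses given in service order."""
    open_row = {}  # bank -> row currently held in that bank's row buffer
    hits = total = 0
    for bank, row in accesses:
        total += 1
        if open_row.get(bank) == row:
            hits += 1               # row already open: no activation needed
        else:
            open_row[bank] = row    # activation replaces the buffer contents
    return hits / total if total else 0.0


# Two SMs alternating between different rows of the same bank.
interleaved = [(0, 0), (0, 1), (0, 0), (0, 1)]
# The same four requests after each SM is confined to its own bank.
separated = [(0, 0), (0, 0), (1, 1), (1, 1)]

print(row_buffer_hit_rate(interleaved))
print(row_buffer_hit_rate(separated))
```

In the interleaved trace every request closes the row opened by the other SM, while the separated trace lets the second request to each bank hit in the row buffer.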

3. Thread Batch Enabled Memory Partitioning (TEMP)

A naïve GPU memory partitioning may bind each SM to one or more banks. All the pages accessed by a thread block can be placed in the banks bound to the SM where the thread block is executed. Ideally, if there is no shared page among different thread blocks, the banks can be exclusively accessed by the associated SM. Unfortunately, page sharing between thread blocks commonly exists in GPU kernels. The simple page placement mentioned above is unable to separate the memory access streams raised from different SMs. To address the issue, we propose TEMP, which identifies and groups the thread blocks sharing pages (Section 3.1) and dispatches them to the same SM (Section 3.2) so as to minimize the inter-SM interference of memory accesses. A group of thread blocks sharing pages is denoted as a thread batch. The rest of this section details the design and implementation of TEMP.

3.1. Thread Batch Formation

By profiling the prevalent GPU benchmark suites, we observed that there are two major types of thread-data mappings with some page sharing patterns in thread blocks. (In this work, we only consider the kernels constructed with 1D and 2D thread block/grid, because none of the profiled benchmarks employs 3D thread block/grid; see Table 1.) In the first type of thread-data mapping, the data accessed by each thread block is clustered over a sequential address space. Fig. 3 shows the skeleton of the Mapper kernel in the MapReduce engine of Mars (He et al., 2008). This kernel employs fixed 1D thread blocks and scatters them to a 1D or 2D grid. Generally, consecutive thread blocks sequentially access the 1D vector inputKeys, and each thread block accesses a linear address space ranging from recordBase to terminate within inputKeys.

Fig. 4 simplifies and visualizes the first type of thread-data mapping. In this example we assume the grid of the kernel contains four thread blocks, each of which consists of four threads. The 1D thread blocks are arranged in a 2D grid. Their accessed data matrix is shown in Fig. 4(b). In this example, the first row of the data matrix is accessed by thread block (0,0,0), the second row is accessed by thread block (1,0,0), and so on. If the row address of the matrix aligns to a page, the SM-level page coloring can perfectly place the pages accessed by an SM in the bound banks, as depicted in Fig. 4(c). Here a page is equal to a matrix row. However, if a page is composed of multiple matrix rows, say, two matrix rows, conventional thread block dispatching which interleaves thread blocks across SMs will generate interwoven memory accesses, as shown in Fig. 4(d). To address this situation, we can pack the thread blocks accessing the same set of pages into a thread batch and then dispatch the thread batch as a whole to an SM. For the example shown in Fig. 4(d), the 4 thread blocks can be grouped into 2 thread batches, each of which goes to an SM. The memory accesses to banks 0 and 1 are successfully separated, as illustrated in Fig. 4(e).

The second type of thread-data mapping is that the data accessed by consecutive thread blocks are interleaved over a linear address space. Fig. 5 shows the code snippet of the cenergy kernel in the CUTCP benchmark (Stratton et al., 2012). CUTCP computes the coulombic potential at a molecular grid energygrid. A point in energygrid is indexed by xindex and yindex generated from a thread's indexes. All threads form a 2D grid which is further tiled with 2D thread blocks. Fig. 6 demonstrates a simplified thread-data mapping in this 2D grid. The thread organization and accessed data matrix can be found in Fig. 6(a) and (b), respectively. Here, we again assume one grid has four thread blocks, and each thread block has four threads. In this example, every thread block has two active dimensions ($x$-axis and $y$-axis). Each matrix row is accessed by two thread blocks while each thread block accesses two rows. In such a situation, the consecutive thread blocks likely access the same set of pages. Similarly, we can pack those thread blocks sharing the same set of pages into one thread batch. Fig. 6(c) gives a thread batching example where every matrix row in Fig. 6(b) exactly forms one page. Thread blocks (0,0,0) and (1,0,0) share pages 0 and 1, while thread blocks (0,1,0) and (1,1,0) share pages 2 and 3. Consequently, we can group thread blocks (0,0,0) and (1,0,0) into thread batch 0 and thread blocks (0,1,0) and (1,1,0) into thread batch 1. By allocating pages 0 & 1 into bank 0 and pages 2 & 3 into bank 1, the memory accesses from SM 0 to bank 0 and from SM 1 to bank 1 are separated.

Those two major thread-data mapping scenarios indicate consecutive thread blocks may share pages. Accordingly, we introduce the thread block stride to indicate the number of the consecutive thread blocks that belong to the same thread batch. In the examples in Fig. 4(c) and 6(c), the thread block stride is 1 and 2, respectively.
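Batch formation with a fixed thread block stride can be sketched in a few lines. The sketch assumes thread blocks are identified by a linearised id (0, 1, 2, ...); the function name `form_thread_batches` is hypothetical and not part of the paper.

```python
def form_thread_batches(num_blocks, stride):
    """Group consecutive linearised thread block ids into batches of `stride`."""
    return [
        list(range(start, min(start + stride, num_blocks)))
        for start in range(0, num_blocks, stride)
    ]


# Four thread blocks with stride 1: every block is its own batch.
print(form_thread_batches(4, 1))
# Four thread blocks with stride 2: two batches of two consecutive blocks.
print(form_thread_batches(4, 2))
```

When the number of thread blocks is not a multiple of the stride, this sketch simply leaves the last batch shorter than the others.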

To find the thread block stride of a GPU kernel, we profile a kernel given a page size at the compile time when the programmer determines the thread hierarchy and how the threads access the data matrices.

At the profiling stage, the start addresses of data matrices are set to zero. During dynamic memory allocation, the start memory address of a data matrix is aligned to the beginning of a page to guarantee that the thread block stride found at compile time remains valid. Fig. 7 shows the optimal thread block stride of some GPU applications. The optimal thread block stride denotes the thread block stride that suppresses the most cross-batch page sharing. Here, the page size is 4KB, which is supported by most computer systems. 89% of kernels achieve the minimum inter-thread batch page sharing through a batch formation with a fixed thread block stride. There are also 6% of kernels where the batch formation can be realized using modulation. Some kernels in MUM and LBM cannot be fitted with a formula for the batch formation.
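One way to picture the profiling step is as a search over candidate strides that counts the pages touched by more than one batch. The sketch below is an assumption about how such a search could look, not the authors' profiler: the input `pages_per_block` (one set of page numbers per linearised thread block) and both function names are hypothetical. Because a stride covering all blocks trivially removes cross-batch sharing, ties are broken in favour of the smallest stride.

```python
def cross_batch_shared_pages(pages_per_block, stride):
    """Count pages touched by more than one batch for a given stride."""
    owners = {}  # page -> set of batch ids that touch it
    for block_id, pages in enumerate(pages_per_block):
        for page in pages:
            owners.setdefault(page, set()).add(block_id // stride)
    return sum(1 for batches in owners.values() if len(batches) > 1)


def best_stride(pages_per_block, candidates):
    """Pick the smallest candidate stride with the least cross-batch sharing."""
    return min(
        candidates,
        key=lambda s: (cross_batch_shared_pages(pages_per_block, s), s),
    )


# Toy trace in the spirit of Fig. 4(d): a page holds two matrix rows,
# so blocks 0 and 1 share page 0 and blocks 2 and 3 share page 1.
trace = [{0}, {0}, {1}, {1}]
print(best_stride(trace, candidates=[1, 2, 4]))
```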

The static compile-time profiling is sub-optimal since it cannot proactively remove the cross-batch page sharing. For example, the last thread block in a thread batch may share a page with the first thread block in the following thread batch, if those thread batches are formed with a fixed thread block stride. In the next section we introduce a simple dynamic hardware approach that supports thread batching better than static profiling.

We further analyze some GPU applications which form thread batches with the fixed thread block stride. The accumulated percentage of the pages shared by different sizes of consecutive thread batches is shown in Fig. 8. The horizontal axis shows the maximal distance of the shared pages among the thread batches. Among all the accessed pages, nearly 75% on average is exclusively accessed by a single thread batch and 22% is accessed by two consecutive thread batches. These two cases dominate the page access patterns in the thread batches ($>97\%$). There are more than 2% of pages globally shared among all the thread batches in a kernel, such as program text pages.

3.2. Serial Thread Block Dispatching

Given that the thread batching and the cross-batch page sharing dominate the GPU applications, we propose serial thread block dispatching. The consecutive thread blocks, which are very likely enclosed by consecutive thread batches, are emitted to the same SM. As such, most thread batches are formed implicitly by the serial thread block dispatching, and most cross-batch page sharing is constrained within an SM. Cross-batch page sharing across SMs now only happens when the thread blocks of a thread batch are distributed to multiple SMs, which can only occur for the first and the last thread batch in an SM.

Traditional interleaved thread block dispatching, e.g., the GigaThread engine in NVIDIA GPU (NVIDIA, 2009), generates and dispatches a new thread block to an SM once the SM has an idle slot. Typically, the dispatching unit only passes the id of the new thread block to the SM, and the SM constructs a whole thread block according to the received thread block id. The dispatching unit generates the thread block ids sequentially and the thread block ids are dispatched to SMs randomly. To implement deterministic and serial thread block dispatching, we introduce a dispatch queue in each SM. The contents of the dispatch queue, i.e., the thread block ids, are inserted before launching a kernel. Each SM receives a similar number of thread block ids for workload balance, which can be determined at compile time. During the kernel execution, the thread block ids are popped from the dispatch queue and emitted to the associated SM.

Compared to the traditional thread block dispatching, serial thread block dispatching avoids the stall of the launch of thread blocks. An SM can always pop a thread block id from its dispatch queue once it has an idle slot. The implementation of the dispatch queue can be highly efficient since each SM only needs two extra registers to record the head and the tail of thread block ids. The head register increments by one once a new thread block id (the head register itself) is popped. The dispatching of the thread block ends when the head register meets the tail of the thread block id stored in the second register. Thus, the serial thread block dispatching incurs marginal run-time and hardware overheads.
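The two-register dispatch queue described above can be modelled in software as follows. This is a minimal sketch under stated assumptions: the tail register is taken to hold one past the last thread block id assigned to the SM, and `build_queues` splits the ids into contiguous, near-equal ranges without aligning the split points to thread batch boundaries. The class and function names are hypothetical.

```python
class DispatchQueue:
    """Per-SM dispatch queue modelled with two registers: head and tail."""

    def __init__(self, head, tail):
        self.head = head  # next thread block id to emit
        self.tail = tail  # one past the last id for this SM (assumed convention)

    def pop(self):
        """Return the next thread block id, or None when the range is exhausted."""
        if self.head == self.tail:
            return None
        block_id = self.head
        self.head += 1    # the head register increments on every pop
        return block_id


def build_queues(num_blocks, num_sms):
    """Give each SM a contiguous, near-equal range of thread block ids."""
    base, extra = divmod(num_blocks, num_sms)
    queues, start = [], 0
    for sm in range(num_sms):
        count = base + (1 if sm < extra else 0)
        queues.append(DispatchQueue(start, start + count))
        start += count
    return queues


queues = build_queues(num_blocks=10, num_sms=4)
print([(q.head, q.tail) for q in queues])
```

Because each SM pops ids from its own contiguous range, consecutive thread blocks end up on the same SM, which is the property serial dispatching relies on.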

4. Thread Batch-aware Scheduling (TBAS)

TEMP constrains the memory accesses from a SM within the associated memory banks, offering an opportunity to improve intra-bank/row locality by scheduling the execution of threads. Accordingly, we propose TBAS that can be explained using the example in Fig. 9.

Fig. 9(a) presents the thread organization and the data matrix in the example. In a GPU, there is only one SM (i.e., SM0) associated with its own DRAM bank. Four thread batches, each of which consists of only one thread block, are formed and dispatched to SM0. Every thread batch exclusively accesses its own page; the page layout of SM0's bank is also shown in Fig. 9(a). We assume two pages are included in one row in the bank (generally, the row size of a DRAM is several times larger than the smallest page size that the OS can support). Every two threads in a thread block form a warp. Since there are four threads in one thread block, each thread block has two warps and a total of eight warps (or four thread blocks) are running on SM0.

Fig. 9(b) shows the execution of SM0 with a cache-conscious wavefront scheduler (CCWS) (Rogers et al., 2012). CCWS was designed for improving the L1 cache locality in GPUs. It captures the intra-warp locality and decreases the L1 thrashing by limiting the number of active warps in an SM based on the L1 eviction information. Typically, CCWS only keeps a subset of warps running in SMs and throttles the rest of the warps, which stay pending in the same SM, if cache thrashing is detected. Once a warp in the running set encounters a stall, it will be demoted to the pending set. Simultaneously, another warp in the pending set will be promoted to the running set. Here, we assume that a running set includes two warps. It is very likely that the two warps in a running set come from different thread batches. Hence, they may compete for different rows in the bank and degrade the row locality.

We can propose a better scheduling policy to improve the row locality, as depicted in Fig. 9(c): the running set gathers active warps of the same thread batch as they commonly access the same page (i.e., the same row). If the thread batch in running set does not have sufficient active warps, all the warps of this thread batch are demoted to the pending set, and a new thread batch that has sufficient active warps will be promoted to the running set.

In such a design, promoting warps may harm the row locality when the rows accessed by the previous active warps and the newly promoted ones are different. Hence, as shown in Fig. 9(d), a better promotion scheme can promote a thread batch that is the successor of the demoted thread batch, e.g., promoting (1,0,0) (or (1,1,0)) after demoting (0,0,0) (or (0,1,0)). Due to the page allocation mechanism, the adjacent thread batch is most likely to access the same row in the bank.

The above sequential thread batch switching often results in a round-robin execution sequence, potentially incurring the burst of memory accesses in a short time. As illustrated in Fig. 9(d), all memory accesses are evoked in the first four scheduling cycles. The situations that may harm the scheduling efficiency include: 1) A thread batch demoted by a long operation could access the same page again in the near future. However, it may not be scheduled again in time; 2) When the thread batches are continuously promoted to the running set, the generated memory-accesses burst is coupled with the lost locality. The prolonged queuing delay in memory controllers may overwhelm the reply network connecting memory controllers and SMs (Bakhoda et al., 2010).

To overcome the above drawbacks, we assign higher promotion priority to older thread batches in the pending set. We assume the priority of the thread batches in Fig. 9(a) descends from the left to the right and then from the top to the bottom. Fig. 9(e) shows the scheduling sequence of the thread batches considering our proposed promotion priority. The improvement of row locality, especially the decreasing of memory access burst, leads to significant reduction in average memory access latency. We name the scheduling method corresponding to the example presented in Fig. 9(e) as TBAS.
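The promotion rule can be sketched as a small function that decides which thread batch occupies the running set. This is a simplified model, not the hardware arbitrator: it assumes a single running batch, uses the batch id as its age (a lower id means an older batch), and treats a batch as eligible when it has at least `min_active` ready warps. All names and the `min_active` parameter are hypothetical.

```python
def switch_batch(running, pending, active_warps, min_active=1):
    """Return (running batch, pending batches) after one scheduling decision.

    running: id of the batch in the running set, or None.
    pending: ids of the batches in the pending set.
    active_warps: dict mapping batch id -> number of ready warps.
    """
    if running is not None and active_warps.get(running, 0) >= min_active:
        return running, sorted(pending)      # current batch keeps running
    ready = [b for b in pending if active_warps.get(b, 0) >= min_active]
    if not ready:
        return running, sorted(pending)      # nothing eligible to promote
    promoted = min(ready)                    # oldest eligible batch wins
    new_pending = [b for b in pending if b != promoted]
    if running is not None:
        new_pending.append(running)          # demoted batch keeps its age
    return promoted, sorted(new_pending)


# Batch 0 stalls, batch 1 is not ready, so batch 2 is promoted.
print(switch_batch(0, [1, 2, 3], {0: 0, 1: 0, 2: 2, 3: 2}))
# Once batch 0 is ready again, it is promoted ahead of the younger batch 3.
print(switch_batch(2, [0, 1, 3], {0: 2, 1: 0, 2: 0, 3: 2}))
```

Keeping the demoted batch's original age is what lets an older batch return to the running set before younger ones, which is the behaviour the promotion priority is meant to provide.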

Besides the maintenance of intra-/inter-thread batch row locality and alleviation on congestion of reply network, TBAS also reduces the stretch of memory access footprint by limiting the active thread batches in a particular time window. Such a limitation on thread-level parallelism can bring in an implicit positive effect on the cache locality (Rogers et al., 2012) as we shall explain in Section 6.1.

The hardware overhead of TBAS is similar to that of CCWS except for the promotion priority arbitrator. Fortunately, the number of concurrent thread batches in an SM is usually small: An SM of Fermi GPU, for example, supports only eight concurrent thread blocks (or at most 8 thread batches). Therefore, the implementation overhead of the arbitrator is negligible.

5. Experiment Methodology

5.1. Benchmark

We adopt a set of diverse GPU applications from (NVIDIA, [n. d.]b; Bakhoda et al., 2009; Che et al., 2009; Jog et al., 2013a; Stratton et al., 2012) as the benchmark for our evaluations. Most of the applications are fully simulated, except for the applications from (Jog et al., 2013a), of which only the first two billion instructions are simulated. The detailed characteristics of each application in the benchmark are summarized in Table 1. All GPU applications are profiled to generate the optimal thread batches before execution.

We combined eight CPU applications with GPU applications to construct the heterogeneous workloads for the evaluation. The CPU workloads are from SPEC CPU 2006, as shown in Table 2. PinPoint (Luk et al., 2005) is used to extract the execution phases for all CPU applications. The CPU applications are divided into two types: memory intensive, where the L2 cache misses per kilo instructions (MPKI) is higher than 20; and memory non-intensive, where the L2 cache MPKI is lower than 20. The GPU applications can also be classified into two types based on L2 cache MPKI: memory intensive (MPKI $> 2$) and non-intensive (MPKI $< 2$). Although the L2 cache MPKI of most GPU applications is lower than that of CPU applications, within an arbitrary time window GPU applications can generate two orders of magnitude more L2 cache misses than CPU applications due to their high instruction throughput (i.e., IPC). Moreover, we grouped the GPU applications into three categories, C1–C3, according to their sensitivity to TEMP+TBAS (explained in Section 6.1).
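The classification rule reduces to a threshold on L2 MPKI, as the short sketch below shows. The thresholds (20 for CPU, 2 for GPU) come from the text; the function names are hypothetical, and treating a value exactly equal to the threshold as non-intensive is an assumption, since the text does not cover that case.

```python
def mpki(l2_misses, instructions):
    """L2 cache misses per kilo instructions."""
    return 1000.0 * l2_misses / instructions


def classify(l2_misses, instructions, kind):
    """Label an application as memory 'intensive' or 'non-intensive'."""
    threshold = {"cpu": 20.0, "gpu": 2.0}[kind]
    value = mpki(l2_misses, instructions)
    # Assumption: a value exactly at the threshold counts as non-intensive.
    return "intensive" if value > threshold else "non-intensive"


# The same miss count is intensive for a GPU application
# but non-intensive for a CPU application.
print(classify(5_000, 1_000_000, "gpu"))
print(classify(5_000, 1_000_000, "cpu"))
```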

We permute the combinations of different types of CPU and GPU applications to create twelve heterogeneous workloads. Each workload consists of two CPU applications and one GPU application, as summarized in Table 3. We construct ten workloads (WL0–WL9 in Table 3) in which the GPU applications are picked from C1. Half of the GPU applications in WL0–WL9 are memory intensive, while the rest are memory non-intensive. For the CPU workloads in WL0–WL9, there are three combination types (i.e., NN, IN, and II) of the two applications. The ten generated heterogeneous workloads cover most cases where EMU may act variably. We also construct two extra workloads, WL10 and WL11, which contain one GPU application from C2 and C3, respectively.

5.2. Simulation Platform

Since the CPU-GPU CC-NUMA has not been shipped by any industrial vendor, we simulate a GPU system attached to a heterogeneous GDDR5-DDR3 DRAM subsystem. Our system simulation is performed on gem5-gpu (Power et al., 2015), and its configuration is listed in Table 4.

The GPU subsystem includes 8 SMs. Each SM has computational capability similar to that of the SMs in Fermi and runs at $600\,\mathrm{MHz}$. The memory bandwidth per shared-core-clock is comparable to, and even higher than, that of real high-end heterogeneous processors integrating a similar GPU unit (Inc, [n. d.]). As such, our platform resembles a real product and supports a fair evaluation.

The page size is set to 4KB, a typical size adopted widely. To avoid the bottleneck of GPGPU TLB and expose the limitation of DRAM bandwidth in heterogeneous shared memory systems, we also optimize the GPU TLB design in our heterogeneous system including per-SM TLB, highly-threaded PTW and shared L2 TLB (Power et al., 2014). We choose the configuration with CCWS in (Rogers et al., 2012) as our baseline.

We estimate the GDDR5 DRAM energy consumption through a modified Micron DRAM power calculator (Micron, [n. d.]a) based on the datasheet (Micron, [n. d.]b); the DDR3 DRAM energy consumption is directly obtained from Micron DRAM power calculator by feeding the run-time statistics generated from gem5-gpu.

To evaluate the effectiveness of TEMP and TBAS, we compared the following approaches:

• CCWS refers to the design for improving the L1 cache locality in GPU proposed by (Rogers et al., 2012). The results of CCWS are used as the normalization basis in our evaluations.

• OWL denotes the optimized scheduling method proposed by (Jog et al., 2013a), which improves the performance through optimizing the cache and memory accesses in GPU systems.

• TEMP denotes the thread batch enabled memory partitioning scheme presented in Section 3.

• TEMP+TBAS refers to the design integrating TEMP and TBAS.

• BW-AWARE denotes a synergistic bandwidth-aware page placement policy in (Agarwal et al., 2015). It places the GPU pages across the heterogeneous memory system, i.e., GDDR5 and DDR3 DRAM, so that the memory bandwidth of both is shared across GPU pages.

• Batching+BW refers to the scheme that combines TEMP, TBAS, and BW-AWARE.
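To make the BW-AWARE idea concrete, the following is a minimal sketch of proportional page placement, in which the share of pages placed in GDDR5 tracks GDDR5's share of the combined bandwidth. It is an illustration only and not the policy implementation of (Agarwal et al., 2015); the function name, the deterministic placement rule, and the 3:1 bandwidth ratio are assumptions.

```python
def bw_aware_placement(num_pages: int, bw_gddr5: float, bw_ddr3: float) -> list:
    """Place pages so that the GDDR5:DDR3 page ratio follows the bandwidth ratio."""
    if bw_gddr5 <= 0 or bw_ddr3 <= 0:
        raise ValueError("bandwidths must be positive")
    target = bw_gddr5 / (bw_gddr5 + bw_ddr3)  # desired fraction of pages in GDDR5
    placement = []
    in_gddr5 = 0
    for i in range(num_pages):
        # Use GDDR5 while its share of the pages placed so far lags the target.
        if in_gddr5 < target * (i + 1):
            placement.append("GDDR5")
            in_gddr5 += 1
        else:
            placement.append("DDR3")
    return placement


# Hypothetical 3:1 bandwidth ratio between GDDR5 and DDR3.
pages = bw_aware_placement(num_pages=8, bw_gddr5=3.0, bw_ddr3=1.0)
print(pages.count("GDDR5"), pages.count("DDR3"))
```

Under the assumed 3:1 ratio, six of the eight pages land in GDDR5 and two in DDR3, so both memories contribute bandwidth in proportion to what they can deliver.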

6. Result

6.1. Evaluation Results for GPU Applications

6.1.1. Performance

We first evaluate and analyze the performance and local access ratio to each memory bank across the different designs for the GPU applications. Here, local access denotes the memory access from the SM associated with the banks, while remote access refers to the access from other SMs. According to the performance results under the TEMP design and the evaluated local access ratio, GPU applications are classified into the following three categories:

• C1: These applications present a high local access ratio (on average $>$ 99%) and significant performance improvement across all the configurations that employ TEMP.

• C2: Similar to C1, the applications in C2 also demonstrate a high local access ratio ($>$ 93%). In contrast, they present a slight performance reduction ($\sim$ 1%) under TEMP, yet an effective performance improvement under TEMP+TBAS.

• C3: The applications in C3 do not have a high local access ratio due to their intrinsic thread-data mapping and memory access pattern. Their overall performance with TBAS and TEMP applied is degraded compared with that of CCWS.
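The local access ratio used in this classification can be computed from an access trace as sketched below. The trace format (pairs of SM id and bank id) and the bank-to-SM ownership map are hypothetical; in the paper the ownership follows from the page coloring applied by TEMP.

```python
def local_access_ratio(accesses, bank_owner) -> float:
    """Fraction of accesses that hit a bank owned by the issuing SM.

    accesses:   iterable of (sm_id, bank_id) pairs (hypothetical trace format)
    bank_owner: dict mapping bank_id -> sm_id the bank is bound to
    """
    total = 0
    local = 0
    for sm_id, bank_id in accesses:
        total += 1
        if bank_owner[bank_id] == sm_id:
            local += 1  # local access: the SM reaches one of its own banks
    return local / total if total else 0.0


# Toy configuration: 2 SMs, 4 banks, two banks bound to each SM.
bank_owner = {0: 0, 1: 0, 2: 1, 3: 1}
trace = [(0, 0), (0, 1), (1, 2), (1, 3), (0, 2)]  # the last access is remote
print(local_access_ratio(trace, bank_owner))
```

In this toy trace four of the five accesses are local, giving a ratio of 0.8; the C1 and C2 applications reported here sit above 0.99 and 0.93, respectively.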

The performance results are shown in Fig. 10, and Fig. 11 shows the local access ratio for the GPU applications.

The overall results show that applying TEMP on top of CCWS achieves a 5.7% geometric mean (GM) speedup, while replacing CCWS with TBAS (i.e., TEMP+TBAS) further raises the speedup to 10.3%. In our evaluations, OWL reaches 93.6% of the performance of CCWS across the application workloads. As shown in Fig. 11 and Fig. 12, the cache hit rate of OWL is lower than that of CCWS, and the BLP improvement achieved by OWL is limited. The results verify that considering only a small subset of thread blocks which share pages is insufficient to achieve a remarkable performance improvement. The IPC of TEMP is 12.9% higher than that of OWL. BW-AWARE keeps the page placement ratio the same as the bandwidth ratio between GDDR5 and DDR3, which improves the utilization of the combined bandwidth of both memories. Hence, BW-AWARE gains a 5.1% performance improvement over CCWS, as can be seen from Fig. 10. This gain is consistent with the value reported in (Agarwal et al., 2015) for a similar bandwidth ratio.
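The relation between the three reported numbers can be checked with a short calculation: TEMP at 1.057x of CCWS and OWL at 0.936x of CCWS imply the stated 12.9% advantage of TEMP over OWL. The sketch below also shows how a geometric mean speedup is formed; the per-application speedups passed to it are hypothetical and are not values from the paper.

```python
import math


def geometric_mean(values) -> float:
    """Geometric mean of positive values, computed in log space."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Figures taken from the text, both normalized to CCWS.
temp_vs_ccws = 1.057   # TEMP: 5.7% GM speedup over CCWS
owl_vs_ccws = 0.936    # OWL: 93.6% of the CCWS performance

temp_vs_owl = temp_vs_ccws / owl_vs_ccws
print(f"{(temp_vs_owl - 1) * 100:.1f}%")  # 12.9%

# Hypothetical per-application speedups, only to show how a GM is computed.
print(geometric_mean([1.02, 1.10, 1.05]))
```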

To further evaluate the effects of these designs on the memory requests for the three categories of GPU applications, we summarize the DRAM usage statistics (BLP, RBHR, DRAM access delay) as well as the stalls on the reply network connecting the memory controllers and SMs, which are induced by network congestion. The results are normalized to those of CCWS and shown in Fig. 12.

When TEMP is applied to C1, the BLP of C1 is significantly improved by 58.3%, while the RBHR is increased by 17.8%. As expected, by suppressing the inter-SM interference of memory accesses, TEMP unveils the intrinsic locality and access parallelism of thread batches. In comparison, OWL improves BLP by 16.3% and RBHR by 8.6%. The opportunistic prefetching adopted by OWL boosts RBHR. We also investigated the network congestion between the SMs and the GDDR5 DRAM partitions. The network congestion of OWL is 33.6% higher than that of CCWS. This value quantitatively demonstrates that CCWS has a higher L1 cache hit rate, fewer L2 accesses, and fewer DRAM accesses than OWL. All the above factors together lead to a 17.3% reduction in DRAM access delay with TEMP in C1. Consequently, TEMP achieves an 11.1% performance improvement over CCWS, which is 24.0% higher than OWL. For C1, the BLP in TEMP+TBAS is 9.1% smaller than that in TEMP. This is because the number of active thread batches is intentionally limited to enhance row locality. On the other hand, C1's RBHR in TEMP+TBAS is raised by 33.1% and the DRAM access delay is reduced by 29.9%. More importantly, a considerable reduction in network congestion (18.7%) is observed. As a result, more than 15% performance improvement is achieved by TEMP+TBAS for C1, as shown in Fig. 10.

C2 achieves a high local access rate when TEMP is applied. However, TEMP can hardly increase the BLP of C2 since the BLP of C2 already approaches the theoretical upper bound. For instance, some kernels in NN have only a few thread blocks, fewer even than the bank count. Applying TEMP to those kernels may limit the BLP. Fortunately, TBAS enhances the row locality and reduces the network congestion, resulting in a slight speedup ($\sim$ 2%). As shown in Fig. 10, the performance of C3 in TEMP/TEMP+TBAS is on average degraded/improved by 2.5%/2.3%. Note that it is difficult to formalize the thread-data mapping of the applications in C3. Thus, applying TEMP to C3 prolongs the DRAM access delay.

6.1.2. Energy

The normalized DRAM energy consumption of all configurations is shown in Fig. 13. Generally, the DRAM energy savings come from two main sources: 1) the saving of activate energy, which dominates DRAM energy consumption and can be achieved by increasing RBHR; and 2) the saving of background energy, which is proportional to the reduction of the execution time. Therefore, the DRAM energy reduction depends on both the improved access locality and the overall performance improvement. Our results show that compared to CCWS, the DRAM energy saving of TEMP is 11.2%. TEMP+TBAS saves 20.7% more energy than CCWS because of the significantly improved RBHR. OWL saves 5.9% energy, which is less than TEMP+TBAS as a result of its higher row activation ratio and worse performance. Batching+BW achieves the highest energy saving of 14.2%.
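The two sources of energy saving can be illustrated with a first-order model, in which every row-buffer miss costs one activation and background energy grows linearly with execution time. This is a simplified sketch and not the Micron power calculator used in the evaluation; the function name and all energy and power parameters are placeholder assumptions.

```python
def dram_energy_joules(accesses: int, rbhr: float, exec_time_s: float,
                       e_act_nj: float = 2.0, e_rw_nj: float = 1.0,
                       p_bg_w: float = 0.5) -> float:
    """First-order DRAM energy: activate + read/write + background.

    e_act_nj, e_rw_nj and p_bg_w are placeholder values, not datasheet numbers.
    """
    if not 0.0 <= rbhr <= 1.0:
        raise ValueError("rbhr must be within [0, 1]")
    activations = accesses * (1.0 - rbhr)        # each row-buffer miss opens a row
    activate_j = activations * e_act_nj * 1e-9   # source 1: shrinks as RBHR rises
    read_write_j = accesses * e_rw_nj * 1e-9     # unaffected by locality
    background_j = p_bg_w * exec_time_s          # source 2: shrinks with run time
    return activate_j + read_write_j + background_j


# Hypothetical run: same number of accesses, higher RBHR, shorter execution.
baseline = dram_energy_joules(accesses=10_000_000, rbhr=0.40, exec_time_s=0.010)
improved = dram_energy_joules(accesses=10_000_000, rbhr=0.55, exec_time_s=0.009)
print(f"energy saving: {(1 - improved / baseline) * 100:.1f}%")
```

The model separates the terms on purpose: raising RBHR only touches the activate term, while a shorter run only touches the background term, mirroring the two sources listed above.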

6.2. Evaluation for Heterogeneous Workloads

Fig. 14 shows the performance of the CPU applications (WS-C) and the GPU application (IPC-G) in each heterogeneous workload when TEMP+TBAS is applied. The performance of the CPU applications in a workload is measured by the weighted speedup (Eyerman and Eeckhout, 2008). These results are normalized to the weighted speedup of the same CPU applications running standalone on the heterogeneous system. The IPC of a GPU application is likewise normalized to the IPC obtained when it runs exclusively with TEMP+TBAS. The memory-intensive CPU and GPU applications in the workloads suffer from non-trivial performance degradation due to the contention for shared resources, e.g., the interconnection network and DRAM. In contrast, the performance degradation of memory non-intensive applications is much smaller. The weighted performance of the CPU applications across the twelve workloads is reduced by 11.9%; correspondingly, the IPC of the GPU applications is 9.2% lower than that obtained by TEMP+TBAS running alone.
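For reference, the weighted speedup metric sums, over the co-running applications, each application's IPC in the shared run divided by its IPC when running alone. The sketch below assumes that standard definition and takes the normalization baseline as an explicit argument, since the exact baseline run is a detail of the experimental setup; the IPC values are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone) -> float:
    """Sum of per-application IPC ratios (shared run over standalone run)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("both lists must cover the same applications")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def normalized_weighted_speedup(ipc_shared, ipc_alone, baseline_ws: float) -> float:
    """Weighted speedup divided by a baseline weighted speedup."""
    return weighted_speedup(ipc_shared, ipc_alone) / baseline_ws


# Hypothetical IPCs of two CPU applications, alone and co-running with a GPU kernel.
ipc_alone = [2.0, 1.0]
ipc_shared = [1.5, 0.5]
ws = weighted_speedup(ipc_shared, ipc_alone)
print(ws, normalized_weighted_speedup(ipc_shared, ipc_alone, baseline_ws=2.0))
```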

The effectiveness of TEMP and TBAS is constrained when CPU applications run alongside the GPU application, and the performance of the CPU applications is degraded as well, for two reasons: 1) TBAS expects consecutive thread blocks to access their physical pages within a limited span of rows. The physical addresses of the pages accessed by the CPU applications, however, can mix with those of the pages accessed by the GPU applications, deteriorating the row locality of the GPU applications; 2) even if TBAS successfully preserves the row locality of the GPU applications, the memory controller is likely to keep prioritizing the intensive memory accesses from the GPU and to stall the memory accesses from the CPU.

To address the above problems, we first divide each bank into two portions, one for the CPU and one for the GPU. We reserve the rows with higher addresses in a bank for the CPU and the ones with lower addresses for the GPU. New pages for the CPU and GPU are allocated from the respective reserved address space. As such, most pages of the CPU and GPU are physically separated within a bank, which allows TBAS to keep the row locality of GPU applications when CPU applications are running simultaneously. Secondly, the memory controller is set to always promote the memory accesses from the CPU over the ones from the GPU, as proposed in (Ausavarungnirun et al., 2012). Since most CPU applications are delay-sensitive, unconditionally promoting the memory accesses from the CPU eliminates the risk of memory access starvation on the CPU side. Combining the above two solutions, the performance loss of the CPU/GPU applications is reduced to 6.1%/3.5%, as denoted by Comb-C and Comb-G in Fig. 14. Some workloads that include both CPU and GPU intensive applications (e.g., WL8 and WL9) attain a significant performance improvement from the integrated heterogeneous-aware thread batching.
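The two measures can be sketched together as follows: a per-bank row split that keeps GPU pages in the low rows and CPU pages in the high rows, and a request arbiter that always serves CPU requests first. This is a simplified illustration rather than the staged memory scheduler of (Ausavarungnirun et al., 2012); the bank size, the split point, the `Request` type and the oldest-first tie-break are assumptions.

```python
from dataclasses import dataclass

ROWS_PER_BANK = 1024   # hypothetical bank size
GPU_ROW_LIMIT = 768    # hypothetical split: rows [0, 768) GPU, [768, 1024) CPU


def allowed_rows(source: str) -> range:
    """Rows of a bank from which new pages of the given source are allocated."""
    if source == "gpu":
        return range(0, GPU_ROW_LIMIT)               # lower addresses for the GPU
    if source == "cpu":
        return range(GPU_ROW_LIMIT, ROWS_PER_BANK)   # higher addresses for the CPU
    raise ValueError(f"unknown source: {source}")


@dataclass(frozen=True)
class Request:
    arrival: int   # arrival time at the memory controller
    source: str    # "cpu" or "gpu"
    row: int       # target row within the bank


def pick_next(queue):
    """Serve CPU requests before GPU requests; oldest first within a source."""
    return min(queue, key=lambda r: (r.source != "cpu", r.arrival))


queue = [Request(0, "gpu", 10), Request(5, "cpu", 800), Request(3, "cpu", 900)]
print(pick_next(queue))
```

Although the GPU request arrived first, the arbiter selects the CPU request with arrival time 3, which reflects the unconditional CPU promotion described above.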

The solutions mentioned above are simple yet capable of keeping the effectiveness of TEMP and TBAS for GPU applications while preventing considerable performance loss for CPU applications. We believe more sophisticated techniques can further balance the throughput between CPU and GPU (Kayıran et al., 2014). However, a balanced throughput design is beyond the scope of this paper and is left for future work.

7. Related Works

7.1. Memory Partitioning in Multi-core Systems

In multi-core systems, memory bank partitioning (MBP) binds a thread to one or more memory banks. Every thread accesses its own private banks to avoid interference from other threads. Mi et al. (Mi et al., 2010) first proposed MBP and used modified bank permutation to compensate for the degraded BLP. Jeong et al. (Jeong et al., [n. d.]) used sub-ranking to overcome the BLP degradation of a single thread after applying MBP. Liu et al. (Liu et al., 2012) designed a purely software MBP based on OS page allocation. They also explored the use of MBP in a multi-threaded application, but the result was not very promising because of inter-thread data sharing. Xie et al. (Xie et al., 2014) pointed out that unbalanced memory requirements across threads are the main reason for the BLP degradation and then proposed a dynamic bank partitioning approach to solve this problem. In TEMP, BLP is guaranteed by workload balancing across the SMs, while memory access fairness is guaranteed by the homogeneity of the GPU threads in a kernel. Thread batching in TEMP also alleviates the negative impact of inter-thread data sharing on system performance in multi-threaded applications.

7.2. DRAM Efficiency in GPU

Compiler-assisted data layout transformation (Yang et al., 2010; Sung et al., 2010; Xie et al., 2015) proactively prevents unbalanced accesses to DRAM components by carefully allocating the data, register file or the thread block index. For example, Xie et al. (Xie et al., 2015) put forward a compiler-based framework to balance the register allocation and the targeted thread-level parallelism in the GPU system. However, the compiler-level methods are not aware of any hardware implementation details. Both thread scheduling and DRAM address mapping at the hardware level may offset the optimization brought by the compiler level. The hardware-level approaches of enhancing DRAM usage efficiency in GPU or CPU-GPU systems include:

Enhanced memory schedulers: Jeong et al. (Jeong et al., 2012) designed a QoS-aware memory scheduler for MPSoC with CPUs and GPUs. The DRAM bandwidth allocation between the CPUs and GPUs is dynamically adjusted to meet the frame rate requirement of the GPUs and maximize the overall system throughput. Ausavarungnirun et al. (Ausavarungnirun et al., 2012) proposed a staged memory scheduling framework with affordable hardware cost for heterogeneous systems. We adopt the memory scheduling policy from (Ausavarungnirun et al., 2012) to customize our proposed heterogeneous-aware thread batching.

Enhanced thread scheduler: Jog et al. (Jog et al., 2013a) revealed that serial thread block data layout and sequential thread block dispatching can cause BLP degradation of GPU applications. A scheduler is then designed to improve the BLP by prioritizing the thread blocks in consecutive SMs. The authors also utilized prefetching to compensate for the degradation of row locality. However, if the memory of a GPU is pageable, the effect of prioritized thread scheduling becomes uncertain, because the pages of consecutive thread blocks can be nonconsecutive or not concentrated in a DRAM row. In our scheme, TEMP relies on thread batching and page coloring to improve the BLP and TBAS enhances the row locality, targeting a heterogeneous system design supporting pageable GPU memory.

8. Conclusion

Modern GPUs suffer from the mismatch between thread-level parallelism and DRAM bandwidth. To improve the DRAM usage efficiency of GPU applications, we propose an integrated architectural approach composed of the TEMP and TBAS techniques: TEMP improves memory access parallelism for massively multi-threaded GPU applications by minimizing the memory access interweaving across SMs; and TBAS maximizes the row locality by carefully prioritizing the execution of the thread batches. Heterogeneous-aware thread batching is also introduced to preserve the effectiveness of thread batching when running heterogeneous workloads. Our results show that TEMP+TBAS can achieve up to 10.3% system performance improvement and 11.3% DRAM energy saving compared to the baseline employing CCWS. Using simple, existing solutions, the heterogeneous-aware thread batching still maintains 93.9% CPU performance and 96.5% GPU performance compared to the results of exclusively running CPU and GPU applications.

Acknowledgements.

This work is supported in part by US National Science Foundation under Grant 1725456 and Grant 1615475; Bing Li acknowledges the National Academy of Sciences (NAS), USA for awarding the NRC research fellowship.

Bibliography

Abdel-Majeed and Annavaram (2013) Mohammad Abdel-Majeed and Murali Annavaram. 2013. Warped Register File: A Power Efficient Register File for GPGPUs. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA) (HPCA ’13). IEEE Computer Society, Washington, DC, USA, 412–423. https://doi.org/10.1109/HPCA.2013.6522337
Agarwal et al. (2015) Neha Agarwal, David Nellans, Mark Stephenson, Mike O’Connor, and Stephen W Keckler. 2015. Page placement strategies for GPUs within heterogeneous memory systems. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’15). ACM, New York, NY, USA, 607–618. https://doi.org/10.1145/2694344.2694381
Ausavarungnirun et al. (2012) Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H. Loh, and Onur Mutlu. 2012. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA ’12). IEEE Computer Society, Washington, DC, USA, 416–427. http://dl.acm.org/citation.cfm?id=2337159.2337207
Bakhoda et al. (2010) Ali Bakhoda, John Kim, and Tor M Aamodt. 2010. Throughput-effective on-chip networks for manycore accelerators. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’43). IEEE Computer Society, Washington, DC, USA, 421–432. https://doi.org/10.1109/MICRO.2010.50
Bakhoda et al. (2009) Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. 163–174. https://doi.org/10.1109/ISPASS.2009.4919648
Branover et al. (2012) Alexander Branover, Denis Foley, and Maurice Steinman. 2012. AMD Fusion APU: Llano. IEEE Micro 32, 2 (March 2012), 28–37. https://doi.org/10.1109/MM.2012.2
Che et al. (2009) Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). IEEE Computer Society, Austin, TX, USA, 44–54. https://doi.org/10.1109/IISWC.2009.5306797