ClangJIT: Enhancing C++ with Just-in-Time Compilation

Hal Finkel; David Poliakoff; David F. Richards

arXiv:1904.08555·cs.PL·April 30, 2019


TL;DR

ClangJIT introduces a dynamic, just-in-time compilation extension for C++ that enables runtime template specialization, leading to performance gains and improved productivity with minimal application modifications.

Contribution

It presents a novel JIT extension for C++ that allows runtime template specialization, enhancing performance and productivity in large-scale applications.

Findings

01 Significant performance improvements observed in large-scale applications.

02 Minimal code changes needed for integration.

03 Reduced compilation time through dynamic specialization.




Hal Finkel (ORCID: 0000-0002-7551-7122)

Lead, Compiler Technology and Programming Languages, Leadership Computing Facility, Argonne National Laboratory, 9700 S Cass Ave, Lemont, IL 60439, USA

David Poliakoff

Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, CA 94550, USA

David F. Richards

Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, CA 94550, USA

Abstract.

The C++ programming language is not only a keystone of the high-performance-computing ecosystem but has proven to be a successful base for portable parallel-programming frameworks. As is well known, C++ programmers use templates to specialize algorithms, thus allowing the compiler to generate highly-efficient code for specific parameters, data structures, and so on. This capability has been limited to those specializations that can be identified when the application is compiled, and in many critical cases, compiling all potentially-relevant specializations is not practical. ClangJIT provides a well-integrated C++ language extension allowing template-based specialization to occur during program execution. This capability has been implemented for use in large-scale applications, and we demonstrate that just-in-time-compilation-based dynamic specialization can be integrated into applications, often requiring minimal changes (or no changes) to the applications themselves, providing significant performance improvements, programmer-productivity improvements, and decreased compilation time.

Keywords: C++, Clang, LLVM, Just-in-Time, Specialization

CCS Concepts: Software and its engineering → Just-in-time compilers

1. Introduction and Related Work

The C++ programming language is well-known for its design doctrine of leaving no room below it for another portable programming language (Stroustrup, 2007). As compiler technology has advanced, however, it has become clear that using just-in-time (JIT) compilation to provide runtime specialization can provide performance benefits practically unachievable with purely ahead-of-time (AoT) compilation. Some of the benefits of runtime specialization can be realized using aggressive multiversioning, both manual and automatic, along with runtime dispatch across these pre-generated code variants (this technique has been applied for many decades; e.g., (Byler, 1987)). This technique, however, comes with combinatorial compile-time cost, and as a result, is practically limited. Moreover, most of this compile-time cost from multiversioning is wasted when only a small subset of the specialized variants are actually used (as is often the case in practice). This paper introduces ClangJIT, an extension to the Clang C++ compiler which integrates just-in-time compilation into the otherwise-ahead-of-time-compiled C++ programming language. ClangJIT allows programmers to take advantage of their existing body of C++ code, but critically, defer the generation and optimization of template specializations until runtime using a relatively-natural extension to the core C++ programming language.
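To make the multiversioning baseline concrete, the following is a minimal sketch of manual multiversioning with runtime dispatch in standard C++. The function names (`dot_fixed`, `dot_generic`, `dot`) and the choice of a dot product are illustrative assumptions, not code from ClangJIT or its benchmarks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size-specialized kernel: N is a compile-time constant, so the compiler is
// free to fully unroll the loop.
template <std::size_t N>
double dot_fixed(const double *a, const double *b) {
  double sum = 0.0;
  for (std::size_t i = 0; i < N; ++i)
    sum += a[i] * b[i];
  return sum;
}

// Generic fallback: the trip count is known only at runtime.
double dot_generic(const double *a, const double *b, std::size_t n) {
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}

// Runtime dispatch across the pre-generated variants. Every case listed here
// is compiled ahead of time, whether or not it is ever executed.
double dot(const std::vector<double> &a, const std::vector<double> &b) {
  assert(a.size() == b.size());
  switch (a.size()) {
  case 1: return dot_fixed<1>(a.data(), b.data());
  case 2: return dot_fixed<2>(a.data(), b.data());
  case 3: return dot_fixed<3>(a.data(), b.data());
  case 4: return dot_fixed<4>(a.data(), b.data());
  default: return dot_generic(a.data(), b.data(), a.size());
  }
}
```

Each additional specialized parameter multiplies the number of `case` entries, which is the combinatorial compile-time cost described above.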

A significant design requirement for ClangJIT is that the runtime-compilation process not explicitly access the file system - only loading data from the running binary is permitted - which allows for deployment within environments where file-system access is either unavailable or prohibitively expensive. In addition, this requirement maintains the redistributability of the binaries using the JIT-compilation features (i.e., they can run on systems where the source code is unavailable). For example, on large HPC deployments, especially on supercomputers with distributed file systems, extreme care is required whenever many simultaneously-running processes might access many files on the file system, and ClangJIT should avoid these concerns altogether.

Moreover, as discussed below, ClangJIT achieves another important design goal of maximal, incremental reuse of the state of the compiler. As is well known, compiling C++ code, especially when many templates are involved, can be a slow process. ClangJIT does not start this process afresh for the entire translation unit each time a new template instantiation is requested. Instead, it starts with the state generated by the AoT compilation process, and from there, adds to it incrementally as new instantiations are required. This minimizes the time associated with the runtime compilation process.

ClangJIT is available online; for more information, see:

https://github.com/hfinkel/llvm-project-cxxjit/wiki

1.1. Related Work

Clang, and thus ClangJIT, are built on top of the LLVM compiler infrastructure (Lattner and Adve, 2004). The LLVM compiler infrastructure has been specifically designed to support both AoT and JIT compilation, and has been used to implement JIT compilers for a variety of languages both general purpose (e.g., Java (Reames, 2017), Haskell (Terei and Chakravarty, 2010), Lua (Pall, 2008), Julia (Bezanson et al., 2012)), and domain specific (e.g., TensorFlow/XLA (Abadi et al., 2017)). In addition, C++ libraries implemented using LLVM to provide runtime specialization for specific domains are not uncommon (e.g., TensorFlow/XLA, Halide (Ragan-Kelley et al., 2013)).

Several existing projects have been developed to add dynamic capabilities to the C++ programming language (Binks, 2019). A significant number of these rely on running the compiler as a separate process in order to generate a shared library, and that shared library is then dynamically loaded into the running process (e.g., (Binks et al., 2013; Noack et al., 2017)). Unfortunately, not only do such systems require being able to execute a compiler with access to the relevant source files at runtime, but careful management is required to constrain the cost of the runtime compilation processes, because each time the compiler is spawned it must start processing its inputs afresh. The Easy::JIT project (Caamaño and Guelton, 2018) provides a limited JIT-based runtime specialization capability for C++ lambda functions to Clang, but this specialization is limited to parameter values, because the types are fixed during AoT compilation (a fork of Easy::JIT known as atJIT (Farvardin, 2018) adds some autotuning capabilities). NativeJIT (Hopcroft et al., 2014) provides an in-process JIT for a subset of C for x86_64. The closest known work to ClangJIT is CERN’s Cling (Vasilev et al., 2012) project. Cling also implements a JIT for C++ code using a modified version of Clang, but Cling’s goals are very different from the goals of ClangJIT. Cling effectively turns C++ into a JIT-compiled scripting language, including a REPL interface. ClangJIT, on the other hand, is designed for high-performance, incremental compilation of template instantiations using only information embedded in the hosting binary. Cling provides a dynamic compiler for C++ code, while ClangJIT provides a language extension for embedded JIT compilation, and as a result, the two serve different use cases and have significantly-different implementations.

The hypothesis of this work is that, for production-relevant C++ libraries used for this kind of specialization, incremental JIT compilation can produce performance benefits while simultaneously decreasing compilation time. The initial evaluations presented in Section 5 confirm this hypothesis, and future work will explore applicability and benefits in more detail.

The remainder of this paper is structured as follows: Section 2 discusses the syntax and semantics of ClangJIT’s language extension, Section 3 describes the implementation of ClangJIT’s ahead-of-time-compilation components, Section 4 describes the implementation of ClangJIT’s runtime components, Section 5 contains initial evaluation results, Section 6 discusses future work, and Section 7 concludes the paper.

2. The Language Extension

A key design goal for ClangJIT is natural integration with the C++ language while making JIT compilation easy to use. A user can enable JIT-compilation support in the compiler simply by using the command-line flag -fjit. Using this flag, both when compiling and when linking, is all that should be necessary to make using the JIT-compilation features possible. By itself, however, the command-line flag does not enable any use of JIT compilation. To do that, function templates can be tagged for JIT compilation by using the C++ attribute [[clang::jit]]. An attributed function template provides for additional features and restrictions. These features are:

• Instantiations of this function template will not be constructed at compile time, but rather, calling a specialization of the template, or taking the address of a specialization of the template, will trigger the instantiation and compilation of the template at runtime.

• Non-constant expressions may be provided for the non-type template parameters, and these values will be used at runtime to construct the type of the requested instantiation. See Listing 1 for a simple example.

• Type arguments to the template can be provided as strings. If the argument is implicitly convertible to a const char *, then that conversion is performed, and the result is used to identify the requested type. Otherwise, if an object is provided, and that object has a member function named c_str(), and the result of that function can be converted to a const char *, then the call and conversion (if necessary) are performed in order to get a string used to identify the type. The string is parsed and analyzed to identify the type in the declaration context of the parent of the function triggering the instantiation. Whether types defined after the point in the source code that triggers the instantiation are available is not specified. See Listing 2 for a demonstration of this functionality, with Listing 3 showing some example output.
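The two-step rule for turning an argument into a type-name string can be mimicked in standard C++ with a pair of overloads. This is only a sketch of the conversion order described above: `type_name_string` is a hypothetical helper, not part of ClangJIT, and the real check happens inside Clang's template-argument matching rather than in user code.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Step 1: the argument itself is implicitly convertible to const char *
// (string literals, char pointers, types with a conversion operator).
template <typename T>
typename std::enable_if<std::is_convertible<const T &, const char *>::value,
                        const char *>::type
type_name_string(const T &arg) {
  return arg;
}

// Step 2: otherwise, call the c_str() member and convert its result.
// The returned pointer is only valid while `arg` is alive.
template <typename T>
typename std::enable_if<!std::is_convertible<const T &, const char *>::value,
                        const char *>::type
type_name_string(const T &arg) {
  const char *name = arg.c_str();
  assert(name != nullptr);
  return name;
}
```

With this helper, both `type_name_string("int")` and `type_name_string(std::string("double"))` yield a `const char *` that a runtime could then parse as a type name.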

There are a few noteworthy restrictions:

• Because the body of the template is not instantiated at compile time, decltype(auto) and any other type-deduction mechanisms depending on the body of the function are not available.

• Because the template specializations are not compiled until runtime, they’re not available at compile time for use as non-type template arguments, etc.

Explicit specializations of a JIT function template are not JIT compiled, but rather, compiled during the regular AoT compilation process. If, at runtime, values are specified corresponding to some explicit specialization (which will have already been compiled), the template instantiation is not recompiled, but rather, the already-compiled function is used. An exception to this rule is that a JIT template with a pointer/reference non-type template parameter which is provided with a runtime pointer value will generate a different instantiation for each pointer value. If the pointer provided points to a global object, no attempt is made to map that pointer value back to the name of the global object when constructing the new type. This might seem like a bit of trivia, but has an important implication for the generated code. In general, pointer/reference-type non-type template arguments are not permitted to point to subobjects. This restriction still applies formally to the templates instantiated at runtime using runtime-provided pointer values. This has important optimization benefits: pointers that can be traced back to distinct underlying objects are known not to alias, and these template parameters appear to the optimizer to have this unique-object property. C++ does not yet have a restrict feature, as C does, to represent the lack of pointer aliasing, but this aspect of combining JIT compilation with templates provides C++ with this feature in a limited way. [Footnote 1: Nearly all C++ compilers support some form of C’s restrict keyword as an extension, so this is not the only extension that provides this functionality.]
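The unique-object property can be seen with ordinary pointer non-type template parameters in standard C++. In the sketch below, the `Buffer` type, the two globals, and `scale_into` are hypothetical names chosen for illustration. Standard C++ requires the pointer arguments to be compile-time constants, whereas ClangJIT would accept runtime pointer values in the same position.

```cpp
#include <cassert>
#include <cstddef>

struct Buffer {
  double data[4];
};

// Two distinct complete objects with static storage duration.
Buffer g_src = {{1.0, 2.0, 3.0, 4.0}};
Buffer g_dst = {{0.0, 0.0, 0.0, 0.0}};

// Src and Dst are part of the instantiation's identity. Each must name a
// complete object (not a subobject), so different arguments refer to
// different objects, which an optimizer may treat as non-aliasing.
template <Buffer *Src, Buffer *Dst>
void scale_into(double factor) {
  for (std::size_t i = 0; i < 4; ++i)
    Dst->data[i] = factor * Src->data[i];
}

// Small self-check exercising one instantiation.
void scale_demo() {
  scale_into<&g_src, &g_dst>(2.0);
  assert(g_dst.data[0] == 2.0 && g_dst.data[3] == 8.0);
}
```

Passing `&g_src.data[1]` as a template argument would be rejected, because it is the address of a subobject; that rule is what gives the parameters their distinct-object character.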

3. How it Works: Ahead-of-Time Compilation

Implementing ClangJIT required modifying Clang’s semantic-analysis and code-generation components in non-trivial ways. Clang’s parsing and semantic analysis were extended to allow the [[clang::jit]] attribute to appear on declarations and definitions of function templates. The most-significant modifications were to the code in Clang which determines whether a given template-argument list can be used to instantiate a given function template. When the function template in question has the [[clang::jit]] attribute, two important changes were made. First, for a candidate non-type template argument (e.g., an expression of type int), the candidate is allowed to match without the usual check to determine if constant evaluation is possible. Second, for a candidate template argument that should be a type, if the candidate is instead a non-type argument, logic was added to check for conversion to const char *, first by calling a c_str() method if necessary, and if the conversion is possible, the candidate is allowed to match. Each time a JIT function template is instantiated, the instantiation is assigned a translation-unit-unique integer identifier which will be used during code generation and by the runtime library.

In order for the runtime library to instantiate templates dynamically, it requires a saved copy of Clang’s abstract-syntax tree (AST), which is the internal data structure on which template instantiation is performed. Fortunately, Clang already contains the infrastructure for serializing and deserializing its AST, and moreover, embedding compressed copies of the input source files along with the AST, as part of the implementation of its modules feature. The embedded source files are important so that, should an error occur during template instantiation (e.g., a static_assert is triggered), useful messages can be produced for the user. Reconstructing the parameters used for code generation (e.g., whether use of AVX-2 vector instructions is enabled when targeting the x86_64 architecture) also requires the set of command-line parameters passed by Clang’s driver to the underlying compilation invocation. In addition, in order to allow JIT-compiled code to access local variables in what would otherwise be their containing translation unit, the addresses of such potentially-required variables are saved for use by the runtime library. All of these items are embedded in the compiled object file, resulting, as illustrated by Figure 1, in a "fat" object file containing both the compiled host code and the information necessary to resume the compilation process during program execution.

When emitting a call to a direct callee, and when getting a function pointer to a given function, Clang’s code generation needs to generate a pointer to the relevant function. For JIT-compiled function-template instantiations, Clang is enhanced to generate a call to a special runtime function, __clang_jit, which returns the required function pointer. How this runtime function works will be discussed in Section 4. The runtime-dependent non-type template parameters for the particular template instantiation are packed into an on-stack structure, an on-stack array of strings representing runtime types is formed, and the addresses of these on-stack data structures are provided as arguments to the __clang_jit call. Two additional parameters are passed: first, a mangled name for the template being instantiated (with wildcard character sequences in places where the runtime values and types are used), used in combination with the runtime values to identify the instantiation, and second, the translation-unit-unique identifier for this particular instantiation. This translation-unit-unique identifier is used by the runtime library to look up this particular instantiation in the serialized AST.
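The call-site lowering described above can be imitated with a mock in standard C++. Everything in this sketch is an assumption made for illustration: `mock_jit_lookup` stands in for `__clang_jit` (whose actual signature is not given in the text), its parameter list is invented, the key string is a placeholder rather than a real mangled name, and instead of compiling anything it selects among two pre-built specializations.

```cpp
#include <cassert>
#include <cstring>

// The template whose specialization is requested at runtime.
template <int X>
int scale(int v) { return v * X; }

using ScaleFn = int (*)(int);
using AnyFn = void (*)(); // generic function-pointer type used for transport

// On-stack structure holding the runtime non-type template arguments.
struct PackedValues {
  int x;
};

// Hypothetical stand-in for the runtime entry point. It receives the packed
// values, the array of type strings, an instantiation key, and the
// translation-unit-unique identifier, and returns a function pointer.
AnyFn mock_jit_lookup(const void *values, unsigned values_size,
                      const char *const *type_strings, unsigned type_count,
                      const char *instantiation_key,
                      unsigned instantiation_id) {
  (void)type_strings; (void)type_count;
  (void)instantiation_key; (void)instantiation_id;
  assert(values_size == sizeof(PackedValues));
  PackedValues packed;
  std::memcpy(&packed, values, sizeof packed);
  ScaleFn fn = nullptr;
  if (packed.x == 2) fn = &scale<2>;
  if (packed.x == 3) fn = &scale<3>;
  return reinterpret_cast<AnyFn>(fn);
}

// Roughly what a call such as scale<x>(v), with x known only at runtime,
// turns into in this sketch: pack, look up, cast back, call.
int call_scale(int x, int v) {
  PackedValues packed = {x};
  AnyFn raw = mock_jit_lookup(&packed, sizeof packed, nullptr, 0,
                              "scale<?>", 0);
  assert(raw != nullptr);
  return reinterpret_cast<ScaleFn>(raw)(v);
}
```

The point of the sketch is the shape of the protocol, not the lookup itself: the caller never names a concrete specialization, it only hands over values and identifiers and calls through whatever pointer comes back.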

Finally, Clang’s driver code was updated so that use of the -fjit flag not only enables processing of the [[clang::jit]] attributes during compilation, but also causes Clang’s implementation libraries to be linked with the application, and when producing dynamically-linked executables, -rdynamic is implied. -rdynamic is used so that the runtime library can use the executable’s exported dynamic symbol table to find external symbols from all translation units comprising the program (the alternative would require essentially duplicating this information in the array of local variables passed to the runtime library).

3.1. CUDA Support

In order to support JIT compilation of CUDA kernels for applications targeting NVIDIA GPUs, ClangJIT includes support for the CUDA programming model built on Clang’s native CUDA support (Wu et al., 2016). The primary challenge in supporting CUDA in ClangJIT derives from the fact that, when compiling CUDA code, the driver invokes the compiler multiple times: Once to compile for the host architecture and once for each targeted GPU architecture (e.g., sm_35). At runtime, the state of not only the host-targeting compiler invocation must be reconstructed, but also the state of one of these GPU-targeting compiler invocations (whichever most closely matches the device being used at runtime). Fortunately, Clang’s CUDA compilation workflow compiles code for the GPUs first, and only once all GPU-targeting compilation is complete, is the compiler targeting the host invoked. When JIT compilation and CUDA are both enabled, the driver creates a temporary LLVM bitcode file, and each GPU-targeting compilation saves the serialized AST, command-line parameters, and some additional metadata into a set of global variables in this bitcode file. The implementation takes advantage of LLVM’s appending linkage feature so that each GPU-targeting invocation can easily add pointers to its state data to an array of similar entries from all GPU-targeting invocations within the bitcode file. When the host-targeting compilation takes place, the bitcode file is loaded and linked into the LLVM module Clang is constructing and the address of the relevant global variable (which contains pointers to all of the other device-compilation-relevant global variables) becomes an argument to calls to __clang_jit. As illustrated by Figure 1, all of this information ends up embedded in the host object file.

4. How it Works: Runtime Compilation

ClangJIT’s runtime library is large, mainly because it includes all of Clang and LLVM, but it has only one entry point used by the compiler-generated code: __clang_jit. This function is used to instantiate function templates at runtime. The ClangJIT-specific parts of the runtime library are approximately two thousand lines of code, and an outline of the implementation is provided in Algorithm 1. The runtime library has a program-global cache of generated instantiations, but otherwise keeps separate state for each translation unit making use of JIT compilation. This per-translation-unit state is largely composed of two parts: first, a Clang compiler instance holding the AST and other data structures, and second, an in-memory LLVM IR module containing externally-available definitions. This LLVM IR module is initially populated by loading the optimized LLVM IR for the translation unit that is stored in the running binary, and marking all definitions with externally-available linkage, thus allowing the definitions to be analyzed and inlined into newly-generated code.

When the __clang_jit function is called to retrieve a pointer to a requested template instantiation, the instantiation is looked up in the cache of instantiations. This cache uses LLVM’s DenseMap data structure, along with a mutex, and while these constructs are designed to have high performance, this lookup can have noticeable overhead. If the instantiation does not exist, as outlined in Algorithm 1, the instantiation is created along with any other new dependencies (e.g., a static function not otherwise used in the translation unit might now need to be emitted along with the requested instantiation). A new LLVM IR module is created by Clang, and that LLVM IR module is merged with the module containing the externally-available definitions. This combined module is optimized and JIT compiled. The newly-generated LLVM IR is then also merged, with externally-available linkage, into the module containing externally-available definitions for all previously-generated code. These externally-available definitions form a kind of cache used to enable inlining of previously-generated code into newly-generated code. This is important because if the incremental code generation associated with JIT compilation could not inline code generated in other stages, then the JIT-compiled code would likely be slower than the AoT-compiled code.
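The lookup-then-compile flow can be sketched as a mutex-guarded map from an instantiation key to a function pointer. This sketch swaps in `std::unordered_map` for LLVM's DenseMap, and the class name, the key layout, and the `Compiler` callback are assumptions made for illustration; the real runtime builds and optimizes an LLVM IR module where this sketch merely invokes a callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using FnPtr = void (*)();

class InstantiationCache {
public:
  // Callback standing in for "instantiate, optimize, and JIT compile".
  using Compiler = std::function<FnPtr(const std::string &key)>;

  explicit InstantiationCache(Compiler compile)
      : compile_(std::move(compile)) {}

  FnPtr get(const std::string &mangled_name, const void *values,
            std::size_t values_size,
            const std::vector<std::string> &type_strings) {
    assert(values != nullptr || values_size == 0);
    // Illustrative key: template name, raw value bytes, then type strings.
    std::string key = mangled_name;
    key.push_back('\0');
    if (values_size != 0)
      key.append(static_cast<const char *>(values), values_size);
    for (const std::string &type : type_strings) {
      key.push_back('\0');
      key += type;
    }

    // Every call pays for this lock and hash lookup, even on a cache hit.
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key);
    if (it != cache_.end())
      return it->second;
    FnPtr fn = compile_(key); // slow path, taken once per distinct key
    ++compile_count_;
    cache_.emplace(std::move(key), fn);
    return fn;
  }

  std::size_t compile_count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return compile_count_;
  }

private:
  Compiler compile_;
  mutable std::mutex mutex_;
  std::unordered_map<std::string, FnPtr> cache_;
  std::size_t compile_count_ = 0;
};
```

Holding the lock across the compile step keeps the sketch simple but serializes all compilations; it also makes visible why a call that does very little work can be dominated by the lookup itself.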

Clang already uses LLVM’s virtual-file-system infrastructure and this allows isolating it from any file-system access at runtime. As noted earlier, it is important to avoid JIT compilation triggering any file-system access, and so, the Clang compiler instance created by the runtime library is provided only with an in-memory virtual-file-system provider. That provider only contains specific in-memory files with data from the running binary’s compiler-created global variables.

4.1. CUDA Support

To support JIT compilation of CUDA code, which, in interesting cases, will require JIT compilation of CUDA kernels, dedicated logic exists in ClangJIT’s runtime library. First, the AoT-compilation might have compiled device code for multiple device architectures (e.g., for sm_35 and sm_70). The CUDA runtime library is queried in order to determine the compute capability of the current device, and based on that, ClangJIT’s runtime selects which device compiler state to use for JIT compilation. As for the host, the serialized AST, command-line options, and optimized LLVM IR are loaded to create a device-targeting compiler instance. During the execution of Algorithm 1, at a high level, whatever is done with the host compiler instance is also done with the device compiler instance. However, instead of taking the optimized IR and handing it off to LLVM’s JIT engine, the device compiler instance is configured to run the NVPTX backend and generate textual PTX code. This PTX code is then wrapped in an in-memory CUDA fatbin file, and that fatbin file is provided to the host compiler instance to embed in the generated module in the usual manner for CUDA compilation. [Footnote 2: Unfortunately, the format of the fatbin files is not documented by NVIDIA, but given the information available in https://reviews.llvm.org/D8397 and in (Diamos et al., 2010), we were able to create fatbins that are functional for this purpose.]
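The selection of a device compiler state can be sketched as a small pure function. The policy shown here (pick the highest compiled architecture that does not exceed the device's compute capability) is an assumption; the text only says that the closest match is used. Architectures are encoded as integers such as 35 for sm_35, and `select_device_state` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the compiled architecture to use for a device with
// the given compute capability, or -1 if none of them can run on it.
int select_device_state(const std::vector<int> &compiled_archs,
                        int device_arch) {
  assert(device_arch > 0);
  int best = -1;
  for (std::size_t i = 0; i < compiled_archs.size(); ++i) {
    const int arch = compiled_archs[i];
    if (arch > device_arch)
      continue; // compiled for a newer architecture than the device offers
    if (best < 0 || arch > compiled_archs[static_cast<std::size_t>(best)])
      best = static_cast<int>(i);
  }
  return best;
}
```

For a binary compiled for sm_35 and sm_70, a device reporting compute capability 3.7 would select the sm_35 state under this policy.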

5. Initial Evaluation

We present here an evaluation of ClangJIT on three bases: ease of use, compile-time reduction, and runtime-performance improvement. First, we’ll discuss the performance of ClangJIT by making use of a microbenchmark which relies on the Eigen C++ matrix library (Guennebaud et al., 2010). [Footnote 3: The microbenchmark presented here is an adapted version of https://github.com/eigenteam/eigen-git-mirror/blob/master/bench/benchmark.cpp.] Consider a simple benchmark which repeatedly calculates $M=I+5\times 10^{-5}(M+M^{2})$ for a matrix $M$ of some $N \times N$ size. We’re interested here in cases where $N$ is small, and examples from real applications where such small loop bounds occur will be presented in the following sections. Listing 4 excerpts the version of the benchmark where dynamic matrix sizes are handled in the traditional manner. Listing 5 excerpts the version of the benchmark where code for dynamic matrix sizes is generated at runtime using JIT compilation.
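The difference between the two versions of the benchmark can be illustrated without Eigen. The sketch below implements one update step $M \leftarrow I+5\times 10^{-5}(M+M^{2})$ twice: once with the size as a non-type template parameter and once with a runtime size. The function names and the row-major flat storage are assumptions for illustration and do not reproduce the benchmark's listings.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-size step: N is a compile-time constant, so all loop bounds are
// known to the optimizer and the temporary lives on the stack.
template <std::size_t N>
void step_fixed(std::array<double, N * N> &m) {
  std::array<double, N * N> sq{}; // holds M^2, zero-initialized
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      for (std::size_t k = 0; k < N; ++k)
        sq[i * N + j] += m[i * N + k] * m[k * N + j];
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      m[i * N + j] =
          (i == j ? 1.0 : 0.0) + 5e-5 * (m[i * N + j] + sq[i * N + j]);
}

// Dynamic-size step: the same arithmetic, but the bounds are runtime values
// and the temporary is heap-allocated.
void step_dynamic(std::vector<double> &m, std::size_t n) {
  assert(m.size() == n * n);
  std::vector<double> sq(n * n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        sq[i * n + j] += m[i * n + k] * m[k * n + j];
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      m[i * n + j] =
          (i == j ? 1.0 : 0.0) + 5e-5 * (m[i * n + j] + sq[i * n + j]);
}
```

With stock C++, `step_fixed<N>` is usable only for sizes chosen at compile time; the JIT extension is what would let a size read at runtime select that form.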

The Eigen library was chosen for this benchmark because the library supports matrices of both compile-time size (specified as non-type template parameters) and runtime size (specified as constructor arguments). When using JIT compilation, we can use the non-type-template-parameter method to specify sizes known only during program execution. First, we’ll examine the compile-time advantages that ClangJIT offers over both traditional AoT compilation and over the up-front compilation of numerous potentially-used template specializations. In Figure 2, we present the AoT compilation time for this benchmark in various configurations. [Footnote 4: This benchmarking was conducted on an Intel Xeon E5-2699 using the flags -march=native -ffast-math -O3, and using a ClangJIT build compiled using GCC 8.2.0 with CMake’s RelWithDebInfo mode.] For all of these times, we subtracted a baseline compilation time of 2.58s - the time required to compile a trivial main() function with the Eigen header file included. The time identified by "J" indicates the AoT compilation time for the benchmark when only the JIT-compilation-based implementation is present (i.e., that in Listing 5 plus the associated main function). The time identified by "A1" indicates the compilation time for the generic version, compiled only using the scalar type double (i.e., Listing 4 but with only the double case present). As can be seen, this takes significantly longer than the AoT compilation time for the JIT-based version. Moreover, the JIT-based version can be used with any scalar type. If we try to replicate that capability with the traditional AoT approach, and thus instantiate the implementation for float, double, and long double, as shown in Listing 4, then the AoT compilation time is nearly 7x larger than when using the JIT capability. The reported "J" time does omit the time spent at runtime compiling the necessary specializations. Here we show the compilation time for different specializations (i.e., with both the size and type fixed), for a scalar type of double, where the size was 16 for "S16", 7 for "S7", 3 for "S3", and 1 for "S1", and specializations for both sizes 16 and 7 were included in the time "S16a7" (to demonstrate that the specialization compilation times are roughly additive). It is useful to note that the specialization compilation time depends on the size, but is always less than that of the single size-generic implementation in "A1". As the size becomes larger, the difference in the work the compiler must do to handle the specialization compared to handling the generic version shrinks.

Figure 3 shows the performance of the benchmark, using type double, for several sizes. For small sizes, when the size is known, the compiler can unroll the loops and perform other optimizations to produce code that performs significantly better than the generic version. This performance is essentially identical to the performance of the AoT-generated specializations of the same size, but of course, the JIT-based version is more flexible. As the size gets larger, the code the compiler generates becomes increasingly similar to that generated for the generic version, and differences such as generated tail loops come to have a decreasingly-important performance impact, and so for large sizes little performance difference remains between the JIT-compiled code and the generic version.

Figure 4 shows the performance of the benchmark, using type double, adapted to use CUDA. [Footnote 5: The CUDA benchmark was run on an IBM POWER8 host with an NVIDIA Tesla K80 GPU using CUDA 9.2.148.] The source code for the CUDA adaptation is straightforwardly derived from the original where the matrix computation is moved into a kernel, and that kernel is executed using one GPU thread. [Footnote 6: A call to cudaThreadSetLimit(cudaLimitMallocHeapSize, $\ldots$) was inserted in the non-JIT implementation to allow dynamic allocation to work on the device.] This is meant to serve as a proxy for part of a larger computation, presumably running on many threads, and so we look at two metrics: First, we look at the runtime performance of the JIT-compiled kernel compared to the generic AoT-compiled kernel, and second, we look at the number of registers used by the various kernels. Using a large number of registers can limit GPU occupancy, and so when considered in the context of a larger calculation, if the JIT-compiled kernel uses a smaller number of registers than the AoT-compiled generic implementation, that provides additional performance benefits. As shown in Figure 4, for matrix sizes $1 \times 1$ through $7 \times 7$, the serial performance of the JIT-compiled GPU kernels is one to two orders of magnitude better than the generic AoT-compiled version. [Footnote 7: The compilation of the Eigen matrix type for sizes larger than $7 \times 7$ failed, because some required template specializations were not available, a problem that did not occur when compiling for the host, and so $7 \times 7$ is the largest size shown.] As shown in Figure 5, on top of that, the smaller JIT-compiled kernels used far fewer registers than the generic AoT-compiled version. [Footnote 8: Repeating the experiment with type float shows the JIT-compiled kernels always use fewer registers than the AoT-compiled generic version. Loop unrolling and other compiler optimizations can increase register pressure, so while the specialized code can use many fewer registers, specialization is not guaranteed to reduce register usage.]

In the following we present a case study on using JIT compilation transparently, and then two case studies where ClangJIT was used to improve two open-source proxy applications developed by Lawrence Livermore National Laboratory (LLNL): Kripke and Laghos. These proxy applications make use of the RAJA library (Hornung and Keasler, 2014), also developed at LLNL, as do the corresponding production applications. Care was taken to ensure that the techniques used to adapt these proxy applications to make use of ClangJIT can be applied to the production applications that they represent. Both compile-time and runtime performance data was collected with the assistance of the Caliper (Boehme et al., 2016) tool.

5.1. RAJA and Transparent Usage

The RAJA library aims to abstract underlying parallel programming models, such as OpenMP and CUDA, allowing portable applications to be created by making use of RAJA’s programming-model-neutral multi-dimensional-loop templates. It is also possible to hide the use of ClangJIT’s JIT-compilation behind RAJA’s abstraction layer, allowing applications to make use of JIT compilation without changes to the application’s source code. To illustrate how this works, consider the simple loop abstraction shown in Listing 6.

A user might desire that the loop is compiled at runtime, allowing the optimizer to specialize the code for the particular index range provided. As shown in Listing 7, this can be done by wrapping the template using one marked for JIT compilation, and in doing so, allows using JIT compilation without changing the interface, the forall template in this illustration, used by the application.
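
The listings themselves are not reproduced in this text. The sketch below illustrates the wrapping idea only and should not be read as the paper's Listing 6 or Listing 7. The names forall_fixed, sum_generic, and sum_fixed are hypothetical. Because standard C++ requires template arguments to be compile-time constants, the runtime forwarding that ClangJIT would enable appears only in a comment.

```cpp
// Generic loop abstraction: the bounds are ordinary runtime values, so the
// optimizer cannot assume a particular trip count.
template <typename Body>
void forall(int begin, int end, Body body) {
  for (int i = begin; i < end; ++i) body(i);
}

// The same loop with the bounds as non-type template parameters. Each
// (Begin, End) pair is a separate instantiation with a known trip count.
// Assumption: under ClangJIT this template would be the one marked for JIT
// compilation, and forall(begin, end, body) would forward its runtime bounds
// as template arguments, which standard C++ does not permit.
template <int Begin, int End, typename Body>
void forall_fixed(Body body) {
  for (int i = Begin; i < End; ++i) body(i);
}

// Hypothetical helpers that exercise both forms with the same loop body.
inline int sum_generic(int begin, int end) {
  int s = 0;
  forall(begin, end, [&](int i) { s += i; });
  return s;
}

template <int Begin, int End>
int sum_fixed() {
  int s = 0;
  forall_fixed<Begin, End>([&](int i) { s += i; });
  return s;
}
```

In this sketch the application-facing interface stays forall; only the library-side wrapper would change when JIT compilation is enabled.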

This serves to illustrate how the JIT capability can be transparently used by an application using a RAJA-like abstraction. Doing this, however, exposes a number of tradeoffs. First, the JIT compilation itself takes time. Second, the process of looking up already-compiled template instantiations also has an overhead that can be significant compared to a compiled function doing very little computational work per invocation. In essentially all of the evaluations presented in this paper, it is this lookup overhead that is most important. To illustrate the interplay between these overheads and how a realistic RAJA abstraction around JIT compilation can be constructed, we present a simple matrix-multiplication benchmark in Listing 8 which uses the RAJA library abstraction in Listing 9. As can be seen, this abstraction is more complicated than that in Listing 7, because it uses parameter packs to handle RAJA multi-dimensional ranges. This abstraction code would be placed in the RAJA library, however, thus absolving the application developer from dealing with these complexities.
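
To make the lookup overhead concrete, the following sketch models the table of already-compiled instantiations as an ordinary map and counts how many lookups occur with and without batching. It is an illustration under stated assumptions, not ClangJIT's runtime: the names scaled, instantiations, run_unbatched, and run_batched are hypothetical, and the table is populated ahead of time instead of by a JIT compiler.

```cpp
#include <cstddef>
#include <map>

using Kernel = long (*)(long);

// Stand-in for a specialized kernel; N plays the role of a runtime value that
// has been turned into a template argument.
template <int N>
long scaled(long x) { return x * N; }

// Stand-in for the runtime's table of already-compiled instantiations.
// Assumption: a real JIT would compile on a miss; here the table is fixed.
inline const std::map<int, Kernel>& instantiations() {
  static const std::map<int, Kernel> table = {
      {2, &scaled<2>}, {3, &scaled<3>}, {4, &scaled<4>}};
  return table;
}

struct Result {
  long value;
  std::size_t lookups;
};

// One table lookup per invocation of the kernel.
inline Result run_unbatched(int n, std::size_t iterations) {
  Result r{0, 0};
  for (std::size_t i = 0; i < iterations; ++i) {
    Kernel k = instantiations().at(n);
    ++r.lookups;
    r.value += k(1);
  }
  return r;
}

// One table lookup per batch of invocations; batch must be greater than zero.
inline Result run_batched(int n, std::size_t iterations, std::size_t batch) {
  Result r{0, 0};
  for (std::size_t done = 0; done < iterations; done += batch) {
    Kernel k = instantiations().at(n);
    ++r.lookups;
    const std::size_t left = iterations - done;
    const std::size_t count = left < batch ? left : batch;
    for (std::size_t j = 0; j < count; ++j) r.value += k(1);
  }
  return r;
}
```

Both functions compute the same value; only the number of lookups differs, which is the quantity that batching amortizes.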

We present in Figure 6 performance data from around the transition to profitability for various small matrix sizes, for various batch sizes (which determine how many invocations share one lookup of the already-compiled template instantiation), and for different numbers of total iterations (which determine how much work the JIT compilation itself is amortized over). As can be seen, the compilation time can be significant if the compiled code does not perform enough computational work (as is the case for the $2\times 2$ runs with fewer than $10^{9}$ total iterations), and moreover, the instantiation lookup adds significant overhead (the speedup of the runs with smaller batch sizes is generally smaller than that of the runs with larger batch sizes).
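
One way to reason about these two amortization effects is a simple cost model. This model is an assumption introduced here for illustration and is not taken from the paper: let $N$ be the total number of iterations, $B$ the batch size, $t_g$ and $t_s$ the per-iteration times of the generic and specialized code, $t_\ell$ the cost of one instantiation lookup, and $T_c$ the one-time JIT compilation cost. The speedup is then approximately

$$S(N,B) \approx \frac{N\,t_g}{T_c + \lceil N/B \rceil\, t_\ell + N\,t_s}.$$

Approximating $\lceil N/B \rceil$ by $N/B$, JIT compilation is profitable ($S > 1$) only when $t_g - t_s > t_\ell / B$, and in that case only once

$$N > \frac{T_c}{t_g - t_s - t_\ell / B}.$$

Under this model, a larger $B$ lowers the per-iteration lookup cost and a larger $N$ spreads $T_c$ over more work, which matches the qualitative trends described above.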

5.2. Application: Kripke

Kripke (Kunen et al., 2015) is a proxy application representing three-dimensional deterministic transport applications. Such an application has many nested loops which iterate over directions, zones, and energy groups. On different architectures, the optimal nesting order of those loops changes, as do the layouts of many data structures within the application. Managing this in an abstract way pushes Kripke to use bleeding-edge features from RAJA and modern C++. Unfortunately, constraints of C++ make such abstractions unfriendly to write, and as shown in Listing 10, Kripke has to maintain a lot of plumbing code to select among all the possible layouts and loop-execution mechanisms based on runtime user selections. This also means that every possible variant of the loop has to be compiled ahead of time.

With a JIT compiler, however, runtime user selections are trivial, and as illustrated in Listing 11, these selections can be replaced with a much simpler dispatch mechanism, and the compilation of these selections happens only as needed. Moreover, removing the extra kernel variants speeds up the AoT compilation of many files by 2-3x, and leads to code that is easier to write and maintain. Specifically, the files Scattering.cpp (68% compilation-time decrease), LTimes.cpp (58% compilation-time decrease), and LPlusTimes.cpp (56% compilation-time decrease), from which specializations were removed and replaced by uses of JIT compilation, exhibited these AoT-compilation-time improvements.
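
The kind of ahead-of-time plumbing being replaced can be sketched as follows. This is not Kripke's code and does not reproduce Listing 10 or Listing 11: the Nesting enumeration, the Extents struct, and the linear_index functions are hypothetical stand-ins for a layout choice over directions (d), groups (g), and zones (z).

```cpp
#include <stdexcept>

// Hypothetical runtime selection of the loop nesting / data layout order.
enum class Nesting { DGZ, DZG, GDZ, GZD, ZDG, ZGD };

struct Extents { int d, g, z; };

// One specialization per layout: the first letter is the slowest-varying
// index. Every variant must be compiled ahead of time.
template <Nesting N>
int linear_index(Extents e, int d, int g, int z);

template <> inline int linear_index<Nesting::DGZ>(Extents e, int d, int g, int z) { return (d * e.g + g) * e.z + z; }
template <> inline int linear_index<Nesting::DZG>(Extents e, int d, int g, int z) { return (d * e.z + z) * e.g + g; }
template <> inline int linear_index<Nesting::GDZ>(Extents e, int d, int g, int z) { return (g * e.d + d) * e.z + z; }
template <> inline int linear_index<Nesting::GZD>(Extents e, int d, int g, int z) { return (g * e.z + z) * e.d + d; }
template <> inline int linear_index<Nesting::ZDG>(Extents e, int d, int g, int z) { return (z * e.d + d) * e.g + g; }
template <> inline int linear_index<Nesting::ZGD>(Extents e, int d, int g, int z) { return (z * e.g + g) * e.d + d; }

// The plumbing: a runtime value is mapped onto the matching instantiation.
// Assumption: with ClangJIT, this switch could collapse to one call that
// passes the runtime selection as the template argument.
inline int linear_index_dispatch(Nesting n, Extents e, int d, int g, int z) {
  switch (n) {
    case Nesting::DGZ: return linear_index<Nesting::DGZ>(e, d, g, z);
    case Nesting::DZG: return linear_index<Nesting::DZG>(e, d, g, z);
    case Nesting::GDZ: return linear_index<Nesting::GDZ>(e, d, g, z);
    case Nesting::GZD: return linear_index<Nesting::GZD>(e, d, g, z);
    case Nesting::ZDG: return linear_index<Nesting::ZDG>(e, d, g, z);
    case Nesting::ZGD: return linear_index<Nesting::ZGD>(e, d, g, z);
  }
  throw std::invalid_argument("unknown nesting order");
}
```

Each additional runtime choice (for example, an execution policy) multiplies the number of cases in such a switch, which is the maintenance and compile-time cost the paragraph above describes.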

5.3. Application: Laghos

Laghos (Laghos Team, 2019) is a higher-order finite-element-method (FEM) proxy application solving the time-dependent Euler equations of compressible gas dynamics in a moving Lagrangian frame using unstructured high-order finite-element spatial discretization (Dobrev et al., 2012) and explicit high-order time-stepping. Higher-order FEM codes are unlike traditional codes solving partial differential equations in that, in addition to a traditional loop over all of the elements in a simulation, there are nested loops within each element whose bounds are determined at runtime and are often small. The relevant parameters (NUM_DOFS_1D and NUM_QUAD_1D in Listing 12) are often between 2 and 32, and they form the bounds on over thirty loops in just one kernel, often to depths of four within the main element loop. Knowing these loop bounds is critical, as shown in Figure 7: not knowing them can cause an 8x slowdown.

Currently, as shown in Listing 12, this performance opportunity is realized by picking the most common orders, instantiating a template function for each, and then dispatching among these explicitly-specialized functions. Since these codes already use template parameters, porting them to use ClangJIT was trivial. We match the performance of the template solution, which is 8x faster than the non-template solution, and make AoT compilation 10x faster. In our experience, higher-order finite-element codes use runtime specialization very frequently, and we believe that JIT compilation will play a huge part in doing this in a more-productive way in the future.
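
The explicit-specialization pattern described above can be sketched as follows. This is a simplified illustration, not the Laghos source or Listing 12: the kernel body, the key encoding, the chosen (NUM_DOFS_1D, NUM_QUAD_1D) pairs, and the names kernel_generic, kernel_fixed, find_specialization, and kernel are all hypothetical.

```cpp
using KernelFn = double (*)(const double*);

// Generic kernel: loop bounds are only known at run time.
inline double kernel_generic(int num_dofs_1d, int num_quad_1d, const double* in) {
  double acc = 0.0;
  for (int q = 0; q < num_quad_1d; ++q)
    for (int d = 0; d < num_dofs_1d; ++d)
      acc += in[d] * (q + 1);
  return acc;
}

// Specialized kernel: loop bounds are template parameters, so the compiler
// can fully unroll and vectorize the small inner loops.
template <int NUM_DOFS_1D, int NUM_QUAD_1D>
double kernel_fixed(const double* in) {
  double acc = 0.0;
  for (int q = 0; q < NUM_QUAD_1D; ++q)
    for (int d = 0; d < NUM_DOFS_1D; ++d)
      acc += in[d] * (q + 1);
  return acc;
}

// Dispatch among the orders that were instantiated ahead of time.
// Returns nullptr when no specialization exists for the requested bounds.
inline KernelFn find_specialization(int dofs, int quad) {
  const unsigned key =
      (static_cast<unsigned>(dofs) << 8) | static_cast<unsigned>(quad);
  switch (key) {
    case (2u << 8) | 4u: return &kernel_fixed<2, 4>;
    case (3u << 8) | 6u: return &kernel_fixed<3, 6>;
    case (4u << 8) | 8u: return &kernel_fixed<4, 8>;
    default: return nullptr;
  }
}

// Entry point: use a specialization if one exists, else the generic kernel.
// Assumption: with ClangJIT the fallback would be unnecessary, because any
// (dofs, quad) pair could be instantiated on demand at run time.
inline double kernel(int dofs, int quad, const double* in) {
  if (KernelFn f = find_specialization(dofs, quad)) return f(in);
  return kernel_generic(dofs, quad, in);
}
```

In this sketch, only the listed orders receive specialized code; any other order silently takes the generic path, which is where the slowdown reported above would appear.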

6. Future Work

Future enhancements to ClangJIT will allow for not just runtime specialization, but adaptive runtime specialization. LLVM’s optimizer can make good use of profiling information to guide function inlining, code placement, and other optimizations. ClangJIT can be enhanced to gather profiling information and use that information to recompile further-optimized code variants. LLVM supports two kinds of profiling: instrumentation-based profiling and sampling-based profiling, and how to best use this capability is under investigation. Moreover, profiling information can be used to guide more-advanced autotuning of the JIT-compiled code, and this is also under investigation.

LLVM contains a number of features that, while not used in support of traditional programming languages (e.g., C, C++, and Fortran) because of the corresponding runtime support required, are used to support dynamic languages, and could be applied to code compiled by ClangJIT at runtime. Implicit null checks, for example, which allow compiled code to elide nearly-always-false null-pointer checks by having the runtime library appropriately handle signals raised by rare memory-access violations, is a good candidate for inclusion in ClangJIT. Applying implicit null checks is dependent on having access to profiling information showing that the relevant checks essentially always succeed because they make the non-null case faster by making the null case very slow. Other LLVM features, such as those supporting dynamic patching and deoptimization, might also eventually be used with ClangJIT.

We will investigate enhancing ClangJIT with support for other accelerator programming models, especially HIP and OpenMP. HIP is a programming model for AMD GPUs, and its implementation in Clang is based on Clang’s CUDA implementation. Because of the similarities between HIP and CUDA, we anticipate the changes necessary to add HIP support will be minor. OpenMP accelerator support, however, is more complicated. Even when targeting NVIDIA GPUs, Clang’s OpenMP target-offload compilation pipeline differs significantly from its CUDA compilation pipeline. Among other differences, for OpenMP target offloading, the host code is compiled first, followed by the device code, and a linker script is used to integrate the various components later. Further enhancements to the scheme described in Section 3 are necessary in order to support a model where host compilation precedes device compilation.

During our explorations of different application use cases, two noteworthy enhancements were identified. First, C++ non-type template parameters currently cannot be floating-point values. However, for many HPC use cases, it will be useful to create specializations given specific sets of floating-point values (e.g., for some polynomial coefficients or constant matrix elements). Currently, this can be done by casting the floating-point values to integers, using those to instantiate the templates, and then casting the integers back to floating-point values inside the JIT-compiled code. The casting, however, is a workaround that we will investigate eliminating. Second, the current implementation does not allow JIT-compiled templates to make use of other JIT-compiled templates; in other words, once ClangJIT starts compiling code at runtime, it assumes that code will not, itself, create more places where JIT compilation might be invoked in the future. This limits applicability to some use cases because it creates an unhelpful barrier between code that can be used during AoT compilation and code that can be used (transitively) in JIT-compiled templates.
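
The floating-point workaround can be sketched in standard C++ as follows. This is one possible reading of "casting", using a bit-for-bit reinterpretation through std::memcpy; the paper does not specify the exact mechanism. The names to_bits, from_bits, and scale are hypothetical, and the sketch assumes a 64-bit IEEE 754 double.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Reinterpret the bits of a double as an integer (no value conversion).
inline std::uint64_t to_bits(double v) {
  std::uint64_t b;
  std::memcpy(&b, &v, sizeof b);
  return b;
}

// Inverse of to_bits.
inline double from_bits(std::uint64_t b) {
  double v;
  std::memcpy(&v, &b, sizeof v);
  return v;
}

// The coefficient travels through the template system as an integer and is
// turned back into a double inside the specialized code, where the optimizer
// can treat it as a constant.
// Assumption: under ClangJIT the argument could be to_bits(x) for a runtime
// x; in standard C++ it must be a compile-time constant, as used below.
template <std::uint64_t CoeffBits>
double scale(double x) {
  const double coeff = from_bits(CoeffBits);
  return coeff * x;
}
```

For example, the bit pattern 0x3FF8000000000000 encodes 1.5 in IEEE 754 binary64, so scale<0x3FF8000000000000ULL>(2.0) multiplies by 1.5.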

This work has inspired a recent proposal to the C++ standards committee (Finkel, 2019), and as discussed in that proposal, the language extension discussed here might not be the best way to design such an extension to C++. As highlighted in that proposal, use of an attribute such as [[clang::jit]] might be suboptimal because:

• The template is not otherwise special, it is the point of instantiation that is special (and the point of instantiation is what might cause compilation to fail without JIT-compilation support).

• The uses of the template look vaguely normal, and so places where the application might invoke the JIT compiler will be difficult to spot during code review.

• The current mechanism provides no place to get out an error or provide a fall-back execution path - except that having the runtime throw an exception might work.

Future work will explore different ways to address these potential downsides of the language extension presented here. For example, it might be better to use some syntax like:

jit_this_template foo<argc>();

where jit_this_template would be a new keyword.

7. Conclusions

In this paper, we’ve demonstrated that JIT-compilation technology can be integrated into the C++ programming language, that this can be done in a way which makes using JIT compilation easy, that this can reduce compile time, making application developers more productive, and that this can be used to realize high performance in HPC applications. We investigated whether JIT compilation could be integrated into applications making use of the RAJA abstraction library without any changes to the application source code at all and found that we could. We then investigated how JIT compilation could be integrated into existing HPC applications. Kripke and Laghos were presented here, and we demonstrated that the existing dispatch schemes could be replaced with the proposed JIT-compilation mechanism. This replacement produced significant speedups to the AoT compilation process, which increases programmer productivity, and moreover, the resulting code is simpler and easier to maintain, which also increases programmer productivity. In the future, we expect to see uses of JIT compilation replacing generic algorithm implementations in HPC applications which represent so many potential specializations that instantiating them during AoT compilation is not practical. We already see this in machine-learning frameworks (e.g., in TensorFlow/XLA) and other domain-specific libraries, and now ClangJIT makes this capability available in the general-purpose C++ programming language. The way that HPC developers think about the construction of optimized kernels, unlike in the past, will increasingly include the availability of JIT compilation capabilities.

Acknowledgements.

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. Additionally, this research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Work at Lawrence Livermore National Laboratory was performed under Contract DE-AC52-07NA27344 (LLNL-CONF-772305). We would additionally like to thank the many members of the C++ standards committee who provided feedback on this concept during the 2019 committee meeting in Kona. Finally, we thank Quinn Finkel for editing and providing feedback.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doe-public-access-plan

Bibliography

Abadi et al. (2017) Martín Abadi, Michael Isard, and Derek G Murray. 2017. A computational model for TensorFlow: an introduction. In Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. ACM, 1–7.
Bezanson et al. (2012) Jeff Bezanson, Stefan Karpinski, Viral B Shah, and Alan Edelman. 2012. Julia: A fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145 (2012).
Binks (2019) Doug Binks. 2019. Runtime Compiled C & C++ Solutions. https://github.com/RuntimeCompiledCPlusPlus/RuntimeCompiledCPlusPlus/wiki/Alternatives.
Binks et al. (2013) Doug Binks, Matthew Jack, and Will Wilson. 2013. Runtime compiled c++ for rapid ai development. Game AI Pro: Collected Wisdom of Game AI Professionals (2013), 201.
Boehme et al. (2016) David Boehme, Todd Gamblin, David Beckingsale, Peer-Timo Bremer, Alfredo Gimenez, Matthew LeGendre, Olga Pearce, and Martin Schulz. 2016. Caliper: performance introspection for HPC software stacks. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 47.
Byler (1987) Mark Byler. 1987. Multiple version loops. In Proceedings of the 1987 International Conference on Parallel Processing. New York.
Caamaño and Guelton (2018) Juan Manuel Martinez Caamaño and Serge Guelton. 2018. Easy::Jit: compiler assisted library to enable just-in-time compilation in C++ codes. In Conference Companion of the 2nd International Conference on Art, Science, and Engineering of Programming. ACM, 49–50.