Lead Image © orson, 123RF.com

Using loop directives to improve performance

Parallelizing Code

OpenACC is a great tool for parallelizing applications for a variety of processors. In this article, I look at one of the most powerful directives, parallel loop. By Jeff Layton

In the last half of 2018, I wrote about critical high-performance computing (HPC) admin tools [1]. Often, HPC admins become programming consultants by helping researchers get started with applications, debug the applications, and improve performance. In addition to administering the system, then, they have to know good programming techniques and what tools to use.

MPI+X

The world is moving toward exascale computing – at least 10^18 floating-point operations per second (FLOPS) – at a rapid pace. Even though most systems aren't exascale, quite a few are at least petascale (>10^15 FLOPS) and use a large number of nodes. Programming techniques are evolving to accommodate petascale systems while getting ready for exascale. Meanwhile, a key programming technique called MPI+X refers to using the Message Passing Interface (MPI) in an application for data communication between nodes while using something else (the X) for application coding within the node.

The X can refer to any of several tools or languages, including the use of MPI across all nodes (i.e., MPI+MPI), which has been a prevalent programming technique for quite a while. Classically, each core assigned to an application is assigned an MPI rank and communicates over whatever network exists between the nodes. To adapt to larger and larger systems, data communication now uses multiple levels: MPI ranks within the same node can communicate directly without a network interface card (NIC), whereas ranks on different physical nodes communicate through the NIC. Networking techniques can take advantage of specific topologies to reduce latency, improve bandwidth, and improve scalability.

Directives

A popular X category is the directive [2], which includes OpenMP and OpenACC, both of which were formed to standardize on directives that are not specific to a machine, operating system, or vendor. Directives are also referred to as "pragmas" and instruct the compiler to perform certain code transformations before compiling the resulting code.

If the compiler doesn't understand the directive (pragma), it will ignore it. This feature is important because it allows for a single codebase, reducing the likelihood of adding errors to the code. For example, you can place OpenMP directives in your serial code and still run the code in either serial mode or parallel mode, depending on your compiler setting. In C/C++ code, a pragma will look like #pragma token-string. For instance,

#pragma omp parallel for

might be all that's needed to parallelize a simple for loop. In this article, I look at OpenACC, a directives-based approach to parallelizing code and improving code performance.
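To illustrate the single-codebase point, here is a minimal C sketch (the vmul name and array shapes are mine, not from any standard library): built with OpenMP support, the loop iterations are split across threads; built without it, the pragma is ignored and the exact same source runs serially.

```c
#include <stddef.h>

/* Element-wise vector multiply. Compiled with OpenMP support
   (e.g., gcc -fopenmp), the iterations are split across threads;
   compiled without it, the pragma is ignored and the loop runs
   serially -- same source, same results either way. */
void vmul(size_t n, const double *x, const double *y, double *z)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}
```

Because the pragma is the only OpenMP artifact, the same file builds with any C compiler; enabling parallelism is purely a compile-line decision.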

OpenACC

OpenACC was originally developed to add the accelerator device support that was missing from OpenMP, and its design goals are a bit different. OpenACC takes a descriptive approach: directives describe the properties of a parallel region to the compiler, which then generates the best code it can for the target on which you plan to run. The goal of OpenACC is to support a wide range of targets, including multicore CPUs as well as GPUs and other accelerators.

As with OpenMP, OpenACC allows you to use a single codebase, which can reduce errors from the introduction of new code. To compilers, the directives just look like comments. OpenACC uses parallel directives (regions that are parallelizable), data directives (data movements to/from the accelerator devices), and clauses. Fundamentally, OpenACC requires that the parallel loop be free of any data dependencies, which sometimes requires loops to be rewritten. When such a code refactoring is required, the resulting code often runs faster both with and without the directives.

OpenACC breaks the work into smaller pieces depending on the directives used in the code and the target architecture for the code. The run-time environment will select how that code is mapped to gangs, which are essentially a group of threads that can neither synchronize nor share data, on the target architecture. For example, on CPUs, they are mapped to cores. For GPUs, they are mapped to the GPU processors. For more parallelism, OpenACC can also use multiple gangs or combinations of gangs and lower level parallelism (to be covered later).

Parallel Computing

Classically, applications were written to be executed serially. One computation is performed after another. But this approach doesn't take into account that some computations or regions can be computed simultaneously. Finding and parallelizing such regions of an application allows it to run faster and scale better than serial applications (see Amdahl's law [3]).

Today's processors have multiple cores, and accelerators such as GPUs have thousands of lightweight cores that can be used, as well. At a simplistic level, parallelization breaks a problem into discrete parts that can be solved simultaneously. Each part is then executed on different cores but with some sort of coordination.

One likely place for parallelization to occur is in loops. In this simple Fortran loop

do i = 1,n
   z(i) = x(i) * y(i)
enddo

each value z(i) is not dependent on previous values of z(i). Therefore, all values of z(i) can be computed in any order or at the same time. If the upper limit of the loop, n, is large enough, some processing hardware can greatly speed up the computation.

What happens if z(i) depends on a previous value, as in the following:

do i = 2,n
   z(i) = z(i-1)*2
enddo

As written, you can't parallelize the loop, because z(i) depends on z(i-1): a loop-carried data dependency [4], which prevents loop-level parallelism. However, for this particular example, you can rewrite the loop in a form that can be parallelized:

do i = 2,n
   z(i) = z(1)*2**(i-1)
enddo

When the compiler tells you a loop cannot be parallelized, you, the programmer, will need to determine whether it can be refactored in such a way that it can be parallelized.
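The same refactoring can be sketched in C (the function names are mine, and the shift used to compute the power of two is valid only for small exponents):

```c
#include <stddef.h>

/* As written: each z[i] depends on z[i-1], a loop-carried
   dependency, so the loop must run serially. */
void scale_serial(size_t n, double *z)
{
    for (size_t i = 1; i < n; i++)
        z[i] = z[i-1] * 2.0;
}

/* Refactored: the recurrence has the closed form z[i] = z[0] * 2^i,
   so every iteration depends only on z[0] and can run in any order,
   or in parallel. (The shift computes 2^i and is valid only for
   i < 64 in this sketch.) */
void scale_parallel(size_t n, double *z)
{
    double z0 = z[0];
    #pragma acc parallel loop
    for (size_t i = 1; i < n; i++)
        z[i] = z0 * (double)(1ULL << i);
}
```

Both versions produce identical results; only the refactored one gives the compiler independent iterations to distribute.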

In other situations, you might have to pay attention to race conditions, mutual exclusion, and parallel slowdown. For the purposes of this article, I won't cover these situations or conditions.

OpenACC Introduction

OpenACC has a number of directives that can be put into code that give the compiler information about how you would like to parallelize the code. The syntax's general form is shown in Table 1, but before getting too deep into directives, I'll look at what is happening in the code with the parallel directive.

Table 1: Parallelizing Fortran and C

Fortran

!$acc <directive> <clauses>
   < code >

C

#pragma acc <directive> <clauses>
   < code >

The parallel directive is a way to express parallelism in your code. Table 2 is a very simple example of getting started with directives in both Fortran and C.

Table 2: parallel Directive

Fortran

!$acc parallel
   < parallel code >
!$acc end parallel

C

#pragma acc parallel
{
   < parallel code >
}

The directives for Fortran and C begin with characters that a non-OpenACC-compliant compiler will ignore.

In Fortran, immediately after the comment character (!) comes $acc, which informs the compiler that everything that follows is an OpenACC directive. In C, the # introduces a pragma, a hint to the compiler, and the acc after it marks everything that follows as an OpenACC directive; a compiler simply ignores pragmas it does not recognize. Inserting directives is pretty straightforward. The only piece of advice is not to forget the $ before acc in Fortran or the #pragma before acc in C.

One of the neat things about OpenACC is that directives can be added incrementally. You should always start with working serial code (with or without MPI); then, you can start adding OpenACC directives to explore parallelism. Along with working serial code, you absolutely need a way to verify whether the output from the application is correct. If the output is verified as correct, then you can annotate more of the code with directives.

Annotating code is a process. You may find occasions when adding a directive causes the code to slow down. Do not worry. As you continue studying and annotating your code, you will find the reasons for the slowdowns and be able to correct them. Slowdowns happen frequently on GPU systems because of the need to move memory between the CPU and the GPU, and they are typically fixed as you add further directives incrementally.

OpenACC parallel loop Directives

A key to understanding OpenACC is understanding what happens when a parallel directive is encountered. This explanation will extend to other OpenACC directives, as well.

When the compiler encounters a parallel directive, it generates the appropriate parallel code. With OpenACC on a CPU-only system, that is most likely one thread per core, or more if the processor supports simultaneous multithreading (SMT). On a GPU, it can be a collection of processing elements (threads) that can number into the hundreds. In OpenACC parlance, these groups of processing elements are referred to as "gangs." These gangs, unless told otherwise, will execute the code redundantly; that is, each gang executes the exact same code (Figure 1). Processing proceeds from the top down for each gang.

Parallel directives and gangs (from OpenACC.org).
Figure 1: Parallel directives and gangs (from OpenACC.org).

Inside the parallel directive, if you have a loop as shown in Table 3, each gang will execute each loop redundantly (i.e., each gang will run the exact same code). These gangs are executed at the same time but independently of one another (Figure 2).

Table 3: Gang Execution

Fortran

!$acc parallel
do i=1,n
   ! do something
enddo
!$acc end parallel

C

#pragma acc parallel
{
   for (int i=0; i < n; i++) {
      // do something
   }
}

Gangs running loops (from OpenACC.org).
Figure 2: Gangs running loops (from OpenACC.org).

Having each gang compute the exact same thing is not an efficient use of resources. Adding a loop directive to the code (Table 4) tells the compiler that the loop code can be run in parallel across the gangs.

Table 4: loop Directive

Fortran

!$acc parallel
!$acc loop
do i=1,n
   a(i) = 0.0
enddo
!$acc end parallel

C

#pragma acc parallel
{
   #pragma acc loop
   for (int i=0; i < n; i++) {
      a[i] = 0.0;
   }
}

When the compiler encounters the parallel directive, it prepares to generate parallel code; the loop directive then informs it which loops to parallelize. In the above code, the compiler encounters the loop directive and creates parallelized code in which the loop iterations are split across the gangs as evenly as possible.

Adding only two lines to the code tells the compiler that you have a parallel region and a loop that can be parallelized. That is all you have to do. At this point, the compiler determines how best to parallelize the loop given the target processor architecture. This is the power of OpenACC.

Notice in Figure 3 that each gang runs a number of threads. All of this parallelization is created by the compiler using the directives in the code. A best practice is to combine parallel and loop in one directive covering the loops that are to be optimized. In Fortran this would be !$acc parallel loop, and in C it would be #pragma acc parallel loop.

Gangs running parallelized loop code (from OpenACC.org).
Figure 3: Gangs running parallelized loop code (from OpenACC.org).
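The classic saxpy loop shows how little is needed with the combined directive (this sketch assumes an OpenACC-capable compiler such as nvc -acc or gcc -fopenacc; with any other C compiler, the pragma is ignored and the loop simply runs serially):

```c
/* y = a*x + y: the combined parallel loop directive opens a parallel
   region and splits the loop iterations across the gangs. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The restrict qualifiers assure the compiler that x and y do not overlap, which makes the independence of the iterations easy to prove.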

Because the compiler uses the OpenACC directives to parallelize the code, it will have a difficult time if (1) the loop can't be parallelized or (2) you don't give it enough information to make a decision about parallelizing a loop. A compiler implementing the OpenACC standard should err on the side of caution and not try to parallelize a loop if it detects problems or is not sure.

If you have code with a loop that the compiler cannot parallelize (e.g., Listing 1), you might see compiler output like:

437, Complex loop carried dependence of e_k prevents parallelization
Loop carried reuse of e_k prevents parallelization

Listing 1: Unparallelizable Code

do k=1,9
   e_k(1) = real(e((k-1)*2+1))
   e_k(2) = real(e((k-1)*2+2))
   DP = e_k(1)*U_x(i,j) + e_k(2)*U_y(i,j)
   f_eq = w(k)*rho(i,j)*(1.0+3.0*DP+(9.0/2.0)*(DP**2.0)-(3.0/2.0)*DPU)
   f(i,j,k)=f(i,j,k)-(1.0/tau)*(f(i,j,k)-f_eq)
enddo

Although it is a very small loop of seven lines, it illustrates data dependencies that can cause the compiler to refuse to parallelize the loop(s).
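A common fix for this pattern, sketched here in C with made-up names rather than the physics of Listing 1, is to replace a small temporary array that is written and read every iteration with scalars local to the loop body, so each iteration owns private copies:

```c
/* Before: tmp[] is shared across iterations. The compiler sees the
   writes and reads of tmp as a loop-carried reuse and will refuse
   to parallelize the loop. */
void fold_shared(int n, const double *x, double *out, double *tmp)
{
    for (int i = 0; i < n; i++) {
        tmp[0] = x[i] * 2.0;
        tmp[1] = x[i] + 1.0;
        out[i] = tmp[0] + tmp[1];
    }
}

/* After: the temporaries are scalars declared inside the loop body,
   so each iteration has its own copies and the loop parallelizes. */
void fold_private(int n, const double *x, double *out)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        double t0 = x[i] * 2.0;
        double t1 = x[i] + 1.0;
        out[i] = t0 + t1;
    }
}
```

The two versions compute the same values; moving the temporaries inside the loop body only removes the false dependency the compiler complained about.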

OpenACC Programming Approach

In this article, I discussed only one directive: parallel loop. You can affect performance a great deal if you look for loop parallelism in your code and start using this directive. There is a recommended approach to using OpenACC directives, including the parallel loop directive.

Although I do not want to change your coding style, ideally, adding directives to code should be driven by profiling and tracing. Profiling determines the routines where most of the run time is spent, expressed as a simple table: the routine name and how much time was spent in that routine. Then you stack-rank the times, and the top routines are the initial primary focus. Tracing is a timeline examination of what is happening in the application. In the case of accelerators, this includes data movement to and from accelerators.

With the initial list of target routines in hand, you can start adding directives to your code. Generally, adding only one directive at a time is recommended. By incrementally adding directives, you can understand the effect each makes on run time. You add a directive, rebuild the code, run the code, test it to make sure the answers are correct, and then look at the effect on performance.

While adding directives to your code, if the run time goes up, don't despair. This can happen because of unneeded data movement between the CPU and the accelerator. (I will cover this topic in an upcoming article.) Above all, focus on the accuracy of the output. If the output is not correct, then you might need to change the directive, change the code, or even drop the parallel loop directive from that portion of the code.

Summary

OpenACC directives allow you to take serial code and "port" it to multicore CPUs or accelerators such as GPUs. To a compiler that doesn't understand OpenACC, the directives look like comments, so you can maintain a single version of the code (reducing the chance of errors and keeping code size down) and still build it with your usual compiler. If a compiler does understand OpenACC, then simply adding specific flags to your compile line will let you build and run with multiple CPU cores or accelerators.

Performance improvements are achieved by locating regions that can be parallelized in your application. A classic approach is to find loops that can be parallelized. This article tackled the parallel loop OpenACC directive for both Fortran and C. A best practice is to combine parallel and loop in one directive (i.e., !$acc parallel loop) for a loop nest you want to parallelize. Some "clauses" can be used to add more specificity to the directive to help the compiler better generate accelerator code. These clauses aren't covered in this article but can be found online.

When using loop parallelization, it is best practice to focus on the routines that use the most run time. In this fashion, you'll make the biggest dent in run time with the fewest directives. A simple profile of the application can provide you with a stack rank of the most time-consuming routines.

As you add directives to your application, be sure to check the output. It should match the output of the application built without directives. This is absolutely vital: running fast but getting the wrong answer is worthless. Running on processors other than CPUs can produce slightly different output, so you don't want a bit-for-bit comparison; rather, compare the output for significant differences. Defining what is "significant" is up to the user and the application, but it should not be taken lightly.
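One way to implement such a comparison, sketched here with an illustrative function name and tolerance that OpenACC itself does not prescribe, is a relative-tolerance check instead of an exact one:

```c
#include <stddef.h>

/* Absolute value helper; avoids pulling in math.h for this sketch. */
static double absd(double v) { return v < 0.0 ? -v : v; }

/* Compare two result arrays within a relative tolerance rtol instead
   of bit for bit; returns 1 if every pair of elements agrees.
   Choosing rtol is an application-specific decision. */
int nearly_equal(size_t n, const double *a, const double *b, double rtol)
{
    for (size_t i = 0; i < n; i++) {
        double sa = absd(a[i]), sb = absd(b[i]);
        double scale = sa > sb ? sa : sb;
        if (absd(a[i] - b[i]) > rtol * scale)
            return 0;
    }
    return 1;
}
```

Run the check against the serial build's output after every directive you add; a tolerance of a few units in the last place is a reasonable starting point for many applications.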

OpenACC has a number of directives other than parallel loop to help you port applications, but parallel loop alone allows you to start attacking "hot spots" in your code immediately to improve performance. Learning just two OpenACC directives in exchange for parallelizing your loops isn't a bad trade.

In the next OpenACC article, I discuss data usage, focusing on how you can consider it in combination with parallel loops to get even better performance.