Nuts and Bolts Preload Trick 
 

Improving performance with environment variables

Trick or No Trick

By using the LD_PRELOAD environment variable, you can improve performance without making changes to applications. By Jeff Layton

A topic that system administrators learn as they gain experience is called the "LD_PRELOAD Trick." This trick can help fix misbehaving applications, upgrade applications, and even improve application performance. Of course, it is not really a trick, just the use of a feature in *nix operating systems.

Have you ever installed an application on Linux and tried to run it only to be told the application can't be found? To debug the issue, probably the first thing to check is your PATH [1], which is "an environment variable … that tells the shell which directories to search for executable files." In short, the path tells Linux where to look for applications. If the application is not in the path, then Linux "thinks" it does not exist.

Fortunately, environment variables in Linux can be changed. If Linux cannot find an application, you can edit the PATH variable or append new directories to it. Linux uses other environment variables to define aspects of the operating system beyond the location of executables. Many applications define and use their own environment variables, and users can even define their own.
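As a quick illustration (using a hypothetical personal ~/bin directory, not a path from any particular system), you can inspect and extend the path for the current shell:

```shell
echo $PATH                      # show the current search path
mkdir -p $HOME/bin              # a personal directory for executables
export PATH=$PATH:$HOME/bin     # append it; applies to this shell only
echo $PATH                      # the new directory is now searched last
```

Because the change is made with export, it lasts only for the current shell session; add the line to ~/.bashrc to make it permanent.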

In addition to PATH, which helps locate applications, the LD_LIBRARY_PATH environment variable tells Linux where to search for the shared libraries used by applications, which allows you to control which libraries are "available." Like PATH, this variable can be changed, and each shell can have its own value.

The variable can be useful when debugging a new library because you can simply point LD_LIBRARY_PATH at the new library, test it, and then change it back. You can also use it when upgrading libraries: If there is little or no change to the API in the new library, a simple change to LD_LIBRARY_PATH lets you use the new library without changing anything else.
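That debugging workflow looks something like the following sketch ($HOME/lib-test is a hypothetical directory holding the library build under test; ldd lets you confirm which shared objects the dynamic linker would actually load):

```shell
# Put a test build of a library first in the search order, check what the
# dynamic linker will pick up, then restore the old value.
OLD_LLP=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/lib-test:$LD_LIBRARY_PATH
ldd /bin/ls | head -n 4          # lists the libraries that would be loaded
export LD_LIBRARY_PATH=$OLD_LLP  # change it back after testing
```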

A third environment variable that also works with libraries, and is at the heart of the "trick," is LD_PRELOAD, an environment variable that contains a delimited list of shared objects (libraries) [2] that are loaded before all others. This variable allows you to have more control over the order that libraries are found by the application than just LD_LIBRARY_PATH.

LD_PRELOAD can be a great help in debugging because you can set it to a new library without changing LD_LIBRARY_PATH. After debugging, just set LD_PRELOAD to its previous value.

Perhaps the greatest strength of LD_PRELOAD is that you can easily substitute a new library for an existing one, allowing you to upgrade a library in an attempt to get better performance. Inserting a library before another for whatever purpose you have in mind is the so-called LD_PRELOAD trick.
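You can watch the trick at work with ldd, which honors LD_PRELOAD and lists the preloaded object ahead of a binary's normal dependencies. This sketch assumes libz.so.1 is installed, as it is on most Linux distributions (any shared library would do):

```shell
# Without LD_PRELOAD: the binary's normal dependency list
ldd /bin/ls | head -n 3

# With LD_PRELOAD: libz.so.1 is injected ahead of everything else
LD_PRELOAD=libz.so.1 ldd /bin/ls | head -n 3
```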

One use I've seen of LD_PRELOAD is to load a library that intercepts calls to a normal library. The "intercept library" uses the same symbols (functions) as the usual library so that it will intercept any function calls from the application that were intended for that library. This intercept library can then be used to gather telemetry information from the calling application, perhaps writing it to a file. The intercept library then calls the intended functions in the usual library. With LD_PRELOAD, you can load the intercept library before the usual library without having to change it or the application.

A classic use case for an intercept library is for gathering telemetry (information) about I/O functions. With LD_PRELOAD, the intercept library intercepts I/O function calls such as open(), close(), read(), and write() to gather information and then passes the function calls to the intended I/O library. The intercept library uses the same function names, but rather than rewrite the I/O functionality for these functions, the new library typically gathers information, writes it to a file, and then calls the normal library to perform the I/O functions. Although this example is a classic use case of LD_PRELOAD, it is not the only use case. The next section presents another use of LD_PRELOAD resulting in increased performance.

Octave

Probably the best example I know of the LD_PRELOAD trick is pushing Basic Linear Algebra Subprograms (BLAS) [3] computations from a CPU onto an NVidia GPU. I will illustrate this with an example from Octave [4], a mathematics tool similar to Matlab [5].

To demonstrate the process, I'll use two Octave scripts: The first does a simple square matrix multiply in single precision for various matrix sizes (Listing 1). The second script (Listing 2) is the same as Listing 1, but uses double precision.

Listing 1: Single-Precision Square Matrix Multiply

# Example SGEMM
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
   A = single( rand(N,N) );
   B = single( rand(N,N) );
   start = clock();
   C = A*B;
   elapsedTime = etime(clock(), start);
   gFlops = 2*N*N*N / (elapsedTime * 1e+9);
   disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
                N, elapsedTime, gFlops) );
endfor

Listing 2: Double-Precision Square Matrix Multiply

# Example DGEMM
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
   A = double( rand(N,N) );
   B = double( rand(N,N) );
   start = clock();
   C = A*B;
   elapsedTime = etime(clock(), start);
   gFlops = 2*N*N*N / (elapsedTime * 1e+9);
   disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
                N, elapsedTime, gFlops) );
endfor

To begin, I'll run these scripts on a test system with the default BLAS library that comes with Octave; then, I can use the LD_PRELOAD trick to have Octave call a different BLAS library, resulting in different, conceivably better, performance.

The test system is my Linux laptop (see Table 1 for specifications). The laptop runs Ubuntu 20.04 with the 455.45.01 NVidia driver, and CUDA 11.2. Octave 5.2.0 was used for the tests. All software was installed from the Apt repository for the specific distribution version.

Table 1: Test System Specs

CPU: Intel Core i5-10300H CPU [6] @2.50GHz

Processor base frequency 2.5GHz

Max turbo frequency 4.5GHz

Cache 8MB

Four cores (eight with hyper-threading)

45W TDP

8GB DDR4-2933 memory

Maximum of two memory channels

Memory bandwidth 45.8GBps

NVidia GeForce 1650 GPU [7]

Architecture: Turing (TU117)

Memory 4GB GDDR5

Memory speed 8Gbps

Memory bandwidth 128GBps

Memory bus 128-bit

L2 cache 1MB

TDP 75W

Base clock 1,485MHz

Boost clock 1,665MHz

896 CUDA cores

The two scripts were run several times (>15) for each case to get a feel for the performance; then, they were run for the results presented in this article.

Default BLAS Library

By default, Octave uses a multithreaded BLAS library. Specifically, Octave used the BLAS library located at /lib/x86_64-linux-gnu/libblas.so.3. The two scripts, one for single precision and one for double precision, were run under the default BLAS library. The straightforward command to run the single-precision code with all cores (the default) is:

$ octave-cli ./sgemm.m

To run with a single core, you modify the command slightly:

$ OMP_NUM_THREADS=1 octave-cli ./sgemm.m

The results for running the two scripts are presented in Table 2 (where GFLOPS is a billion floating-point operations per second). First, they are run on a single core, and then on all cores. A fair amount of variability is evident for N=256 and N=512, which is also true for all subsequent CPU results.

Table 2: Octave Results with Default BLAS Library

       |   Single-Precision,    |   Double-Precision,    |   Single-Precision,    |   Double-Precision,
       |       One Core         |       One Core         |       All Cores        |       All Cores
   N   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS
     2 |   0.000702   0.000023  |   0.000427   0.000037  |   0.000961   0.000017  |   0.000137   0.000117
     4 |   0.000069   0.001864  |   0.000076   0.001678  |   0.000099   0.001291  |   0.00092    0.001398
     8 |   0.000069   0.014913  |   0.000061   0.016777  |   0.000092   0.011185  |   0.000084   0.012202
    16 |   0.000061   0.134218  |   0.000061   0.134218  |   0.000092   0.089478  |   0.000084   0.097613
    32 |   0.000076   0.858993  |   0.000076   0.858993  |   0.000099   0.660764  |   0.000107   0.613567
    64 |   0.000099   5.286114  |   0.000145   3.616815  |   0.000153   3.435974  |   0.000206   2.545166
   128 |   0.000313  13.408678  |   0.000587   7.139686  |   0.000565   7.429133  |   0.000473   8.867029
   256 |   0.001785  18.795071  |   0.003654   9.181725  |   0.000542  61.944317  |   0.001144  29.32031
   512 |   0.013779  19.481934  |   0.027763   9.668693  |   0.0047    57.117487  |   0.022438  11.963404
 1,024 |   0.100395  21.390301  |   0.215065   9.985277  |   0.02961   72.526405  |   0.055252  38.867022
 2,048 |   0.776039  22.137891  |   1.612694  10.652902  |   0.199173  86.256026  |   0.455025  37.755903
 4,096 |   5.855209  23.472936  |  12.275261  11.196418  |   1.575951  87.21019   |   3.468651  39.623174
 8,192 |  39.343849  27.946214  | 102.974144  10.677551  |  12.247917  89.771315  |  26.561623  41.394746

OpenBLAS

One of the most popular BLAS libraries is OpenBLAS [8], which you can use with the LD_PRELOAD trick instead of the default BLAS library. The command to run the single-precision script on a single core is:

$ OMP_NUM_THREADS=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libopenblas.so.0 octave-cli ./sgemm.m

Table 3 contains the results. Note that the OpenBLAS library was installed from the Apt repository for this distribution and version; a version built natively on the system would likely produce better results.

Table 3: Octave Results with OpenBLAS Library

       |   Single-Precision,    |   Double-Precision,    |   Single-Precision,    |   Double-Precision,
       |       One Core         |       One Core         |       All Cores        |       All Cores
   N   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS
     2 |   0.000114   0.00014   |   0.000114   0.00014   |   0.001022   0.000016  |   0.000771   0.000021
     4 |   0.000076   0.001678  |   0.000076   0.001678  |   0.000099   0.001291  |   0.000061   0.002097
     8 |   0.000061   0.016777  |   0.000061   0.016777  |   0.000092   0.011185  |   0.000061   0.016777
    16 |   0.000061   0.134218  |   0.000069   0.119305  |   0.000084   0.097613  |   0.000076   0.107374
    32 |   0.000061   1.073742  |   0.000076   0.858993  |   0.000092   0.715828  |   0.000076   0.858993
    64 |   0.000099   5.286114  |   0.000137   3.817749  |   0.000145   3.616815  |   0.000137   3.817749
   128 |   0.000313  13.408678  |   0.000572   7.330078  |   0.000381  10.995116  |   0.000656   6.392509
   256 |   0.001808  18.557158  |   0.003624   9.259045  |   0.000519  64.677155  |   0.001144  29.32031
   512 |   0.013237  20.279177  |   0.026962   9.955963  |   0.004074  65.888337  |   0.008163  32.882591
 1,024 |   0.101677  21.120656  |   0.20388   10.533061  |   0.035118  61.150332  |   0.052483  40.918008
 2,048 |   0.774956  22.168839  |   1.59137   10.79565   |   0.201546  85.240558  |   0.410416  41.859683
 4,096 |   5.741043  23.939718  |  11.007278  12.486188  |   1.558258  88.20038   |   3.523735  39.003771
 8,192 |  39.33165   27.954882  |  84.512154  13.010101  |  12.305489  89.351318  |  26.867691  40.92319

NVBLAS

NVidia has several libraries you can use when writing programs. Some of these are standards-conforming libraries, such as cuBLAS [9]. NVidia has taken cuBLAS and used it as part of a "drop-in" replacement BLAS library, NVBLAS, which provides the BLAS level 3 routines [10]. NVBLAS and cuBLAS are both included as part of CUDA [11]; simply follow the directions for downloading and installing CUDA. For this article, I used the cuBLAS and NVBLAS that came with the NVidia HPC SDK, version 21.3.

Before using NVBLAS, you have to configure it. From the NVBLAS documentation [12], "It must be configured through an ASCII text file that describes how many and which GPUs can participate in the intercepted BLAS calls." To use NVBLAS, create the file nvblas.conf in the directory in which you are running the scripts. For the example in this article, the contents of the file I used were:

# This is the configuration file to use NVBLAS Library
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so.0
NVBLAS_GPU_LIST 0
NVBLAS_AUTOPIN_MEM_ENABLED

The first line of the file defines the logfile to which NVBLAS writes any log information. The next line defines the CPU-only BLAS library used when a routine has no GPU implementation: those calls fall back to the CPU BLAS library that the NVBLAS_CPU_BLAS_LIB variable specifies. In this case, I chose to use the OpenBLAS library.

The third line lists the GPU devices that should be used. The numbering begins with 0. In this case, the laptop only has one NVidia GPU, so only one is listed. You can also use the keyword ALL to define all the GPUs in the system. The last line is something I used from an article about NVBLAS with Octave [13]. After configuring nvblas.conf, you have to take two steps to run Octave. The first step is to export the NVBLAS_CONFIG_FILE environment variable that points to the location of the nvblas.conf file:

export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf

This environment variable just points to the ASCII configuration file you created. The second step is the run command itself, which uses the LD_PRELOAD trick to load NVBLAS first:

LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026 octave-cli ./sgemm.m

The command begins by defining LD_PRELOAD, pointing to the NVBLAS library, which is then followed by the command that runs Octave (octave-cli). To run the script, you can simply concatenate the two commands together (I tend to write a one-line Bash script for this). The results for the single- and double-precision scripts are shown in Table 4.
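That wrapper script might look like the following sketch (the script name "run-octave-nvblas.sh" is my own choice, and the paths are the ones used in this article; adjust them for your installation):

```shell
# Create a small wrapper script that applies the NVBLAS preload
cat > run-octave-nvblas.sh <<'EOF'
#!/bin/bash
# Run an Octave script with BLAS level 3 calls routed through NVBLAS
export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf
LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026 octave-cli "${1:-./sgemm.m}"
EOF
chmod +x run-octave-nvblas.sh
```

With the wrapper in place, ./run-octave-nvblas.sh ./dgemm.m runs the double-precision script the same way.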

Table 4: Octave Results with the NVBLAS Library

       | Single-Precision, GPU    | Double-Precision, GPU
   N   | Elapsed (s)   GFLOPS     | Elapsed (s)   GFLOPS
     2 |   0.001167     0.000014  |   0.001007    0.000016
     4 |   0.000076     0.001678  |   0.000069    0.001864
     8 |   0.000061     0.016777  |   0.000061    0.016777
    16 |   0.000061     0.134218  |   0.000069    0.119305
    32 |   0.000076     0.858993  |   0.000076    0.858993
    64 |   0.000099     5.286114  |   0.000145    3.616815
   128 |   0.000542     7.74304   |   0.000603    6.958934
   256 |   0.000549    61.083979  |   0.001152   29.126136
   512 |   0.016685    16.087962  |   0.012955   20.721067
 1,024 |   0.008904   241.195353  |   0.039238   54.72975
 2,048 |   0.01741    986.765913  |   0.250496   68.583432
 4,096 |   0.093765  1465.776933  |   1.500099   91.619911
 8,192 |   0.643051  1709.835418  |  12.03125    91.387979

I cannot explain the strange blip in the results for N=512, but it happens very frequently. Notice the anomalous results at N=256 and N=512, which also appeared when using the CPU.

For the CPU results, the double-precision results are about half the single-precision results, which is expected. However, the GPU double-precision performance is less than half of the single-precision results, because the GPU used (the GeForce 1650) is a consumer-grade GPU with the focus primarily on 32-bit performance. However, as you can tell, it can run double-precision code, just not as well as the data center GPUs that focus on 64-bit performance.

Summary

The LD_PRELOAD trick is something of a rite of passage for new system administrators. When they discover it, it is a revelation because of how flexible it can be. Soon, it is no longer a trick but part of what the admin uses every day. I hope the simple example in this article, using LD_PRELOAD to push computations onto a GPU without any code changes, illustrates its utility.

If you knew of this trick but have forgotten it, or if you are just learning it, I hope this article proved useful.