Nuts and Bolts Preload Trick 
 

Improving performance with environment variables

Trick or No Trick

By using the LD_PRELOAD environment variable, you can improve performance without making changes to applications. By Jeff Layton

A topic that system administrators learn as they gain experience is called the "LD_PRELOAD Trick." This trick can help fix misbehaving applications, upgrade applications, and even improve application performance. Of course, it is not really a trick, just the use of a feature in *nix operating systems.

Have you ever installed an application on Linux and tried to run it only to be told the application can't be found? To debug the issue, probably the first thing to check is your PATH [1], which is "an environment variable … that tells the shell which directories to search for executable files." In short, the path tells Linux where to look for applications. If the application is not in the path, then Linux "thinks" it does not exist.

Fortunately, environment variables in Linux can be changed. If Linux cannot find an application, you can edit the PATH variable or append new directories to it. Linux uses other environment variables to define aspects of the operating system beyond the location of executables. Many applications define and use their own environment variables, and users can even define their own.
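As a quick illustration (using a hypothetical personal ~/bin directory, not a path from any particular system), you can inspect and extend the path for the current shell:

```shell
echo $PATH                      # show the current search path
mkdir -p $HOME/bin              # a personal directory for executables
export PATH=$PATH:$HOME/bin     # append it; applies to this shell only
echo $PATH                      # the new directory is now searched last
```

Because the change is made with export, it lasts only for the current shell session; add the line to ~/.bashrc to make it permanent.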

In addition to PATH, which helps locate applications, the LD_LIBRARY_PATH environment variable tells Linux where to search for the shared libraries used by applications, which allows you to control which libraries are "available." Like PATH, this variable can be changed, and each shell can have its own value.

The variable can be useful when debugging a new library because you can simply point LD_LIBRARY_PATH at the new library, test it, and then change it back. You can also use it when upgrading libraries: If there is little or no change to the API in the new library, a simple change to LD_LIBRARY_PATH lets you use the new library without changing anything else.
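That debugging workflow looks something like the following sketch ($HOME/lib-test is a hypothetical directory holding the library build under test; ldd lets you confirm which shared objects the dynamic linker would actually load):

```shell
# Put a test build of a library first in the search order, check what the
# dynamic linker will pick up, then restore the old value.
OLD_LLP=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/lib-test:$LD_LIBRARY_PATH
ldd /bin/ls | head -n 4          # lists the libraries that would be loaded
export LD_LIBRARY_PATH=$OLD_LLP  # change it back after testing
```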

A third environment variable that also works with libraries, and is at the heart of the "trick," is LD_PRELOAD, an environment variable that contains a delimited list of shared objects (libraries) [2] that are loaded before all others. This variable allows you to have more control over the order that libraries are found by the application than just LD_LIBRARY_PATH.

LD_PRELOAD can be a great help in debugging because you can set it to a new library without changing LD_LIBRARY_PATH. After debugging, just set LD_PRELOAD to its previous value.

Perhaps the greatest strength of LD_PRELOAD is that you can easily substitute a new library for an existing one, allowing you to upgrade a library in an attempt to get better performance. Inserting a library before another for whatever purpose you have in mind is the so-called LD_PRELOAD trick.
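You can watch the trick at work with ldd, which honors LD_PRELOAD and lists the preloaded object ahead of a binary's normal dependencies. This sketch assumes libz.so.1 is installed, as it is on most Linux distributions (any shared library would do):

```shell
# Without LD_PRELOAD: the binary's normal dependency list
ldd /bin/ls | head -n 3

# With LD_PRELOAD: libz.so.1 is injected ahead of everything else
LD_PRELOAD=libz.so.1 ldd /bin/ls | head -n 3
```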

One use I've seen of LD_PRELOAD is to load a library that intercepts calls to a normal library. The "intercept library" uses the same symbols (functions) as the usual library so that it will intercept any function calls from the application that were intended for that library. This intercept library can then be used to gather telemetry information from the calling application, perhaps writing it to a file. The intercept library then calls the intended functions in the usual library. With LD_PRELOAD, you can load the intercept library before the usual library without having to change it or the application.

A classic use case for an intercept library is for gathering telemetry (information) about I/O functions. With LD_PRELOAD, the intercept library intercepts I/O function calls such as open(), close(), read(), and write() to gather information and then passes the function calls to the intended I/O library. The intercept library uses the same function names, but rather than rewrite the I/O functionality for these functions, the new library typically gathers information, writes it to a file, and then calls the normal library to perform the I/O functions. Although this example is a classic use case of LD_PRELOAD, it is not the only use case. The next section presents another use of LD_PRELOAD resulting in increased performance.

Octave

Probably the best example I know of the LD_PRELOAD trick is pushing Basic Linear Algebra Subprograms (BLAS) [3] computations from a CPU onto an NVidia GPU. I will illustrate this with an example from Octave [4], a mathematics tool similar to Matlab [5].

To demonstrate the process, I'll use two Octave scripts: The first does a simple square matrix multiply in single precision for various matrix sizes (Listing 1). The second script (Listing 2) is the same as Listing 1, but uses double precision.

Listing 1: Single-Precision Square Matrix Multiply

# Example SGEMM
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
   A = single( rand(N,N) );
   B = single( rand(N,N) );
   start = clock();
   C = A*B;
   elapsedTime = etime(clock(), start);
   gFlops = 2*N*N*N / (elapsedTime * 1e+9);
   disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
                N, elapsedTime, gFlops) );
endfor

Listing 2: Double-Precision Square Matrix Multiply

# Example DGEMM
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
   A = double( rand(N,N) );
   B = double( rand(N,N) );
   start = clock();
   C = A*B;
   elapsedTime = etime(clock(), start);
   gFlops = 2*N*N*N / (elapsedTime * 1e+9);
   disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
                N, elapsedTime, gFlops) );
endfor

To begin, I'll run these scripts on a test system with the default BLAS library that comes with Octave; then, I can use the LD_PRELOAD trick to have Octave call a different BLAS library, resulting in different, conceivably better, performance.

The test system is my Linux laptop (see Table 1 for specifications). The laptop runs Ubuntu 20.04 with the 455.45.01 NVidia driver, and CUDA 11.2. Octave 5.2.0 was used for the tests. All software was installed from the Apt repository for the specific distribution version.

Table 1: Test System Specs

CPU: Intel Core i5-10300H CPU [6] @2.50GHz

Processor base frequency 2.5GHz

Max turbo frequency 4.5GHz

Cache 8MB

Four cores (eight with hyper-threading)

45W TDP

8GB DDR4-2933 memory

Maximum of two memory channels

Memory bandwidth 45.8GBps

NVidia GeForce 1650 GPU [7]

Architecture: Turing (TU117)

Memory 4GB GDDR5

Memory speed 8Gbps

Memory bandwidth 128GBps

Memory bus 128-bit

L2 cache 1MB

TDP 75W

Base clock 1,485MHz

Boost clock 1,665MHz

896 CUDA cores

The two scripts were run several times (>15) for each case to get a feel for the performance; then, they were run for the results presented in this article.

Default BLAS Library

By default, Octave uses a multithreaded BLAS library. Specifically, Octave used the BLAS library located at /lib/x86_64-linux-gnu/libblas.so.3. The two scripts, one for single precision and one for double precision, were run under the default BLAS library. The straightforward command to run the single-precision code with all cores (the default) is:

$ octave-cli ./sgemm.m

To run with a single core, you modify the command slightly:

$ OMP_NUM_THREADS=1 octave-cli ./sgemm.m

The results for running the two scripts are presented in Table 2 (where GFLOPS is a billion floating-point operations per second). First, they are run on a single core, and then on all cores. A fair amount of variability is evident for N=256 and N=512, which is also true for all subsequent CPU results.

Table 2: Octave Results with Default BLAS Library

       |   Single-Precision,    |   Double-Precision,    |   Single-Precision,    |   Double-Precision,
       |       One Core         |       One Core         |       All Cores        |       All Cores
   N   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS
     2 |   0.000702   0.000023  |   0.000427   0.000037  |   0.000961   0.000017  |   0.000137   0.000117
     4 |   0.000069   0.001864  |   0.000076   0.001678  |   0.000099   0.001291  |   0.00092    0.001398
     8 |   0.000069   0.014913  |   0.000061   0.016777  |   0.000092   0.011185  |   0.000084   0.012202
    16 |   0.000061   0.134218  |   0.000061   0.134218  |   0.000092   0.089478  |   0.000084   0.097613
    32 |   0.000076   0.858993  |   0.000076   0.858993  |   0.000099   0.660764  |   0.000107   0.613567
    64 |   0.000099   5.286114  |   0.000145   3.616815  |   0.000153   3.435974  |   0.000206   2.545166
   128 |   0.000313  13.408678  |   0.000587   7.139686  |   0.000565   7.429133  |   0.000473   8.867029
   256 |   0.001785  18.795071  |   0.003654   9.181725  |   0.000542  61.944317  |   0.001144  29.32031
   512 |   0.013779  19.481934  |   0.027763   9.668693  |   0.0047    57.117487  |   0.022438  11.963404
 1,024 |   0.100395  21.390301  |   0.215065   9.985277  |   0.02961   72.526405  |   0.055252  38.867022
 2,048 |   0.776039  22.137891  |   1.612694  10.652902  |   0.199173  86.256026  |   0.455025  37.755903
 4,096 |   5.855209  23.472936  |  12.275261  11.196418  |   1.575951  87.21019   |   3.468651  39.623174
 8,192 |  39.343849  27.946214  | 102.974144  10.677551  |  12.247917  89.771315  |  26.561623  41.394746

OpenBLAS

One of the most popular BLAS libraries is OpenBLAS [8], which you can use with the LD_PRELOAD trick instead of the default BLAS library. The command to run the single-precision script on a single core is:

$ OMP_NUM_THREADS=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libopenblas.so.0 octave-cli ./sgemm.m

Table 3 contains the results. Note that the OpenBLAS library was installed from the Apt repository for this distribution and version; a version built natively on the system would likely produce better results.

Table 3: Octave Results with OpenBLAS Library

       |   Single-Precision,    |   Double-Precision,    |   Single-Precision,    |   Double-Precision,
       |       One Core         |       One Core         |       All Cores        |       All Cores
   N   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS   | Elapsed (s)   GFLOPS
     2 |   0.000114   0.00014   |   0.000114   0.00014   |   0.001022   0.000016  |   0.000771   0.000021
     4 |   0.000076   0.001678  |   0.000076   0.001678  |   0.000099   0.001291  |   0.000061   0.002097
     8 |   0.000061   0.016777  |   0.000061   0.016777  |   0.000092   0.011185  |   0.000061   0.016777
    16 |   0.000061   0.134218  |   0.000069   0.119305  |   0.000084   0.097613  |   0.000076   0.107374
    32 |   0.000061   1.073742  |   0.000076   0.858993  |   0.000092   0.715828  |   0.000076   0.858993
    64 |   0.000099   5.286114  |   0.000137   3.817749  |   0.000145   3.616815  |   0.000137   3.817749
   128 |   0.000313  13.408678  |   0.000572   7.330078  |   0.000381  10.995116  |   0.000656   6.392509
   256 |   0.001808  18.557158  |   0.003624   9.259045  |   0.000519  64.677155  |   0.001144  29.32031
   512 |   0.013237  20.279177  |   0.026962   9.955963  |   0.004074  65.888337  |   0.008163  32.882591
 1,024 |   0.101677  21.120656  |   0.20388   10.533061  |   0.035118  61.150332  |   0.052483  40.918008
 2,048 |   0.774956  22.168839  |   1.59137   10.79565   |   0.201546  85.240558  |   0.410416  41.859683
 4,096 |   5.741043  23.939718  |  11.007278  12.486188  |   1.558258  88.20038   |   3.523735  39.003771
 8,192 |  39.33165   27.954882  |  84.512154  13.010101  |  12.305489  89.351318  |  26.867691  40.92319

NVBLAS

NVidia has several libraries you can use when writing programs. Some of these are standards-conforming libraries, such as cuBLAS [9]. NVidia has taken cuBLAS and used it as part of a "drop-in" replacement BLAS library, NVBLAS, which provides the BLAS level 3 routines [10]. NVBLAS and cuBLAS are both included as part of CUDA [11]; simply follow the directions for downloading and installing CUDA. For this article, I used the cuBLAS and NVBLAS that came with the NVidia HPC SDK, version 21.3.

Before using NVBLAS, you have to configure it. From the NVBLAS documentation [12], "It must be configured through an ASCII text file that describes how many and which GPUs can participate in the intercepted BLAS calls." To use NVBLAS, create the file nvblas.conf in the directory in which you are running the scripts. For the example in this article, the contents of the file I used were:

# This is the configuration file to use NVBLAS Library
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so.0
NVBLAS_GPU_LIST 0
NVBLAS_AUTOPIN_MEM_ENABLED

The first line of the file defines the logfile to which NVBLAS writes any log information. The next line defines the CPU-only BLAS library used when a routine has no GPU implementation: those calls fall back to the CPU BLAS library that the NVBLAS_CPU_BLAS_LIB variable specifies. In this case, I chose to use the OpenBLAS library.

The third line lists the GPU devices that should be used. The numbering begins with 0. In this case, the laptop only has one NVidia GPU, so only one is listed. You can also use the keyword ALL to define all the GPUs in the system. The last line is something I used from an article about NVBLAS with Octave [13]. After configuring nvblas.conf, you have to take two steps to run Octave. The first step is to export the NVBLAS_CONFIG_FILE environment variable that points to the location of the nvblas.conf file:

export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf

This environment variable just points to the ASCII configuration file you created. The second step is the run command itself, which uses the LD_PRELOAD trick to load NVBLAS first:

LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026 octave-cli ./sgemm.m

The command begins by defining LD_PRELOAD, pointing to the NVBLAS library, which is then followed by the command that runs Octave (octave-cli). To run the script, you can simply concatenate the two commands together (I tend to write a one-line Bash script for this). The results for the single- and double-precision scripts are shown in Table 4.
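That wrapper script might look like the following sketch (the script name "run-octave-nvblas.sh" is my own choice, and the paths are the ones used in this article; adjust them for your installation):

```shell
# Create a small wrapper script that applies the NVBLAS preload
cat > run-octave-nvblas.sh <<'EOF'
#!/bin/bash
# Run an Octave script with BLAS level 3 calls routed through NVBLAS
export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf
LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026 octave-cli "${1:-./sgemm.m}"
EOF
chmod +x run-octave-nvblas.sh
```

With the wrapper in place, ./run-octave-nvblas.sh ./dgemm.m runs the double-precision script the same way.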

Table 4: Octave Results with the NVBLAS Library

       | Single-Precision, GPU    | Double-Precision, GPU
   N   | Elapsed (s)   GFLOPS     | Elapsed (s)   GFLOPS
     2 |   0.001167     0.000014  |   0.001007    0.000016
     4 |   0.000076     0.001678  |   0.000069    0.001864
     8 |   0.000061     0.016777  |   0.000061    0.016777
    16 |   0.000061     0.134218  |   0.000069    0.119305
    32 |   0.000076     0.858993  |   0.000076    0.858993
    64 |   0.000099     5.286114  |   0.000145    3.616815
   128 |   0.000542     7.74304   |   0.000603    6.958934
   256 |   0.000549    61.083979  |   0.001152   29.126136
   512 |   0.016685    16.087962  |   0.012955   20.721067
 1,024 |   0.008904   241.195353  |   0.039238   54.72975
 2,048 |   0.01741    986.765913  |   0.250496   68.583432
 4,096 |   0.093765  1465.776933  |   1.500099   91.619911
 8,192 |   0.643051  1709.835418  |  12.03125    91.387979

I cannot explain the strange blip in the results for N=512, but it happens very frequently. Notice the anomalous results at N=256 and N=512, which also appeared when using the CPU.

For the CPU results, the double-precision results are about half the single-precision results, which is expected. However, the GPU double-precision performance is less than half of the single-precision results, because the GPU used (the GeForce 1650) is a consumer-grade GPU with the focus primarily on 32-bit performance. However, as you can tell, it can run double-precision code, just not as well as the data center GPUs that focus on 64-bit performance.

Summary

The LD_PRELOAD trick is something of a rite of passage for new system administrators. When they discover it, it is a revelation because of how flexible it can be. Soon, it is no longer a trick but part of what the admin uses every day. I hope the simple example in this article, using LD_PRELOAD to push computations onto a GPU without any code changes, illustrates its utility.

If you knew of this trick but have forgotten it, or if you are just learning it, I hope this article proved useful.