GPUs on Grid5000
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
Introduction
This tutorial explains how to use GPU accelerators and Intel Xeon Phi coprocessors on Grid'5000. You will learn how to reserve these resources, set up the environment and execute code on the accelerators. Please note that this page is not about GPU or Xeon Phi programming; it only covers the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is especially useful for testing the latest version of the accelerator software stack (such as the NVIDIA CUDA libraries or the Intel Manycore Platform Software Stack (MPSS)).
In this tutorial, we provide code examples that use the Level-3 BLAS function DGEMM to compute the product of two matrices. BLAS libraries are available for a variety of computer architectures (including multicores and accelerators), and this code example is used in this tutorial as a toy benchmark to compare the performance of accelerators and/or of the available BLAS libraries.
This tutorial is divided into two distinct parts that can be done in any order:
- GPU accelerators on Grid'5000
- Intel Xeon Phi (MIC) on Grid'5000
For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the Getting Started tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment). The Hardware page is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see Status).
GPU accelerators on Grid'5000
Note that the NVIDIA drivers (see nvidia-smi) and the CUDA compilation tools (see nvcc --version) are installed by default on nodes.
Reserving GPUs
If you only need a single GPU in the standard environment, reservation is as simple as:
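A minimal sketch of such a reservation, assuming the site exposes the gpu OAR resource type (adapt the walltime to your needs):
frontend: oarsub -I -l gpu=1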
In Nancy, you have to use the production queue for some GPU clusters, for instance:
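For instance, a hedged example (the production queue is selected with -q production):
nancy: oarsub -I -q production -l gpu=1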
If you require several GPUs for the same experiment (e.g. for inter-GPU communication or to distribute computation), you can reserve multiple GPUs of a single node:
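One possible form, assuming OAR's hierarchical resource syntax to keep both GPUs on the same host:
frontend: oarsub -I -l host=1/gpu=2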
To select a specific model of GPU, use the "gpu_model" property, e.g.
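A sketch using that property; the model string 'GTX 1080 Ti' below is only illustrative and must match a value from the OAR properties page:
frontend: oarsub -I -l gpu=1 -p "gpu_model = 'GTX 1080 Ti'"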
The exact list of GPU models is available on the OAR properties page, and you can use the Hardware page to get an overview of the GPUs available at each site.
Reserving full nodes with GPUs
In some cases, you may want to reserve a complete node with all its GPUs. This allows you to customize the software environment with Sudo-g5k or even to deploy another operating system.
To make sure you obtain a node with a GPU, you can use the "gpu_count" property:
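For example, a minimal sketch (without an explicit gpu= resource request, the whole node is reserved):
frontend: oarsub -I -p "gpu_count > 0"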
In Nancy, you have to use the production queue for most GPU clusters:
To select a specific model of GPU, you can also use the "gpu_model" property, e.g.
If you want to deploy an environment on the node, you should add the -t deploy option.
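For instance, a sketch of a deploy-type reservation of a GPU node:
frontend: oarsub -I -t deploy -p "gpu_count > 0"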
GPU usage tutorial
In this section, we give an example of GPU usage on Grid'5000.
Every step of this tutorial must be performed on an Nvidia GPU node.
Run the CUDA Toolkit examples
In this part, we are going to compile and execute examples provided by the CUDA Toolkit on the default (production) environment.
First, we copy the CUDA 9.0 samples and extract them to /tmp/samples:
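One possible way to do this; the source path below is an assumption about where the toolkit is installed on the node (the samples can also be extracted from the installer, as noted below):
node: cp -r /usr/local/cuda-9.0/samples /tmp/samples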
Note: These samples are part of the CUDA 9.0 Toolkit and can also be extracted from the toolkit installer using the --extract=/path option.
The CUDA examples are described in /tmp/samples/Samples.html. You might also want to have a look at the doc/ directory or the online documentation.
You can compile all the examples at once. From the CUDA samples source directory (/tmp/samples), run make to compile the examples:
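For example (a parallel build shortens the compilation considerably):
node: cd /tmp/samples
node: make -j8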
The compilation of all the examples is over when "Finished building CUDA samples" is printed. Each example can also be compiled separately from its own directory.
You can first try the deviceQuery example located in /tmp/samples/1_Utilities/deviceQuery/. It enumerates the properties of the CUDA devices present in the system.
Here is an example of the result on the graphique cluster at Nancy:
/tmp/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6082 MBytes (6377766912 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            980 MHz (0.98 GHz)
  Memory Clock rate:                             3500 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6082 MBytes (6377766912 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            980 MHz (0.98 GHz)
  Memory Clock rate:                             3500 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

> Peer access from GeForce GTX TITAN Black (GPU0) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX TITAN Black (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS
BLAS examples
We now run our BLAS example to illustrate GPU performance for dense matrix multiply.
The toolkit provides the CUBLAS library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available here and several advanced examples using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...).
The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides NVBLAS, a library that automatically offloads compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program, as NVBLAS can be forcibly linked using the LD_PRELOAD environment variable.
To test NVBLAS, you can download and compile our matrix-matrix multiplication example:
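Assuming the example is the single matmatmul.c source file used later on this page and that a CPU BLAS development package is installed on the node, the compilation might look like:
node: gcc -O3 matmatmul.c -o matmatmul -lblas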
You can first check the performance of the BLAS library on the CPU. For small matrix sizes (<5000), the provided example compares the BLAS implementation to a naive jki-loop version of the matrix multiplication:
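For instance, with a 2000x2000 problem (the size argument is an assumption about the example's command-line interface):
node: ./matmatmul 2000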
Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000) BLAS - Time elapsed: 1.724E+00 sec. J,K,I - Time elapsed: 7.233E+00 sec.
To offload the BLAS computation on the GPU, use:
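A minimal sketch: NVBLAS reads a configuration file (nvblas.conf in the current directory, or the path given by NVBLAS_CONFIG_FILE) that must point to a CPU BLAS used as fallback; the CPU BLAS path below is an assumption. The library is then preloaded:
node: echo "NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libblas.so" > nvblas.conf
node: LD_PRELOAD=libnvblas.so ./matmatmul 2000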
[NVBLAS] Config parsed Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000) BLAS - Time elapsed: 1.249E-01 sec.
The CPU/GPU comparison becomes more meaningful with larger problems:
Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000) BLAS - Time elapsed: 2.673E+01 sec.
[NVBLAS] Config parsed Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000) BLAS - Time elapsed: 1.718E+00 sec.
If you want to measure the time spent on data transfers to the GPU, you can use the simpleCUBLAS example (/tmp/samples/7_CUDALibraries/simpleCUBLAS) and instrument the code with timers.
Custom CUDA version or Nvidia drivers
Here, we explain how to use the latest CUDA versions with "module", how to use Nvidia Docker images, and how to install the NVIDIA drivers and compilers, before validating the installation on the previous set of examples.
Newer CUDA version using modules
Different CUDA versions can be loaded using the "module" command. You should first choose the CUDA toolkit version that you will load with the module tool:
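For example, to list the CUDA modules available on the node:
node: module av cuda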
----------------------- /grid5000/spack/share/spack/modules/linux-debian9-x86_64 ----------------------- [...] cuda/7.5.18_gcc-6.4.0 cuda/8.0.61_gcc-6.4.0 cuda/9.0.176_gcc-6.4.0 cuda/9.1.85_gcc-6.4.0 cuda/9.2.88_gcc-6.4.0 cuda/10.0.130_gcc-6.4.0 cuda/10.1.243_gcc-6.4.0 [...]
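Then load the chosen version and check the compiler, e.g. for the 10.1.243 build listed above:
node: module load cuda/10.1.243_gcc-6.4.0
node: nvcc --version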
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
You should consult the CUDA Toolkit and Compatible Driver Versions table and download the corresponding installer (for example, the CUDA 10.0.x toolkit requires a driver version >= 410.48).
Copy and compile the sample examples
You now have everything installed. For instance, you can compile and run the toolkit examples (see #Compiling the CUDA Toolkit examples for more information). The module provides nvcc at:
/grid5000/spack/opt/spack/linux-debian9-x86_64/gcc-6.4.0/cuda-10.1.243-am4nmkjzn2gofwt2xvvwysbklkph2c2u/bin/nvcc
Copy the samples shipped with this CUDA version to /tmp:
node: cp -R /grid5000/spack/opt/spack/linux-debian9-x86_64/gcc-6.4.0/cuda-10.1.243-am4nmkjzn2gofwt2xvvwysbklkph2c2u/samples /tmp/
The newly created environment can be saved with tgz-g5k, to be reused later:
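A hedged sketch, run from the frontend (the node name and archive path are placeholders; check the tgz-g5k documentation for the exact options):
frontend: tgz-g5k -m mynode.mysite.grid5000.fr -f ~/my-cuda-env.tgz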
Note: With some older GPUs, you might encounter errors when running the latest versions of CUDA. This is the case with the orion cluster, for example.
Nvidia-docker
A script to install nvidia-docker is available if you want to use Nvidia's images built for Docker on GPU nodes. See the Nvidia Docker page.
Custom Nvidia driver using deployment
First, reserve a GPU node and deploy the debian11-x64-nfs environment. This environment allows you to connect either as root (to be able to install new software) or with your normal Grid'5000 user (including access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:
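A possible sequence, as a sketch (adapt the walltime; kadeploy3 is run from the frontend):
frontend: oarsub -I -t deploy -p "gpu_count > 0" -l nodes=1,walltime=2
frontend: kadeploy3 -e debian11-x64-nfs -f $OAR_NODE_FILE -k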
Once the deployment is terminated, you should be able to connect to the node as root:
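For example (the node name is a placeholder):
frontend: ssh root@mynode.mysite.grid5000.fr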
You can then perform the NVIDIA driver installation:
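A hedged sketch: download the .run installer matching the driver version you need (the file name below is a placeholder) and run it in silent mode. Building the kernel module requires gcc, make and the kernel headers, which may have to be installed first:
root@node: apt-get update && apt-get install -y gcc make linux-headers-amd64
root@node: sh NVIDIA-Linux-x86_64-<version>.run -s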
(warnings about X.Org can safely be ignored)
On the node, you can check which NVIDIA drivers are installed with the nvidia-smi tool:
Here is an example of the result on the chifflet cluster:
chifflet-7:~# nvidia-smi
Tue Apr 9 15:56:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                 Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 18%   29C    P0    58W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 19%   23C    P0    53W / 250W |      0MiB / 11178MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Intel Xeon Phi (MIC) on Grid'5000
Reserve a Xeon Phi at Nancy
Like NVIDIA GPUs, Xeon Phi coprocessor cards provide additional compute power and can be used to offload computations. As these extension cards run a modified Linux kernel, it is also possible to log in directly to the Xeon Phi via ssh. It is also possible to compile applications for the Xeon Phi processor (which is based on x86 technology) and run them natively on the embedded Linux system of the Xeon Phi card.
Xeon Phi 7120P cards are available at Nancy.
Since the Xeon Phi cards available in Grid'5000 are already quite old, they are no longer supported by the default environment (OS) provided on nodes. As a result, one must deploy an older version of the Grid'5000 environments (based on Debian 8 "Jessie") in order to have the MIC software stack installed.
To reserve a Grid'5000 node that includes a Xeon Phi, you can use this command:
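A sketch targeting the graphite cluster at Nancy, which hosts these cards (cluster name taken from the examples below; a deploy-type job is needed since an older environment must be deployed):
frontend: oarsub -I -t deploy -p "cluster='graphite'" -l nodes=1,walltime=2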
Then, nodes need to be deployed with the Jessie Big environment:
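For instance, assuming the environment is named jessie-x64-big (kaenv3 -l lists the exact environment names):
frontend: kadeploy3 -e jessie-x64-big -f $OAR_NODE_FILE -k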
You should then be able to ssh to the nodes and check the status of the MIC card using micinfo:
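For example (the node name is a placeholder; micinfo is part of the Intel MPSS tools):
frontend: ssh root@graphite-1.nancy.grid5000.fr
root@node: micinfo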
Warning: If you want to make sure no one else can ssh to your deployed node, connect to the node as root and run:
Intel compilers
Intel compilers are the most appropriate compilers for the Intel Xeon Phi. Intel compilers are available using the module tool (intel-parallel-studio package), but they require a commercial (or academic) license, and such a license is not provided by Grid'5000. However, you may have access to a license server in your local laboratory. See Environment_modules for instructions on configuring access to a license server.
Another option is to compile your programs somewhere where the Intel compiler is available, and then copy your executable binary (compiled code) to your Grid'5000 nodes (beware of CPU architecture homogeneity, however).
Note that GCC and Clang also provide limited support for the newest Xeon Phi. See this page for more information.
Execution on Xeon Phi
Setup
Note: Since MICs are no longer supported in the standard environment, the setup is more complex and involves several root-level technical steps.
First, we have to make sure the MIC is able to access the Grid'5000 network. For that, we can set up NAT on the host. The following commands have to be run as root on the host.
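A minimal NAT sketch, assuming the MIC is reached through the mic0 interface and the node's main interface is eth0 (interface names may differ on your node):
root@node: echo 1 > /proc/sys/net/ipv4/ip_forward
root@node: iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
root@node: iptables -A FORWARD -i mic0 -o eth0 -j ACCEPT
root@node: iptables -A FORWARD -i eth0 -o mic0 -m state --state RELATED,ESTABLISHED -j ACCEPT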
Then, we have to make sure the MIC is booted. You may look at the output of the dmesg command; if it is not booted, restart the MPSS service:
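For example (assuming the MPSS init service is simply named mpss):
root@node: dmesg | grep -i mic0
root@node: service mpss restart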
Warning: As of Mar. 19th, 2018, there is a bug in /etc/hosts of the environments: a newline is missing (bug #9125); you must fix it.
Look again at the dmesg command output and wait for the following line:
mic0: Transition from state booting to online
You should now be able to SSH to the MIC as root. Make sure NFS is mounted.
In order to connect to the Xeon Phi card with your Grid'5000 username using SSH, you have to set up your user. Run the following commands from the host as root:
then
Using the MIC
An introduction to the Xeon Phi programming environment is available on the Intel website. Other useful resources include:
Before using the Intel compilers or executing codes that dynamically link to Intel libraries, you have to set up your environment:
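One possibility, assuming the Intel compilers are available either through the module tool (see the licensing remark above) or from a local installation under /opt/intel:
# if the compilers come from the module tool:
node: module load intel-parallel-studio
# or, for a local installation under /opt/intel:
node: source /opt/intel/bin/compilervars.sh intel64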
Offload mode
In offload mode, your program is executed on the host, but part of its execution is offloaded to the coprocessor card. Intel provides a code snippet in its tutorial that shows a sum reduction operation being run on a Xeon Phi processor. This example is available in the /grid5000/xeonphi/samples/ directory and can be compiled and executed as follows:
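A hedged sketch, assuming the sample is a single source file (the name reduction.c is only illustrative of the sum-reduction snippet):
node: cp /grid5000/xeonphi/samples/reduction.c /tmp/ && cd /tmp
node: icc reduction.c -o reduction
node: ./reduction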
Native mode
In native mode, your program is completely executed on the Xeon Phi. Your code must be compiled natively for the Xeon Phi architecture using the -mmic option:
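For instance, to build the hello_mic binary used below (hello_mic.c is an assumed source file name):
node: icpc -mmic hello_mic.c -o hello_mic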
This program cannot be run on the host:
-bash: ./hello_mic: cannot execute binary file: Exec format error
To execute this program, you have to run it directly on the MIC. You can ssh to the MIC and launch the program from there:
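For example, assuming the card is reachable as mic0 and your home directory is NFS-mounted on the card (see the setup above):
node: ssh mic0
mic0: ./hello_mic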
BLAS examples
Download our BLAS matrix-matrix multiplication example:
The Intel MKL (Math Kernel Library) provides an implementation of the BLAS. Our BLAS example can be linked with the MKL:
graphite: icpc matmatmul.c -o matmatmul_mkl_seq -DHAVE_MKL -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread
MKL also provides a threaded version of the BLAS that can be used as follows:
graphite: icpc matmatmul.c -o matmatmul_mkl -DHAVE_MKL -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -fopenmp
More information on the compilation options can be found here.
You can compare the performance of the different flavors:
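A possible run, as a sketch (5000x5000 matrices, matching the outputs below; the size argument and the mapping of binaries to timings are assumptions):
node: ./matmatmul 5000
node: ./matmatmul_mkl_seq 5000
node: ./matmatmul_mkl 5000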
Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000) BLAS - Time elapsed: 3.605E+01 sec.
Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000) BLAS - Time elapsed: 1.676E+01 sec.
Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000) BLAS - Time elapsed: 1.855E+00 sec.
The MKL is also available natively on the Xeon Phi:
graphite: icpc -mmic matmatmul.c -o matmatmul_mic -DHAVE_MKL -L${MKLROOT}/lib/mic -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -fopenmp
Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000) BLAS - Time elapsed: 1.200E+00 sec.
MKL also provides an automatic offload mode that can be compared to the NVBLAS library for NVIDIA GPUs:
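One way to enable it for the threaded MKL binary built above (MKL_MIC_ENABLE=1 turns on MKL's automatic offload):
node: MKL_MIC_ENABLE=1 ./matmatmul_mkl 5000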
Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
BLAS -
[MKL] [MIC --] [AO Function]              DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]    0.35 0.65
[MKL] [MIC 00] [AO DGEMM CPU Time]        3.050871 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]        0.541124 seconds
Time elapsed: 3.168E+00 sec.
The data transfer cost between the host and the accelerator is amortized for larger matrices:
Multiplying Matrices: C(10000x10000) = A(10000x10000) x B(10000x10000) BLAS - Time elapsed: 9.304E+00 sec.
Multiplying Matrices: C(10000x10000) = A(10000x10000) x B(10000x10000) BLAS - Time elapsed: 6.157E+00 sec.