GPUs on Grid5000: Difference between revisions

 
(29 intermediate revisions by 6 users not shown)
Line 1: Line 1:
{{Maintainer|Simon Delamare}}
{{Maintainer|Pierre Neyron}}
{{Author|Jérémie Gaidamour}}
{{Portal|User}}
{{Portal|HPC}}
Line 7: Line 4:
{{Pages|HPC}}
{{TutorialHeader}}
= Introduction =


This tutorial presents how to use GPU Accelerators and Intel Xeon Phi Coprocessors on Grid'5000. You will learn to reserve these resources, setup the environment and execute codes on the accelerators. Please note that this page is not about GPU or Xeon Phi programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is especially useful for testing the latest version of the accelerator software stack (such as the NVIDIA CUDA libraries or the Intel Manycore Platform Software Stack (MPSS)).
This tutorial explains how to use GPU accelerators on Grid'5000. You will learn to reserve these resources, set up the environment and execute code on the accelerators. Please note that this page is not about GPU programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is especially useful for testing the latest version of the accelerator software stack (such as the NVIDIA CUDA libraries).


In this tutorial, we provide code examples that use the Level-3 [https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms BLAS] function DGEMM to compute the product of two matrices. BLAS libraries are available for a variety of computer architectures (including multicores and accelerators), and this code example is used in this tutorial as a toy benchmark to compare the performance of accelerators and/or available BLAS libraries.


This tutorial is divided into two distinct parts that can be done in any order:
For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the [[Getting Started]] tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment).
* [[#GPU accelerators on Grid'5000]]
The [[Hardware#Accelerators (GPU, Xeon Phi)|Hardware page]] is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see [[Status]]).
* [[#Intel Xeon Phi (MIC) on Grid'5000]]


For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the [[Getting Started]] tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment).
Note that the Intel Xeon Phi KNC coprocessors (MICs) available in Nancy are no longer supported (the [[Unmaintained:Intel Xeon Phi|documentation]] remains available).
The [[Hardware|Hardware page]] is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see [[Status]]).
 
= Nvidia GPU on Grid'5000 =
 
Note that NVIDIA drivers (see <code class="command">nvidia-smi</code>) and CUDA (<code class="command">nvcc --version</code>) compilation tools are installed by default on nodes.  
 
== Choosing a GPU ==


= GPU accelerators on Grid'5000 =
Have a look at the detailed per-site hardware pages (for instance, [[Lyon:Hardware#gemini|Lyon]]). There you will find useful information about GPUs:
* the card model name (see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units to know more about each model)
* the GPU memory size available for computations
* for NVIDIA GPUs, their compute capability
* the hosting node characteristics (number of CPUs, amount of memory available, number of GPUs, reservable local disk availability, ...)
* the job access conditions (i.e., default or production queue, max walltime partition for clusters in the production queues)


In this section, we first reserve a GPU node. We then compile and execute examples provided by the CUDA Toolkit on the default (production) environment. We also run our BLAS example to illustrate GPU performance for dense matrix multiplication. Finally, we explain how to use the latest CUDA version with the module feature and how to install the NVIDIA drivers and compilers, before validating the installation on the previous example set.
=== About NVidia and CUDA compatibility with older GPUs ===


== A note on NVIDIA drivers and CUDA support ==
Most GPUs available in Grid'5000 are supported by the NVIDIA driver and CUDA version delivered in Grid'5000 environments. As of October 2021, there are two exceptions:


NVIDIA drivers (see <code class="command">nvidia-smi</code>) and CUDA (<code class="command">nvcc --version</code>) compilation tools are installed by default on nodes. NVIDIA no longer supports older GPU generations, but this environment works with the GPU accelerators available on Grid'5000, since those currently installed on chifflet (Lille) and graphique (in the [[Nancy:Production|production queue]] of Nancy) are quite recent. See this [http://www.nvidia.com/Download/index.aspx?lang=en-us page] to find which versions fit your GPU.
* K40m GPUs available in the ''grimani'' cluster in Nancy require the <code>nvcc</code> option <code>--gpu-architecture=sm_35</code> (35 for compute capability <code>3.5</code>) to be used with CUDA starting from version 11, which is the version shipped with our debian11 environment.


The following table summarizes the situation on Grid'5000 as of April 2019:
* M2075 GPUs (compute capability 2.0) of the ''orion'' cluster in Lyon are not supported by the driver shipped in our environments. GPUs in this cluster are no longer usable from our environments, and the ''gpu'' property used to select a GPU node using oarsub (see below) is disabled. Note that it is still possible to [[#Custom_Nvidia_driver_using_deployment|build an environment with a custom driver]] to use these cards.
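
The relation between a compute capability and the value passed to the <code>nvcc</code> architecture option can be sketched with a small shell helper. This is only an illustration: the function name <code>cc_to_arch</code> and the file name <code>my_kernel.cu</code> are assumptions made for this example and are not part of the Grid'5000 tooling.

```shell
# Hypothetical helper: turn a compute capability such as "3.5" into the
# value expected by nvcc's --gpu-architecture option ("sm_35").
cc_to_arch() {
    # Remove the dot from the capability and prefix the result with "sm_".
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Possible usage on a grimani node (K40m, compute capability 3.5).
# my_kernel.cu is a placeholder file name:
#   nvcc --gpu-architecture="$(cc_to_arch 3.5)" -o my_kernel my_kernel.cu
```

The helper only formats the option value; whether a given CUDA version still accepts that architecture has to be checked against the NVIDIA documentation.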


{| class="program" style="border:1px dotted black;"
See https://en.wikipedia.org/wiki/CUDA#GPUs_supported to know more about the relationship between CUDA versions and compute capability.
! Site
! Cluster
! GPU
! OAR properties
! Latest Driver version
! Latest CUDA version
|-
| Lyon
| orion (4 nodes)
| Nvidia Tesla-M2075 (1 per node)
| -p "gpu_count > 0"
| 396.44
| 9.0
|-
| Nancy
| graphique (6 nodes)
| Nvidia Titan Black GPU (2 on graphique-1),<br />
Nvidia GTX 980 GPU (2 per node)
| -p "gpu_count > 0" -q production
| 396.44
| 9.0
|-
| Nancy
| grele (14 nodes)
| Nvidia GTX 1080Ti (2 per node)
| -p "gpu_count > 0" -q production
| 396.44
| 9.0
|-
| Nancy
| grimani (6 nodes)
| Nvidia K40m GPU (2 per node)
| -p "gpu_count > 0" -q production
| 396.44
| 9.0
|-
| Lille
| chifflet (8 nodes)
| Nvidia GTX 1080Ti (2 per node)
| -p "gpu_count >0"
| 396.44
| 9.0
|-
| Lille
| chifflot (8 nodes)
| Nvidia Tesla P100 (2 on chifflot-[1-6]),<br />
Nvidia Tesla V100 (2 on chifflot-[7-8])
| -p "gpu_count > 0"
| 396.44
| 9.0
|}


== Reserving GPUs ==


=== Single GPU ===
If you only need a single GPU in the standard environment, reservation is as simple as:


Line 91: Line 48:
{{Note|text=On a multi-GPU node, this will give you only part of the memory and CPU resources. For instance, on a dual-GPU node, reserving a single GPU will give you access to half of the system memory and half of the CPU cores. This ensures that another user can reserve the other GPU and still have access to enough system memory and CPU cores.}}


In Nancy, you have to use the production queue for some GPU clusters, for instance:
In Nancy, you have to use the production queue for most of the GPU clusters, for instance:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-q production</code> <code class="command">-l "gpu=1"</code>}}


Line 100: Line 57:
{{Note|text=When you run <code class="command">nvidia-smi</code>, you will only see the GPU(s) you reserved, even if the node has more GPUs. This is the expected behaviour.}}


To select a specific model of GPU, use the "gpu_model" property, e.g.
To select a specific GPU model, there are two possibilities:
 
'''use GPU model aliases, as described in [[OAR Syntax simplification#GPUs]], e.g.'''
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-l gpu=1</code> <code class="command">-p </code><code class="replace">gpu_alias</code>}}
 
'''use the "gpu_model" property, e.g.'''
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-l gpu=1</code> <code class="command">-p "gpu_model =</code> '<code class="replace">GPU model</code>'"}}  


The exact list of GPU models is available on the [[OAR_Properties#gpu_model|OAR properties page]], and you can use [[Hardware#Accelerators_.28GPU.2C_Xeon_Phi.29|Hardware page]] to have an overview of available GPUs on each site.
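
Because the <code>-p</code> expression nests single quotes inside double quotes, it can be convenient to build it with a small shell helper. The sketch below is only an illustration: the function name <code>gpu_model_filter</code> is hypothetical, and the model string used in the usage comment is a placeholder that must be replaced by an exact value from the OAR properties page.

```shell
# Hypothetical helper: build the OAR property expression that selects
# a given GPU model, with the single quotes OAR expects around the value.
gpu_model_filter() {
    printf "gpu_model = '%s'" "$1"
}

# Possible usage from a frontend ('GPU model' is a placeholder value):
#   oarsub -I -l gpu=1 -p "$(gpu_model_filter 'GPU model')"
```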


== Reserving full nodes with GPUs ==
=== Reserving full nodes with GPUs ===


In some cases, you may want to reserve a complete node with all its GPUs. This allows you to customize the software environment with [[Sudo-g5k]] or even to [[Getting_Started|deploy another operating system]].
Line 120: Line 82:
If you want to deploy an environment on the node, you should add the <code class="command">-t deploy</code> option.


== Copy the CUDA Toolkit examples ==
=== Note about AMD GPU ===
 
As of October 2021, AMD GPUs are available in a single Grid'5000 cluster, [[Lyon:Hardware#neowise|neowise]], in Lyon. The <code class="command">oarsub</code> commands shown above may give you either NVIDIA or AMD GPUs. The ''gpu_model'' property may be used to filter between GPU vendors. For instance:
 
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-p "gpu_count > 0 AND gpu_model NOT LIKE 'Radeon%'"</code>}}
 
will filter out Radeon GPUs (=AMD GPUs). See [[#AMD GPU on Grid'5000|below]] for more information about AMD GPUs.
 
== GPU usage tutorial ==
 
In this section, we give an example of GPU usage on Grid'5000.
 
Every step of this tutorial must be performed on an NVIDIA GPU node.
 
=== Run the CUDA Toolkit examples ===
 
In this part, we are going to compile and execute the CUDA examples provided by NVIDIA, using the CUDA Toolkit available in the default (standard) environment.
 
First, we retrieve the version of CUDA installed on the node:


We copy CUDA 9.0 samples and extract them on /tmp/samples:
<pre>
{{Term|location=node|cmd= cp -r /usr/local/cuda/samples /tmp }}
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
</pre>


The version is 11.2. We are going to download the corresponding CUDA samples.


{{Note|text=These samples are part of the [https://developer.nvidia.com/cuda-90-download-archive CUDA 9.0 Toolkit] and can also be extracted from the toolkit installer using the ''--extract=/path'' option.}}
<pre>
cd /tmp
git clone --depth 1 --branch v11.2 https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
</pre>
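
The version lookup and the choice of the matching samples branch can also be scripted. The following is a minimal sketch, assuming that the <code>nvcc --version</code> output keeps the "release X.Y" format shown above and that a <code>vX.Y</code> tag exists in the samples repository for the installed release (this is the case for v11.2, but is not guaranteed for every release). The function name <code>cuda_release</code> is hypothetical.

```shell
# Hypothetical helper: read the output of `nvcc --version` on stdin and
# print only the release number (for instance "11.2").
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Possible usage on a node:
#   tag="v$(nvcc --version | cuda_release)"
#   git clone --depth 1 --branch "$tag" https://github.com/NVIDIA/cuda-samples.git
```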


The CUDA examples are described in <code class="file">/tmp/samples/Samples.html</code>. You might also want to have a look at the <code class="file">doc/</code> directory or the [http://docs.nvidia.com/cuda/cuda-samples/index.html#getting-started-with-cuda-samples online documentation].
You can compile all the examples at once by running make:


== Compiling the CUDA Toolkit examples ==
<pre>
make -j8
</pre>


You can compile all the examples at once. From the CUDA samples source directory (<code class="file">/tmp/samples</code>), run make to compile examples:
{{Term|location=node|cmd=<code class="command">cd /tmp/samples</code>}}
{{Term|location=node|cmd=<code class="command">make -j8</code>}}
The compilation of all the examples is over when "Finished building CUDA samples" is printed.   
Each example can also be compiled separately from its own directory.


You can first try the <code class="file">Device Query</code> example located in <code class="file">/tmp/samples/1_Utilities/deviceQuery/</code>. It enumerates the properties of the CUDA devices present in the system.
Each example is available in its own directory, under the <code class="file">Samples</code> root directory (it can also be compiled separately from there).
{{Term|location=node|cmd=<code class="command">/tmp/samples/1_Utilities/deviceQuery/deviceQuery</code>}}
 
You can first try the <code class="file">Device Query</code> example located in <code class="file">Samples/deviceQuery/</code>. It enumerates the properties of the CUDA devices present in the system.


Here is an example of the result on the graphique cluster at Nancy:
<pre>
<pre>
/tmp/samples/1_Utilities/deviceQuery/deviceQuery Starting...
/tmp/cuda-samples/Samples/deviceQuery/deviceQuery
</pre>
 
Here is an example of the result on the chifflet cluster at Lille:
 
<pre>
/tmp/cuda-samples/Samples/deviceQuery/deviceQuery Starting...
 
  CUDA Device Query (Runtime API) version (CUDART static linking)
  CUDA Device Query (Runtime API) version (CUDART static linking)


Detected 2 CUDA Capable device(s)
Detected 2 CUDA Capable device(s)


Device 0: "GeForce GTX TITAN Black"
Device 0: "GeForce GTX 1080 Ti"
   CUDA Driver Version / Runtime Version          9.0 / 9.0
   CUDA Driver Version / Runtime Version          11.2 / 11.2
   CUDA Capability Major/Minor version number:    3.5
   CUDA Capability Major/Minor version number:    6.1
   Total amount of global memory:                6082 MBytes (6377766912 bytes)
   Total amount of global memory:                11178 MBytes (11721506816 bytes)
   (15) Multiprocessors, (192) CUDA Cores/MP:    2880 CUDA Cores
   (28) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
   GPU Max Clock rate:                            980 MHz (0.98 GHz)
   GPU Max Clock rate:                            1582 MHz (1.58 GHz)
   Memory Clock rate:                            3500 Mhz
   Memory Clock rate:                            5505 Mhz
   Memory Bus Width:                              384-bit
   Memory Bus Width:                              352-bit
   L2 Cache Size:                                1572864 bytes
   L2 Cache Size:                                2883584 bytes
   Maximum Texture Dimension Size (x,y,z)        1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
   Maximum Texture Dimension Size (x,y,z)        1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
   Total amount of constant memory:              65536 bytes
   Total amount of constant memory:              65536 bytes
   Total amount of shared memory per block:      49152 bytes
   Total amount of shared memory per block:      49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
   Total number of registers available per block: 65536
   Total number of registers available per block: 65536
   Warp size:                                    32
   Warp size:                                    32
Line 170: Line 167:
   Maximum memory pitch:                          2147483647 bytes
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                            512 bytes
   Texture alignment:                            512 bytes
   Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
   Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
   Run time limit on kernels:                    No
   Run time limit on kernels:                    No
   Integrated GPU sharing Host Memory:            No
   Integrated GPU sharing Host Memory:            No
Line 177: Line 174:
   Device has ECC support:                        Disabled
   Device has ECC support:                        Disabled
   Device supports Unified Addressing (UVA):      Yes
   Device supports Unified Addressing (UVA):      Yes
   Supports Cooperative Kernel Launch:            No
  Device supports Managed Memory:                Yes
   Supports MultiDevice Co-op Kernel Launch:      No
  Device supports Compute Preemption:            Yes
   Device PCI Domain ID / Bus ID / location ID:  0 / 3 / 0
   Supports Cooperative Kernel Launch:            Yes
   Supports MultiDevice Co-op Kernel Launch:      Yes
   Device PCI Domain ID / Bus ID / location ID:  0 / 4 / 0
   Compute Mode:
   Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >


Device 1: "GeForce GTX TITAN Black"
Device 1: "GeForce GTX 1080 Ti"
   CUDA Driver Version / Runtime Version          9.0 / 9.0
   CUDA Driver Version / Runtime Version          11.2 / 11.2
   CUDA Capability Major/Minor version number:    3.5
   CUDA Capability Major/Minor version number:    6.1
   Total amount of global memory:                6082 MBytes (6377766912 bytes)
   Total amount of global memory:                11178 MBytes (11721506816 bytes)
   (15) Multiprocessors, (192) CUDA Cores/MP:    2880 CUDA Cores
   (28) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
   GPU Max Clock rate:                            980 MHz (0.98 GHz)
   GPU Max Clock rate:                            1582 MHz (1.58 GHz)
   Memory Clock rate:                            3500 Mhz
   Memory Clock rate:                            5505 Mhz
   Memory Bus Width:                              384-bit
   Memory Bus Width:                              352-bit
   L2 Cache Size:                                1572864 bytes
   L2 Cache Size:                                2883584 bytes
   Maximum Texture Dimension Size (x,y,z)        1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
   Maximum Texture Dimension Size (x,y,z)        1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
   Total amount of constant memory:              65536 bytes
   Total amount of constant memory:              65536 bytes
   Total amount of shared memory per block:      49152 bytes
   Total amount of shared memory per block:      49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
   Total number of registers available per block: 65536
   Total number of registers available per block: 65536
   Warp size:                                    32
   Warp size:                                    32
Line 205: Line 205:
   Maximum memory pitch:                          2147483647 bytes
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                            512 bytes
   Texture alignment:                            512 bytes
   Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
   Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
   Run time limit on kernels:                    No
   Run time limit on kernels:                    No
   Integrated GPU sharing Host Memory:            No
   Integrated GPU sharing Host Memory:            No
Line 212: Line 212:
   Device has ECC support:                        Disabled
   Device has ECC support:                        Disabled
   Device supports Unified Addressing (UVA):      Yes
   Device supports Unified Addressing (UVA):      Yes
   Supports Cooperative Kernel Launch:            No
  Device supports Managed Memory:                Yes
   Supports MultiDevice Co-op Kernel Launch:      No
  Device supports Compute Preemption:            Yes
   Supports Cooperative Kernel Launch:            Yes
   Supports MultiDevice Co-op Kernel Launch:      Yes
   Device PCI Domain ID / Bus ID / location ID:  0 / 130 / 0
   Device PCI Domain ID / Bus ID / location ID:  0 / 130 / 0
   Compute Mode:
   Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX TITAN Black (GPU0) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX TITAN Black (GPU0) : No
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : No


deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.2, NumDevs = 2
Result = PASS
Result = PASS
</pre>


=== BLAS examples ===


</pre>
We now run our BLAS example to illustrate GPU performance for dense matrix multiplication.
 
== BLAS examples ==


The toolkit provides the [https://developer.nvidia.com/cublas CUBLAS] library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available [http://docs.nvidia.com/cuda/cublas/index.html here] and several [http://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries advanced examples] using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...).


The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides [http://docs.nvidia.com/cuda/nvblas/ NVBLAS], a library that automatically ''offloads'' compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program as NVBLAS can be [http://www.manpages.info/linux/ld.so.8.html forcibly linked] using the LD_PRELOAD environment variable.
The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides [http://docs.nvidia.com/cuda/nvblas/ NVBLAS], a library that automatically ''offloads'' compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program as NVBLAS can be [https://man7.org/linux/man-pages/man8/ld.so.8.html forcibly linked] using the LD_PRELOAD environment variable.
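
As a rough sketch of the preloading mechanism, NVBLAS reads a configuration file (<code>nvblas.conf</code> by default, or the file named by the <code>NVBLAS_CONFIG_FILE</code> environment variable) that must at least name a CPU BLAS library to fall back on. The helper below writes such a file. The function name <code>write_nvblas_conf</code> is hypothetical, and the CPU BLAS library path is an assumption that depends on what is installed on the node.

```shell
# Hypothetical helper: write a minimal NVBLAS configuration file.
#   $1: path of the configuration file to create
#   $2: path of the CPU BLAS shared library used as fallback (assumption:
#       adapt it to the BLAS library actually installed on the node)
write_nvblas_conf() {
    cat > "$1" <<EOF
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB $2
NVBLAS_GPU_LIST ALL
EOF
}

# Possible usage on a node (the library path is a placeholder):
#   write_nvblas_conf ./nvblas.conf /path/to/libblas.so
#   NVBLAS_CONFIG_FILE=./nvblas.conf LD_PRELOAD=libnvblas.so ./matmatmul 5000
```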


To test NVBLAS, you can download and compile our matrix-matrix multiplication example:
Line 249: Line 251:
   BLAS  - Time elapsed:  1.249E-01 sec.
   BLAS  - Time elapsed:  1.249E-01 sec.


CPU/GPU comparisons become more meaningful with larger problems:
Depending on the node hardware, the GPU might perform better on larger problems:
{{Term|location=node|cmd=./matmatmul 5000}}
{{Term|location=node|cmd=./matmatmul 5000}}
   Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
   Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
Line 259: Line 261:
   BLAS  - Time elapsed:  1.718E+00 sec.
   BLAS  - Time elapsed:  1.718E+00 sec.


If you want to measure the time spent on data transfers to the GPU, you can use the simpleCUBLAS (<code class="file">/tmp/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS</code>) example and instrument the code with timers.
If you want to measure the time spent on data transfers to the GPU, have a look at the simpleCUBLAS (<code class="file">/tmp/cuda-samples/Samples/simpleCUBLAS</code>) example and instrument the code with timers.


== Custom Nvidia drivers or CUDA toolkit version ==
== Custom CUDA version or Nvidia drivers ==


=== Custom Nvidia driver using deployment ===
Here, we explain how to use other CUDA versions with [[Modules]], use Nvidia Docker images and install the NVIDIA drivers and compilers before validating the installation on the previous example set.


First, reserve a GPU node and deploy the <code class="file">debian10-x64-nfs</code> environment. This environment allows you to connect either as ''root'' (to be able to install new software) or using your normal Grid'5000 (including access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:
=== Older or newer CUDA version using modules ===
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "gpu_count > 0" -l /nodes=1,walltime=2}}
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODE_FILE -e debian10-x64-nfs -k}}


Once the deployment is terminated, you should be able to connect to the node as root:
Different CUDA versions can be loaded using the <code class="command">module</code> command. You should first choose the CUDA toolkit version that you will load with the module tool:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE`}}
 
You can then perform the NVIDIA driver installation:
 
{{Term|location=node|cmd=apt-get -y install linux-headers-amd64 make g++}}
 
{{Term|location=node|cmd=<code class="command">wget https://download.nvidia.com/XFree86/Linux-x86_64/410.57/NVIDIA-Linux-x86_64-410.57.run</code>}}
 
{{Term|location=node|cmd=<code class="command">sh NVIDIA-Linux-x86_64-410.57.run -s --no-install-compat32-libs</code>}}
(warnings about X.Org can safely be ignored)
 
On the node you can check which NVIDIA drivers are installed with the <code class="command">nvidia-smi</code> tool:
 
{{Term|location=node|cmd=<code class="command">nvidia-smi</code>}}
 
Here is an example of the result on the chifflet cluster:


{{Term|location=node|cmd= module av cuda}}
<pre>
<pre>
chifflet-7:~# nvidia-smi
Tue Apr  9 15:56:10 2019     
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|  0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 18%  29C    P0    58W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  1  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 19%  23C    P0    53W / 250W |      0MiB / 11178MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
                                                                             
+-----------------------------------------------------------------------------+
| Processes:                                                      GPU Memory |
|  GPU      PID  Type  Process name                            Usage      |
|=============================================================================|
|  No running processes found                                                |
+-----------------------------------------------------------------------------+
</pre>
=== Newer CUDA version with module ===


You should choose the CUDA toolkit version that you will load with module tool:
------------- /grid5000/spack/v1/share/spack/modules/linux-debian11-x86_64_v2 ----------------
  cuda/11.4.0_gcc-10.4.0    cuda/11.6.2_gcc-10.4.0    cuda/11.7.1_gcc-10.4.0 (D)


{{Term|location=node|cmd= module av}}
<pre>
----------------------- /grid5000/spack/share/spack/modules/linux-debian9-x86_64 -----------------------
[...]
cuda/7.5.18_gcc-6.4.0
cuda/8.0.61_gcc-6.4.0
cuda/9.0.176_gcc-6.4.0
cuda/9.1.85_gcc-6.4.0
cuda/9.2.88_gcc-6.4.0
cuda/10.0.130_gcc-6.4.0
cuda/10.1.243_gcc-6.4.0
[...]
</pre>
</pre>


{{Term|location=node|cmd= module load cuda/10.0.130_gcc-6.4.0}}
{{Term|location=node|cmd= module load cuda/11.6.2_gcc-10.4.0}}


{{Term|location=node|cmd= nvcc --version}}
{{Term|location=node|cmd= nvcc --version}}
<pre>
<pre>
nvcc: NVIDIA (R) Cuda compiler driver
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 10.1, V10.1.243
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
</pre>
</pre>


You should consult [https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver CUDA Toolkit and Compatible Driver Versions] table and download the corresponding installer. (Cuda 10.0.x toolkit: driver version >= 410.48)  
You should consult [https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver CUDA Toolkit and Compatible Driver Versions] to ensure compatibility with a specific Cuda version and the Nvidia GPU driver (for instance, Cuda 11.x toolkit requires a driver version >= 450.80.02)
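The driver requirement quoted above can be checked mechanically. The following is only a sketch: the helper name <code>driver_at_least</code> is made up for this page, it relies on the version sort (<code>-V</code>) of GNU <code>sort</code>, and the 450.80.02 threshold is simply the value given above for CUDA 11.x.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid'5000 or of the CUDA tooling):
# succeeds when version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
driver_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example with a driver version that appears elsewhere on this page.
if driver_at_least "470.82.01" "450.80.02"; then
    echo "driver is recent enough for CUDA 11.x"
else
    echo "driver is too old for CUDA 11.x"
fi
```

On a node, the first argument could be taken from <code>nvidia-smi --query-gpu=driver_version --format=csv,noheader</code> instead of being typed by hand.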


=== Copy and compile the sample examples ===
=== Copy and compile the sample examples ===


You now have everything installed. For instance, you can compile and run the toolkit examples (see [[#Compiling the CUDA Toolkit examples]] for more information):
You now have everything installed. For instance, you can compile and run the toolkit examples (see [[#Compiling the CUDA Toolkit examples]] for more information).
 
You will need to override the CUDA path variable, and also load the matching compiler version from modules:


{{Term|location=node|cmd=which nvcc}}
{{Term|location=node|cmd=which nvcc}}
<pre>/grid5000/spack/opt/spack/linux-debian9-x86_64/gcc-6.4.0/cuda-10.1.243-am4nmkjzn2gofwt2xvvwysbklkph2c2u/bin/nvcc</pre>
<pre>/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e/bin/nvcc</pre>
{{Term|location=node|cmd=cp -R /grid5000/spack/opt/spack/linux-debian9-x86_64/gcc-6.4.0/cuda-10.1.243-am4nmkjzn2gofwt2xvvwysbklkph2c2u/samples /tmp/}}
{{Term|location=node|cmd=cd /tmp/samples}}
{{Term|location=node|cmd=make -j8}}
{{Term|location=node|cmd=./0_Simple/matrixMulCUBLAS/matrixMulCUBLAS}}


{{Term|location=node|cmd=export CUDA_PATH=/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e}}
{{Term|location=node|cmd=module load gcc/10.4.0_gcc-10.4.0}}
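Rather than copying the long Spack path by hand, the CUDA root can be derived from the location of <code>nvcc</code>. This is a minimal sketch, assuming that <code>nvcc</code> sits in the <code>bin/</code> directory of the CUDA root, as in the <code>which nvcc</code> output above; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the CUDA root for a given nvcc path,
# assuming the layout <CUDA root>/bin/nvcc.
cuda_root_from_nvcc() {
    dirname "$(dirname "$1")"
}

# Example using the path printed by `which nvcc` above.
cuda_root_from_nvcc /grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e/bin/nvcc
```

With the module loaded, <code>export CUDA_PATH=$(cuda_root_from_nvcc "$(which nvcc)")</code> would then set the variable without hard-coding the hash in the directory name.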


The newly created environment can be saved with tgz-g5k, to be reused later:
And then you can build and run the examples:
{{Term|location=frontend|cmd=<code class="command">tgz-g5k</code> -m `head -1 $OAR_NODE_FILE` -f <code class="replace">myimagewithcuda</code>.tgz}}
{{Note|text=Please note that with some old GPUs you might encounter errors when running the latest version of CUDA. This is the case with the orion cluster, for example.}}
{{Warning|text= Please note that /tmp is erased when the node boots. If you want to use tgz-g5k to save your CUDA installation, make sure to copy everything you need from /tmp to another directory, or you will not find it again after deploying your custom environment.}}


{{Term|location=node|cmd=git clone --depth 1 --branch v11.6 https://github.com/NVIDIA/cuda-samples.git /tmp/cuda-samples}}
{{Term|location=node|cmd=cd /tmp/cuda-samples}}
{{Term|location=node|cmd=make -j32}}
{{Term|location=node|cmd=./Samples/0_Introduction/matrixMul/matrixMul}}


<!--
{{Note|text=Please note that with some old GPUs you might encounter errors when running the latest version of CUDA. This is the case with the orion cluster, for example.}}
=== Installing the CUDA toolkit on a deployed environment ===


GPU nodes at Lyon and Nancy are supported by the latest GPU drivers. See this [http://www.nvidia.com/Download/index.aspx?lang=en-us page] to find which versions fit your GPU. The following table summarizes the situation on Grid'5000 as of April 2019:
=== Nvidia-docker ===
A script to install nvidia-docker is available if you want to use Nvidia's images built for Docker and GPU nodes. This provides an alternative way of making CUDA and Nvidia libraries available to the node. See [[Docker#Nvidia-docker|Nvidia Docker page]].


{| class="program" style="border:1px dotted black;"
=== Custom Nvidia driver using deployment ===
! Site
! Cluster
! GPU
! OAR properties
! Latest Driver version
! Latest CUDA version
|-
| Lyon
| orion (4 nodes)
| Nvidia Tesla-M2075 (1 per node)
| -p "GPU!='NO'"
| 396.44
| 9.0
|-
| Nancy
| graphique (6 nodes)
| Nvidia Titan Black GPU (2 on graphique-1),<br />
Nvidia GTX 980 GPU (2 per node)
| -p "GPU!='NO'" -q production
| 396.44
| 9.0
|-
| Nancy
| grimani (6 nodes)
| Nvidia K40m GPU (2 per node)
| -p "GPU!='NO'" -q production
| 396.44
| 9.0
|-
| Lille
| chifflet (8 nodes)
| Nvidia GTX 1080Ti (2 per node)
| -p "GPU!='NO'"
| 396.44
| 9.0
|}


A custom Nvidia driver may be installed on a node if needed. As ''root'' privileges are required, we will use kadeploy to deploy a <code class="file">debian11-x64-nfs</code> environment on the GPU node you reserved.


This environment allows you to connect either as ''root'' (to be able to install new software) or using your normal Grid'5000 user account (including access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:


==== Deployment ====
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "gpu_count > 0" -l /nodes=1,walltime=2}}  
 
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODE_FILE -e debian11-x64-nfs -k}}  
First, reserve a GPU node and deploy the <code class="file">debian10-x64-nfs</code> environment. This environment allows you to connect either as ''root'' (to be able to install new software) or using your normal Grid'5000 user account (including access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "GPU!='<code class="replace">NO</code>'" -l /nodes=1,walltime=2}}  
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODE_FILE -e debian10-x64-nfs -k}}  


Once the deployment is terminated, you should be able to connect to the node as root:
Once the deployment is terminated, you should be able to connect to the node as root:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE`}}
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE`}}


==== Downloading the NVIDIA toolkit ====
You can then perform the NVIDIA driver installation:
We will now install the NVIDIA drivers, compilers, libraries and examples. The complete CUDA distribution can be downloaded from [https://developer.nvidia.com/cuda-toolkit-archive the official website] or from <code>https://www.grid5000.fr/packages/debian/</code>. Select a toolkit version compatible with your GPU hardware:
{{Term|location=node|cmd=cd /tmp/; wget https://www.grid5000.fr/packages/debian/cuda_9.0.176_384.81_linux-run}}


When the download is over, you can look at the installer options:
{{Term|location=node|cmd=apt-get -y install linux-headers-amd64 make g++}}
{{Term|location=node|cmd=<code class="command">sh</code> /tmp/cuda_<version>.run --help}}


There are actually three distinct installers (for the drivers, compilers and examples) embedded on the file and you can extract them using:
{{Term|location=node|cmd=<code class="command">wget https://download.nvidia.com/XFree86/Linux-x86_64/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run</code>}}
{{Term|location=node|cmd=<code class="command">sh</code> /tmp/cuda_<version>.run --extract=/tmp/installers}}
With cuda_9.0.176_384.81_linux-run, you obtain:
* NVIDIA-Linux-x86_64-384.81.run: the drivers installer (version 384.81)
* cuda-linux.9.0.176-22781540.run: the CUDA toolkit installer (i.e. compilers, libraries)
* cuda-samples.9.0.176-22781540-linux.run: the CUDA samples installer
Each installer provides a --help option.


==== NVIDIA driver and CUDA installation ====
{{Term|location=node|cmd=<code class="command">rmmod nouveau</code>}}


To install the linux driver (ie. kernel module), we need the <code>kernel header files</code> and <code>gcc 6.3</code> (as the module should be compiled with the same version of gcc used to compile the kernel in the first place). For CUDA and its samples, we also need g++ and make:
{{Term|location=node|cmd=<code class="command">sh NVIDIA-Linux-x86_64-470.82.01.run -s --no-install-compat32-libs</code>}}
 
{{Term|location=node|cmd=apt-get update && apt-get upgrade && reboot}}
{{Term|location=node|cmd=apt-get -y install make g++ linux-headers-amd64 # it also installs gcc-6.3}}
 
To compile and install the kernel module, use:
{{Term|location=node|cmd=cd /tmp/installers}}
{{Term|location=node|cmd=sh NVIDIA-Linux-x86_<version>.run --accept-license --silent --no-install-compat32-libs # note: do not use --no-install-compat32-libs with CUDA 6.5}}
(warnings about X.Org can safely be ignored)
(warnings about X.Org can safely be ignored)


To install the CUDA toolkit, use:
On the node you can check which NVIDIA drivers are installed with the <code class="command">nvidia-smi</code> tool:
{{Term|location=node|cmd=<code class="command">sh</code> cuda-linux64-rel-<version>.run -noprompt}}


You can add the CUDA toolkit to your current shell environment by using:
{{Term|location=node|cmd=export PATH=$PATH:/usr/local/cuda/bin}}
{{Term|location=node|cmd=export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64}}
To make those environment variables permanent for future ssh sessions, you can edit the default PATH in <code class="file">/etc/profile</code> and add a configuration file for the dynamic linker under <code class="file">/etc/ld.so.conf.d/</code> as follows:
{{Term|location=node|cmd=<code class="command">sed</code> -e "s,:/bin,:/bin:/usr/local/cuda/bin," -i /etc/profile}}
{{Term|location=node|cmd=<code class="command">echo</code> -e "/usr/local/cuda/lib\n/usr/local/cuda/lib64" > /etc/ld.so.conf.d/cuda.conf}}
If you modify <code class="file">/etc/ld.so.conf.d/</code>, you also need to run <code class="command">ldconfig</code> as root to update the linker configuration.
{{Term|location=node|cmd=<code class="command">ldconfig</code>}}
To check if NVIDIA drivers are correctly installed, you can use the <code class="command">nvidia-smi</code> tool:
{{Term|location=node|cmd=<code class="command">nvidia-smi</code>}}
{{Term|location=node|cmd=<code class="command">nvidia-smi</code>}}


Here is an example of the result on the chifflet cluster:
Here is an example of the result on the graphique cluster:


<pre>
<pre>
root@chifflet-8:/tmp/installers$ nvidia-smi
root@graphique-4:~# nvidia-smi
Mon Jan 22 15:29:07 2018   
Tue Jun 27 19:37:15 2023     
 
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                Driver Version: 384.81                    |
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4    |
|-------------------------------+----------------------+----------------------+
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
|                              |                      |              MIG M. |
|===============================+======================+======================|
|===============================+======================+======================|
|  0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   25C    P0    58W / 250W |      0MiB / 11172MiB |      0%      Default |
| 26%   28C    P0    46W / 180W |      0MiB / 4043MiB |      0%      Default |
|                              |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-------------------------------+----------------------+----------------------+
|  1  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
|  1  NVIDIA GeForce ...  Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   22C    P0    57W / 250W |      0MiB / 11172MiB |      0%      Default |
| 28%   27C    P0    43W / 180W |      0MiB / 4043MiB |      2%      Default |
|                              |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-------------------------------+----------------------+----------------------+
                                                                                
                                                                                
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
| Processes:                                                                 |
|  GPU       PID  Type  Process name                             Usage      |
|  GPU   GI  CI        PID  Type  Process name                 GPU Memory |
|        ID  ID                                                  Usage      |
|=============================================================================|
|=============================================================================|
|  No running processes found                                                |
|  No running processes found                                                |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+


</pre>  
</pre>
 
You now have everything installed. You can connect to your node and use it using your regular Grid'5000 user. For instance, you can compile and run the toolkit examples (see [[#Compiling the CUDA Toolkit examples]] for more information):
 
{{Term|location=frontend|cmd=<code class="command">ssh</code> `head -1 $OAR_NODE_FILE`  #As a normal user, not root}}
{{Term|location=node|cmd=cd /tmp/installers}}
{{Term|location=node|cmd=sh cuda-samples-linux-<version>.run -noprompt  -prefix=/tmp/samples -cudaprefix=/usr/local/cuda/}}
{{Term|location=node|cmd=cd /tmp/samples}}
{{Term|location=node|cmd=make -j8}}
{{Term|location=node|cmd=./0_Simple/matrixMulCUBLAS/matrixMulCUBLAS}}
 
 
The newly created environment can be saved with [[TGZ-G5K|tgz-g5k]], to be reused later:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE` tgz-g5k > <code class="replace">myimagewithcuda</code>.tgz}}
{{Note|text=Please note that with some old GPUs you might encounter errors when running the latest version of CUDA. This is the case with the orion cluster, for example.}}
{{Warning|text= Please note that /tmp is erased when the node boots. If you want to use tgz-g5k to save your CUDA installation, make sure to copy everything you need from /tmp to another directory, or you will not find it again after deploying your custom environment.}}
 
-->
 
== Nvidia-docker ==
A script to install nvidia-docker is available if you want to use Nvidia's images built for Docker and GPU nodes. See [[Docker#Nvidia-docker|Nvidia Docker page]].
 
= Intel Xeon Phi (MIC) on Grid'5000 =
 
{{Note|text=This tutorial concerns the Knight Corner generation of the MIC (Intel Many Integrated Core Architecture), which are coprocessor cards (PCIe). At the time of writing this, Grid'5000 does not provide MIC of the next generations (e.g. KNL). }}
 
== Reserve a Xeon Phi at Nancy ==
 
Like NVIDIA GPUs, [https://en.wikipedia.org/wiki/Xeon_Phi Xeon Phi] coprocessor cards provide additional compute power and can be used to offload computations. Since these extension cards run a modified Linux kernel, it is also possible to log in directly to the Xeon Phi via ssh. It is also possible to compile an application for the Xeon Phi processor (which is based on x86 technology) and run it natively on the embedded Linux system of the Xeon Phi card.
 
Xeon Phi [http://ark.intel.com/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core 7120P] are available at Nancy.
 
Since the Xeon Phi cards currently available in Grid'5000 are fairly old, they are no longer supported by the default environment (OS) provided on nodes. As a result, one must deploy the previous version of the Grid'5000 environments (based on Debian 8 Jessie) on nodes in order to have the MIC stack installed.
 
To reserve a Grid'5000 node that includes a Xeon Phi, you can use this command:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "MIC='YES'" -t deploy}}
 
Then, nodes need to be deployed with the Jessie Big environment:
 
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -e jessie-x64-big -f $OAR_NODE_FILE -k}}
 
You should then be able to ssh to the nodes and check the status of the MIC card using micinfo:
 
{{Term|location=graphite|cmd=<code class="command">micinfo</code>}}
 
{{Warning|text=If you want to make sure no one else ssh to your deployed node, connect to the node as '''root''' and run:
  {{Term|location=graphite|cmd=<code class="command">echo</code> '-:ALL EXCEPT <code class="replace">your_g5k_username</code> root:ALL' >> /etc/security/access.conf}}
}}
 
== Intel compilers ==
 
Intel compilers are the most appropriate compilers for the Intel Xeon Phi. Intel compilers are available using the [[Environment_modules|module tool]] (''intel-parallel-studio'' package), but they require a commercial (or academic) license and such licenses are not provided by Grid'5000. However, you may have access to a license server in your local laboratory. See [[Environment_modules]] for instructions on configuring access to a license server.
 
Another option is to compile your programs on a machine where the Intel compiler is available, and then copy your executable binary (compiled code) to your Grid'5000 nodes (beware, however, that the CPU architectures must match).
 
Note that [https://gcc.gnu.org/wiki/Offloading GCC] and [http://openmp.llvm.org/ Clang] also provide a limited support for newest Xeon Phi. See [https://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm this page] for more information.
 
== Execution on Xeon Phi ==
=== Setup ===
{{Note|text=Since MIC are not supported anymore in the standard environment, the setup becomes more complex and involves several root level technical steps}}
First we have to make sure the MIC is able to access the Grid'5000 network. For that we can setup NAT on the host. The following commands have to be run as '''root''' on the host.
 
{{Term|location=graphite|cmd=<code class="command">iptables</code> -t nat -A POSTROUTING ! -o mic0 -j MASQUERADE}}
{{Term|location=graphite|cmd=<code class="command">echo</code> 1 > /proc/sys/net/ipv4/ip_forward}}
 
Then, we have to make sure the MIC is booted. You may look at the output of the <code class="command">dmesg</code> command, and if it is not booted, restart the MPSS service:
 
{{Term|location=graphite|cmd=<code class="command">service</code> mpss restart}}
 
{{Warning|text=As of Mar. 19th 2018, there is a bug in /etc/hosts of the environments: a newline is missing ({{Bug|9125}}), you must fix it.}}
 
Look again at the <code class="command">dmesg</code> command output and wait for the following line:
  mic0: Transition from state booting to online
 
You should now be able to SSH to the mic as root. Make sure NFS is mounted
 
{{Term|location=graphite-mic0|cmd=<code class="command">mount</code> -a}}
 
In order to connect to the Xeon Phi card with your Grid'5000 username using SSH, you have to setup your user. Run the following commands from the host as root:
 
{{Term|location=graphite|cmd=<code class="command">T=</code>($(getent passwd <code class='replace'>your_g5k_username</code> {{!}} tr ' ' _ {{!}} tr : ' '))}}
then
{{Term|location=graphite|cmd=<code class="command">micctrl</code> --useradd=${T[0]} --uid=${T[2]} --gid=${T[3]} --sshkeys=${T[5]}/.ssh}}
 
=== Using the MIC ===
 
An introduction to the Xeon Phi programming environment is available [http://software.intel.com/en-us/articles/intel-xeon-phi-programming-environment on the Intel website]. Other useful resources include:
* [http://spscicomp.org/wordpress/pages/the-intel-xeon-phi/ The IBM HPC Systems Scientific Computing User Group Tutorial]
* [http://www.hpc.cineca.it/content/quick-guide-intel-mic-usage CINECA/SCAI documentation].
 
Before using the Intel compilers or executing codes that dynamically link to Intel libraries, you have to set up your environment:
{{Term|location=graphite|cmd=<code class="command">source</code> /opt/intel/composerxe/bin/compilervars.sh intel64}}
 
==== Offload mode ====
 
In offload mode, your program is executed on the Host, but part of its execution is offloaded to the co-processor card.
Intel provided a code snippet on its [http://software.intel.com/en-us/articles/intel-xeon-phi-programming-environment tutorial] that shows a sum reduction operation being run on a Xeon Phi processor.
This example is available in the /grid5000/xeonphi/samples/ directory and can be compiled and executed as follows:
 
{{Term|location=graphite|cmd=<code class="command">icpc</code> -openmp /grid5000/xeonphi/samples/reduction-offload.cpp -o reduction-offload}}
{{Term|location=graphite|cmd=<code class="command">./reduction-offload</code>}}
 
==== Native mode ====


In native mode, your program is completely executed on the Xeon Phi. Your code must be compiled natively for the Xeon Phi architecture using the -mmic option:
If you want to record your environment with the custom NVidia driver, see [[Advanced_Kadeploy#Create_a_new_environment_from_a_customized_environment]]


{{Term|location=graphite|cmd=<code class="command">icpc</code> -mmic /grid5000/xeonphi/samples/hello.cpp -o hello_mic}}
= AMD GPU on Grid'5000 =


This program cannot be run on the host:
As of October 2021, Grid'5000 has one cluster with AMD GPU: [[Lyon:Hardware#neowise|neowise cluster in Lyon]].
{{Term|location=graphite|cmd=<code class="command">./hello_mic </code>}}
  -bash: ./hello_mic: cannot execute binary file: Exec format error


To execute this program, you have to execute it directly on the MIC. You can ssh to the MIC and run the program:
A neowise GPU may be reserved using:
{{Term|location=graphite|cmd=<code class="command">ssh </code> mic0}}
{{Term|location=graphite-mic0|cmd=<code class="command"> source</code> /grid5000/xeonphi/micenv}}
{{Term|location=graphite-mic0|cmd=<code class="command"> ./hello_mic</code>}}


{{Note|text=Your home directory is available from the Xeon Phi (as well as the /grid5000 directory)}}
{{Term|location=flyon|cmd=<code class="command">oarsub</code> -t exotic -p neowise -l gpu=1 -I}}


=== BLAS examples ===
A full neowise node may be reserved using:


Download our BLAS' matrix-matrix multiplication example:
{{Term|location=flyon|cmd=<code class="command">oarsub</code> -t exotic -p neowise -I}}


{{Term|location=graphite|cmd=wget http://apt.grid5000.fr/tutorial/gpu/matmatmul.c}}
The default environment on neowise includes part of AMD's [https://rocmdocs.amd.com/en/latest/index.html ''ROCm''] stack with AMD GPU driver and basic tools and libraries such as:
 
* <code class=command>rocm-smi</code> : get information about GPUs
The Intel MKL (Intel Math Kernel Library) library provides an implementation of the BLAS. Our BLAS example can be linked with the MKL:
* <code class=command>hipcc</code> : HIP compiler
 
* <code class=command>hipify-perl</code> : CUDA to HIP code converter
{{Term|location=graphite|cmd=icpc matmatmul.c -o matmatmul_mkl_seq -DHAVE_MKL -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread}}
 
MKL also provides a threaded version of the BLAS that can be used as follows:
{{Term|location=graphite|cmd=icpc matmatmul.c -o matmatmul_mkl    -DHAVE_MKL -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -fopenmp}}
More information on the compilation options can be found [https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor here].
 
You can compare the performances of the different flavors:
{{Term|location=graphite|cmd=gcc -O3 -Wall -std=c99 matmatmul.c -o matmatmul_defaultblas -lblas}}
{{Term|location=graphite|cmd=./matmatmul_defaultblas 5000}}
  Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
  BLAS  - Time elapsed:  3.605E+01 sec.
 
{{Term|location=graphite|cmd=./matmatmul_mkl_seq 5000}}
  Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
  BLAS  - Time elapsed: 1.676E+01 sec.
 
{{Term|location=graphite|cmd=./matmatmul_mkl 5000}}
  Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
  BLAS  - Time elapsed:  1.855E+00 sec.
 
The MKL is also available natively on the Xeon Phi:
{{Term|location=graphite|cmd=icpc -mmic matmatmul.c -o matmatmul_mic  -DHAVE_MKL -L${MKLROOT}/lib/mic -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -fopenmp}}
{{Term|location=graphite|cmd=<code class="command">ssh </code> mic0}}
{{Term|location=graphite-mic0|cmd=<code class="command"> source</code> /grid5000/xeonphi/micenv}}
{{Term|location=graphite-mic0|cmd=<code class="command"> source</code> /grid5000/software/intel/mkl/bin/mklvars.sh mic}}
{{Term|location=graphite-mic0|cmd=./matmatmul_mic 5000}}
  Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
  BLAS  - Time elapsed:  1.200E+00 sec.
 
MKL also provides an [https://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf automatic offload mode] that can be compared to the NVBLAS library of NVIDIA GPU:
{{Term|location=graphite|cmd=export MKL_MIC_ENABLE=1}}
{{Term|location=graphite|cmd=export OFFLOAD_REPORT=1}}
{{Term|location=graphite|cmd=export MKL_MIC_DISABLE_HOST_FALLBACK=1}}
{{Term|location=graphite|cmd=./matmatmul_mkl 5000}}
<pre>
Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
BLAS  - [MKL] [MIC --] [AO Function]        DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]        0.35 0.65
 
[MKL] [MIC 00] [AO DGEMM CPU Time]        3.050871 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]        0.541124 seconds
Time elapsed:  3.168E+00 sec.
</pre>


The data transfer cost between the host and the accelerator is amortized for larger matrices:
In addition, most libraries and development tools from ROCm and HIP (available at https://rocmdocs.amd.com/en/latest/Installation_Guide/Software-Stack-for-AMD-GPU.html) are available as [[Modules|modules]]. Deep Learning Frameworks pytorch and TensorFlow are also [[Deep Learning Frameworks#Deep_learning_with_AMD_GPUs|known to work]].
{{Term|location=graphite|cmd=MKL_MIC_ENABLE=0 ./matmatmul_mkl 10000}}
  Multiplying Matrices: C(10000x10000) = A(10000x10000) x B(10000x10000)
  BLAS  - Time elapsed:  9.304E+00 sec.
{{Term|location=graphite|cmd=MKL_MIC_ENABLE=1 ./matmatmul_mkl 10000}}
  Multiplying Matrices: C(10000x10000) = A(10000x10000) x B(10000x10000)
  BLAS  - Time elapsed:  6.157E+00 sec.


{{Pages|HPC}}
{{Pages|HPC}}

Latest revision as of 08:46, 29 June 2023

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Introduction

This tutorial presents how to use GPU Accelerators. You will learn to reserve these resources, set up the environment and execute code on the accelerators. Please note that this page is not about GPU programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is especially useful for testing the latest version of the accelerator software stack (such as the NVIDIA CUDA libraries).

In this tutorial, we provide code examples that use the Level-3 BLAS function DGEMM to compute the product of two matrices. BLAS libraries are available for a variety of computer architectures (including multicores and accelerators) and this code example is used in this tutorial as a toy benchmark to compare the performance of accelerators and/or available BLAS libraries.

For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the Getting Started tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment). The Hardware page is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see Status).

Note that the Intel Xeon Phi KNC (MIC) cards available in Nancy are no longer supported (documentation remains available).

Nvidia GPU on Grid'5000

Note that NVIDIA drivers (see nvidia-smi) and CUDA (nvcc --version) compilation tools are installed by default on nodes.

Choosing a GPU

Have a look at the per-site detailed hardware pages (for instance, at Lyon), where you will find useful information about GPUs:

  • the card model name (see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units to know more about each model)
  • the GPU memory size available for computations
  • for NVidia GPUs, their compute capability
  • the hosting node characteristics (number of CPUs, amount of memory available, number of GPUs, reservable local disk availability, ...)
  • the job access conditions (i.e. default or production queue, max walltime partition for clusters in the production queues)

About NVidia and CUDA compatibility with older GPUs

Most of the GPUs available in Grid'5000 are supported by the Nvidia driver and CUDA version delivered in Grid'5000 environments. As of October 2021, there are two exceptions:

  • K40m GPUs available in the grimani cluster in Nancy require the nvcc option --gpu-architecture=sm_35 (35 for compute capability 3.5) to be used with CUDA starting from version 11, which is the version shipped with our debian11 environment.
  • M2075 GPUs (compute capability 2.0) of the orion cluster in Lyon are not supported by the driver shipped in our environments. GPUs in this cluster are no longer usable from our environments and the gpu property used to select a GPU node using oarsub (see below) is disabled. Note that it is still possible to build an environment with a custom driver to use these cards.

See https://en.wikipedia.org/wiki/CUDA#GPUs_supported to know more about the relationship between Cuda versions and compute capability.
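The relationship between a compute capability and the nvcc architecture option mentioned above (3.5 gives sm_35) is a simple text transformation. The sketch below illustrates it; the function name arch_flag is hypothetical and it does not check whether the installed nvcc actually supports the resulting architecture.

```shell
#!/bin/sh
# Hypothetical helper: turn a compute capability such as "3.5" into the
# matching nvcc option by dropping the dot. It does not verify that the
# installed CUDA toolkit supports that architecture.
arch_flag() {
    digits=$(printf '%s' "$1" | tr -d '.')
    printf '%s%s\n' '--gpu-architecture=sm_' "$digits"
}

# K40m GPUs of the grimani cluster have compute capability 3.5.
arch_flag 3.5
```

The result could then be passed to the compiler, for instance as nvcc $(arch_flag 3.5) followed by the source file.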

Reserving GPUs

Single GPU

If you only need a single GPU in the standard environment, reservation is as simple as:

Terminal.png frontend:
oarsub -I -l "gpu=1"
Note.png Note

On a multi-GPU node, this will give you only part of the memory and CPU resources. For instance, on a dual-GPU node, reserving a single GPU will give you access to half of the system memory and half of the CPU cores. This ensures that another user can reserve the other GPU and still have access to enough system memory and CPU cores.

In Nancy, you have to use the production queue for most of the GPU clusters, for instance:

Terminal.png frontend:
oarsub -I -q production -l "gpu=1"

If you require several GPUs for the same experiment (e.g. for inter-GPU communication or to distribute computation), you can reserve multiple GPUs of a single node:

Terminal.png frontend:
oarsub -I -l host=1/gpu=2
Note.png Note

When you run nvidia-smi, you will only see the GPU(s) you reserved, even if the node has more GPUs. This is the expected behaviour.
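To confirm that the number of visible GPUs matches the reservation, the device list printed by nvidia-smi -L can be counted. This is a sketch only: the function name count_gpus is hypothetical, and the sample lines below stand in for real output (model names and UUIDs are made up).

```shell
#!/bin/sh
# Hypothetical helper: count the GPUs listed on standard input, where
# the input has the shape of `nvidia-smi -L` output (one "GPU N: ..."
# line per visible device).
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Sample text standing in for real output; model names and UUIDs are
# made up for illustration.
printf '%s\n' \
    'GPU 0: Example GPU Model (UUID: GPU-00000000-0000-0000-0000-000000000000)' \
    'GPU 1: Example GPU Model (UUID: GPU-11111111-1111-1111-1111-111111111111)' \
    | count_gpus
```

On a reserved node, nvidia-smi -L | count_gpus should report the number of GPUs requested with -l gpu=N, not the total number of GPUs installed in the node.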

To select a specific model of GPU, there are two possibilities:

use gpu model aliases, as described in OAR Syntax simplification#GPUs, e.g.

Terminal.png frontend:
oarsub -I -l gpu=1 -p gpu_alias

use the "gpu_model" property, e.g.

Terminal.png frontend:
oarsub -I -l gpu=1 -p "gpu_model = 'GPU model'"

The exact list of GPU models is available on the OAR properties page, and you can use the Hardware page to get an overview of the GPUs available on each site.
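Model names often contain spaces, so the property expression needs both the SQL-style single quotes and shell quoting around the whole expression. The sketch below builds the expression from a model name; the function name gpu_model_property is hypothetical and 'GPU model' remains the placeholder used above, not a real model name.

```shell
#!/bin/sh
# Hypothetical helper: print the oarsub property expression selecting
# one GPU model. The model name is passed as a single argument so that
# names containing spaces survive intact.
gpu_model_property() {
    printf "gpu_model = '%s'\n" "$1"
}

# 'GPU model' is the placeholder used above, not a real model name.
gpu_model_property 'GPU model'
```

The output can be passed as a single argument to oarsub, for example: oarsub -I -l gpu=1 -p "$(gpu_model_property 'GPU model')".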

Reserving full nodes with GPUs

In some cases, you may want to reserve a complete node with all its GPUs. This allows you to customize the software environment with Sudo-g5k or even to deploy another operating system.

To make sure you obtain a node with a GPU, you can use the "gpu_count" property:

Terminal.png frontend:
oarsub -I -p "gpu_count > 0"

In Nancy, you have to use the production queue for most GPU clusters:

Terminal.png nancy:
oarsub -I -q production -p "gpu_count > 0"

To select a specific model of GPU, you can also use the "gpu_model" property, e.g.

Terminal.png frontend:
oarsub -I -p "gpu_model = 'GPU model'"

If you want to deploy an environment on the node, you should add the -t deploy option.

Note about AMD GPU

As of October 2021, AMD GPUs are available in a single Grid'5000 cluster, neowise, in Lyon. The oarsub commands shown above may give you either NVIDIA or AMD GPUs. The gpu_model property can be used to filter between GPU vendors. For instance:

Terminal.png frontend:
oarsub -I -p "gpu_count > 0 AND gpu_model NOT LIKE 'Radeon%'"

will filter out Radeon GPUs (=AMD GPUs). See below for more information about AMD GPUs.

GPU usage tutorial

In this section, we will give an example of GPU usage under Grid'5000.

Every step of this tutorial must be performed on an NVIDIA GPU node.

Run the CUDA Toolkit examples

In this part, we are going to compile and execute the CUDA examples provided by NVIDIA, using the CUDA Toolkit available in the default (standard) environment.

First, we retrieve the version of CUDA installed on the node:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

Version is 11.2. We are going to download the corresponding CUDA samples.

cd /tmp
git clone --depth 1 --branch v11.2 https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
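
The version lookup and the clone above can be combined so that the sample tag follows the installed toolkit. The sketch below assumes that the nvcc --version output contains a "release X.Y" field, as in the listing above, and that the samples repository has a matching vX.Y tag (true for v11.2 here, but not guaranteed for every release). cuda_release is a helper name introduced for illustration.

```shell
# Print the CUDA release (e.g. "11.2") found in `nvcc --version` output
# read on standard input.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# On a node:
#   version=$(nvcc --version | cuda_release)
#   git clone --depth 1 --branch "v$version" https://github.com/NVIDIA/cuda-samples.git
```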

You can compile all the examples at once by running make:

make -j8

The compilation of all the examples is over when "Finished building CUDA samples" is printed.

Each example is available in its own directory, under the Samples root directory (it can also be compiled separately from there).

You can first try the Device Query example located in Samples/deviceQuery/. It enumerates the properties of the CUDA devices present in the system.

/tmp/cuda-samples/Samples/deviceQuery/deviceQuery

Here is an example of the result on the chifflet cluster at Lille:

/tmp/cuda-samples/Samples/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : No
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.2, NumDevs = 2
Result = PASS
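
When only a few fields matter, the output can be filtered. As a small sketch, the helper below (device_summary, a name of our own) keeps the device names and their compute capability, relying on the line formats shown in the listing above.

```shell
# Read deviceQuery output on standard input and print one
# "name: compute capability" line per detected device.
device_summary() {
    awk '
        # Device header line: the name is the text between double quotes.
        /^Device [0-9]+:/ { split($0, q, "\""); name = q[2] }
        # Capability line: the version is the last field.
        /CUDA Capability Major\/Minor version number:/ { print name ": " $NF }
    '
}

# On a node:
#   /tmp/cuda-samples/Samples/deviceQuery/deviceQuery | device_summary
```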

BLAS examples

We now run our BLAS example to illustrate GPU performance for dense matrix multiply.

The toolkit provides the CUBLAS library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available here and several advanced examples using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...).

The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides NVBLAS, a library that automatically offloads compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program, as NVBLAS can be forcibly linked using the LD_PRELOAD environment variable.

To test NVBLAS, you can download and compile our matrix-matrix multiplication example:

Terminal.png node:
gcc -O3 -Wall -std=c99 matmatmul.c -o matmatmul -lblas

You can first check the performance of the BLAS library on the CPU. For small matrix sizes (<5000), the provided example compares the BLAS implementation to a naive jki-loop version of the matrix multiplication:

Terminal.png node:
./matmatmul 2000
 Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000)
 BLAS  - Time elapsed:  1.724E+00 sec.
 J,K,I - Time elapsed:  7.233E+00 sec.

To offload the BLAS computation to the GPU, use:

Terminal.png node:
echo "NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libblas.so" > nvblas.conf
Terminal.png node:
LD_PRELOAD=libnvblas.so ./matmatmul 2000
 [NVBLAS] Config parsed
 Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000)
 BLAS  - Time elapsed:  1.249E-01 sec.

Depending on the node hardware, the GPU might perform better on larger problems:

Terminal.png node:
./matmatmul 5000
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  2.673E+01 sec.
Terminal.png node:
LD_PRELOAD=libnvblas.so ./matmatmul 5000
 [NVBLAS] Config parsed
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  1.718E+00 sec.
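
To compare these timings across problem sizes, it can help to convert them to a rate. A dense n x n matrix product takes about 2n^3 floating-point operations, so the sketch below derives GFLOP/s and the GPU speedup from the elapsed times printed above. The helper names gflops and speedup are ours, not part of the example program.

```shell
# GFLOP/s of an n x n matrix product that took t seconds (about 2*n^3 operations).
# usage: gflops N SECONDS
gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 2 * n * n * n / t / 1e9 }'
}

# Ratio between two elapsed times (CPU time first, GPU time second).
speedup() {
    awk -v cpu="$1" -v gpu="$2" 'BEGIN { printf "%.1f\n", cpu / gpu }'
}

gflops 2000 1.724E+00          # CPU BLAS run above: prints 9.28
gflops 2000 1.249E-01          # NVBLAS run above: prints 128.10
speedup 1.724E+00 1.249E-01    # prints 13.8
```

These figures only restate the sample timings shown on this page; actual values depend on the node and on the BLAS library installed.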

If you want to measure the time spent on data transfers to the GPU, have a look at the simpleCUBLAS example (/tmp/cuda-samples/Samples/simpleCUBLAS) and instrument the code with timers.

Custom CUDA version or Nvidia drivers

Here, we explain how to use other CUDA versions with Modules, use Nvidia Docker images and install the NVIDIA drivers and compilers before validating the installation on the previous example set.

Older or newer CUDA version using modules

Different CUDA versions can be loaded using the "module" command. You should first choose the CUDA toolkit version to load:

Terminal.png node:
module av cuda

------------- /grid5000/spack/v1/share/spack/modules/linux-debian11-x86_64_v2 ----------------
   cuda/11.4.0_gcc-10.4.0    cuda/11.6.2_gcc-10.4.0    cuda/11.7.1_gcc-10.4.0 (D)

Terminal.png node:
module load cuda/11.6.2_gcc-10.4.0
Terminal.png node:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

You should consult CUDA Toolkit and Compatible Driver Versions to check the compatibility between a specific CUDA version and the NVIDIA GPU driver (for instance, the CUDA 11.x toolkit requires a driver version >= 450.80.02).
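
This check can be scripted. The sketch below compares an installed driver version with a required minimum using the version sort of GNU sort (-V); version_ge is a helper name of our own, and the minimum used in the example is the CUDA 11.x value quoted above.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# On a node, the installed driver version can be queried with nvidia-smi:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_ge "$driver" 450.80.02 && echo "driver is recent enough for CUDA 11.x"
```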

Copy and compile the sample examples

You now have everything installed. For instance, you can compile and run the toolkit examples (see the "Run the CUDA Toolkit examples" section above for more information).

You will need to override the CUDA path variable, and also load the matching compiler version from modules:

Terminal.png node:
which nvcc
/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e/bin/nvcc
Terminal.png node:
export CUDA_PATH=/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e
Terminal.png node:
module load gcc/10.4.0_gcc-10.4.0
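
Rather than copying the long installation path by hand, the CUDA root can be derived from the location of nvcc, which sits in the bin directory of the toolkit as the which nvcc output above shows. A minimal sketch, where cuda_root is a helper name introduced here:

```shell
# Print the CUDA installation root for a given nvcc path,
# i.e. strip the trailing "/bin/nvcc".
cuda_root() {
    dirname "$(dirname "$1")"
}

# On a node, after loading the cuda module:
#   export CUDA_PATH=$(cuda_root "$(command -v nvcc)")
```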

And then you can build and run the examples:

Terminal.png node:
git clone --depth 1 --branch v11.6 https://github.com/NVIDIA/cuda-samples.git /tmp/cuda-samples
Terminal.png node:
cd /tmp/cuda-samples
Terminal.png node:
make -j32
Terminal.png node:
./Samples/0_Introduction/matrixMul/matrixMul
Note.png Note

Please note that with some old GPUs you might encounter errors when running the latest version of CUDA. This is the case with the orion cluster, for example.

Nvidia-docker

A script to install nvidia-docker is available if you want to use NVIDIA's images built for Docker on GPU nodes. This provides an alternative way of making CUDA and NVIDIA libraries available to the node. See Nvidia Docker page.

Custom Nvidia driver using deployment

A custom NVIDIA driver may be installed on a node if needed. As root privileges are required, we will use kadeploy to deploy a debian11-x64-nfs environment on the GPU node you reserved.

This environment allows you to connect either as root (to be able to install new software) or using your normal Grid'5000 user account (including access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:

Terminal.png frontend:
oarsub -I -t deploy -p "gpu_count > 0" -l /nodes=1,walltime=2
Terminal.png frontend:
kadeploy3 -f $OAR_NODE_FILE -e debian11-x64-nfs -k

Once the deployment is complete, you should be able to connect to the node as root:

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE`

You can then perform the NVIDIA driver installation:

Terminal.png node:
apt-get -y install linux-headers-amd64 make g++
Terminal.png node:
rmmod nouveau
Terminal.png node:
sh NVIDIA-Linux-x86_64-470.82.01.run -s --no-install-compat32-libs

(warnings about X.Org can safely be ignored)

On the node you can check which NVIDIA drivers are installed with the nvidia-smi tool:

Terminal.png node:
nvidia-smi

Here is an example of the result on the graphique cluster:

root@graphique-4:~# nvidia-smi
Tue Jun 27 19:37:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 26%   28C    P0    46W / 180W |      0MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:82:00.0 Off |                  N/A |
| 28%   27C    P0    43W / 180W |      0MiB /  4043MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
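
To confirm from a script that the expected driver is the one loaded, the version can be pulled out of the banner line. The sketch below assumes the "Driver Version: X.Y.Z" field shown in the listing above; driver_version is a hypothetical helper name.

```shell
# Print the driver version found in the nvidia-smi banner read on standard input.
driver_version() {
    sed -n 's/.*Driver Version: \([0-9][0-9.]*\).*/\1/p'
}

# On the node:
#   nvidia-smi | driver_version
```

On the graphique node shown above, this would print the version of the installer that was run (470.82.01).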

If you want to record your environment with the custom NVidia driver, see Advanced_Kadeploy#Create_a_new_environment_from_a_customized_environment

AMD GPU on Grid'5000

As of October 2021, Grid'5000 has one cluster with AMD GPUs: the neowise cluster in Lyon.

A neowise GPU may be reserved using:

Terminal.png flyon:
oarsub -t exotic -p neowise -l gpu=1 -I

A full neowise node may be reserved using:

Terminal.png flyon:
oarsub -t exotic -p neowise -I

The default environment on neowise includes part of AMD's ROCm stack, with the AMD GPU driver and basic tools and libraries such as:

  • rocm-smi : get information about GPUs
  • hipcc : HIP compiler
  • hipify-perl : CUDA to HIP code converter

In addition, most libraries and development tools from ROCm and HIP (available at https://rocmdocs.amd.com/en/latest/Installation_Guide/Software-Stack-for-AMD-GPU.html) are available as modules. The deep learning frameworks PyTorch and TensorFlow are also known to work.