GPUs on Grid5000: Difference between revisions

From Grid5000
Line 22: Line 22:
= GPU accelerators on Grid'5000 =
= GPU accelerators on Grid'5000 =


In this section, we first reserve a GPU node. We then compile and execute examples provided by the CUDA Toolkit on the default (production) environment. We also run our BLAS example to illustrate GPU performance for dense matrix multiply. Finally, we deploy a jessie-x64-base environment and install the NVIDIA drivers and compilers before validating the installation on the previous example set.  
In this section, we first reserve a GPU node. We then compile and execute examples provided by the CUDA Toolkit on the default (production) environment. We also run our BLAS example to illustrate GPU performance for dense matrix multiply. Finally, we explain how to use the latest CUDA version with the module command and how to install the NVIDIA drivers and compilers, before validating the installation on the previous example set.


== A note on NVIDIA drivers and CUDA support ==
== A note on NVIDIA drivers and CUDA support ==


NVIDIA drivers (see <code class="command">nvidia-smi</code>) and CUDA 9.0 (<code class="command">nvcc --version</code>) compilation tools are installed by default on nodes. NVIDIA does not support older generations of GPUs, but this environment works with the GPU accelerators available on Grid'5000, as those currently installed on chifflet (Lille) and graphique (in the [[Nancy:Production|production queue]] of Nancy) are quite recent.
NVIDIA drivers (see <code class="command">nvidia-smi</code>) and CUDA 9.0 (<code class="command">nvcc --version</code>) compilation tools are installed by default on nodes. NVIDIA does not support older generations of GPUs, but this environment works with the GPU accelerators available on Grid'5000, as those currently installed on chifflet (Lille) and graphique (in the [[Nancy:Production|production queue]] of Nancy) are quite recent. See this [http://www.nvidia.com/Download/index.aspx?lang=en-us page] to find which versions fit your GPU.
 
The following table summarizes the situation on Grid'5000 as of April 2019:
 
{| class="program" style="border:1px dotted black;"
! Site
! Cluster
! GPU
! OAR properties
! Latest Driver version
! Latest CUDA version
|-
| Lyon
| orion (4 nodes)
| Nvidia Tesla-M2075 (1 per node)
| -p "GPU!='NO'"
| 396.44
| 9.0
|-
| Nancy
| graphique (6 nodes)
| Nvidia Titan Black GPU (2 on graphique-1),<br />
Nvidia GTX 980 GPU (2 per node)
| -p "GPU!='NO'" -q production
| 396.44
| 9.0
|-
| Nancy
| grele (14 nodes)
| Nvidia GTX 1080Ti (2 per node)
| -p "GPU!='NO'" -q production
| 396.44
| 9.0
|-
| Nancy
| grimani (6 nodes)
| Nvidia K40m GPU (2 per node)
| -p "GPU!='NO'" -q production
| 396.44
| 9.0
|-
| Lille
| chifflet (8 nodes)
| Nvidia GTX 1080Ti (2 per node)
| -p "GPU!='NO'"
| 396.44
| 9.0
|-
| Lille
| chifflot (8 nodes)
| Nvidia Tesla P100 (2 on chifflot-[1-6]),<br />
Nvidia Tesla V100 (2 on chifflot-[7-8])
| -p "GPU!='NO'"
| 396.44
| 9.0
|}
 


== Selection of GPU nodes ==
== Selection of GPU nodes ==
Line 177: Line 233:
If you want to measure the time spent on data transfers to the GPU, you can use the simpleCUBLAS (<code class="file">/tmp/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS</code>) example and instrument the code with timers.
If you want to measure the time spent on data transfers to the GPU, you can use the simpleCUBLAS (<code class="file">/tmp/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS</code>) example and instrument the code with timers.


== Use the latest CUDA toolkit version ==
=== Deployment ===
First, reserve a GPU node and deploy the <code class="file">debian9-x64-nfs</code> environment. This environment allows you to connect either as ''root'' (to be able to install new software) or with your normal Grid'5000 user account (with access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "GPU!='<code class="replace">NO</code>'" -l /nodes=1,walltime=2}}
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODE_FILE -e debian9-x64-nfs -k}}
Once the deployment has finished, you should be able to connect to the node as root:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE`}}
=== Install CUDA toolkit and drivers ===
Choose the CUDA toolkit version you want and load it with the <code class="command">module</code> command:
{{Term|location=node|cmd= module av}}
<pre>
----------------------- /grid5000/spack/share/spack/modules/linux-debian9-x86_64 -----------------------
cuda/10.0.130_gcc-6.4.0                        likwid/4.3.0_gcc-6.4.0
cuda/7.5.18_gcc-6.4.0                          likwid/4.3.2_gcc-6.4.0
cuda/8.0.61_gcc-6.4.0                          llvm/7.0.1_gcc-6.4.0
cuda/9.0.176_gcc-6.4.0                        memkind/1.7.0_gcc-6.4.0
cuda/9.1.85_gcc-6.4.0                          miniconda2/4.5.11_gcc-6.4.0
cuda/9.2.88_gcc-6.4.0                          miniconda3/4.5.11_gcc-6.4.0
...
</pre>
{{Term|location=node|cmd= module load cuda/10.0.130_gcc-6.4.0}}
{{Term|location=node|cmd= nvcc --version}}
<pre>
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
</pre>
Next, consult the [https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver CUDA Toolkit and Compatible Driver Versions] table and download the corresponding driver installer (CUDA 10.0.x toolkit: driver version >= 410.48).
{{Term|location=node|cmd=<code class="command">wget https://download.nvidia.com/XFree86/Linux-x86_64/410.57/NVIDIA-Linux-x86_64-410.57.run</code>}}
On the node you can check which NVIDIA drivers are installed with the <code class="command">nvidia-smi</code> tool:
{{Term|location=node|cmd=<code class="command">nvidia-smi</code>}}
Here is an example of the result on the chifflet cluster:
<pre>
chifflet-8:~$ nvidia-smi
Tue Apr  9 15:11:28 2019     
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|  0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%  21C    P8    9W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  1  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 23%  20C    P8    8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                             
+-----------------------------------------------------------------------------+
| Processes:                                                      GPU Memory |
|  GPU      PID  Type  Process name                            Usage      |
|=============================================================================|
|  No running processes found                                                |
+-----------------------------------------------------------------------------+
</pre>
<!--
=== Installing the CUDA toolkit on a deployed environment ===
=== Installing the CUDA toolkit on a deployed environment ===


GPU nodes at Lyon and Nancy are supported by the latest GPU drivers. See this [http://www.nvidia.com/Download/index.aspx?lang=en-us page] to find which versions fit your GPU. The following table summarizes the situation on Grid'5000 as of March 2017:
GPU nodes at Lyon and Nancy are supported by the latest GPU drivers. See this [http://www.nvidia.com/Download/index.aspx?lang=en-us page] to find which versions fit your GPU. The following table summarizes the situation on Grid'5000 as of April 2019:


{| class="program" style="border:1px dotted black;"
{| class="program" style="border:1px dotted black;"
Line 193: Line 327:
| Nvidia Tesla-M2075 (1 per node)
| Nvidia Tesla-M2075 (1 per node)
| -p "GPU!='NO'"
| -p "GPU!='NO'"
| 384.xx
| 396.44
| 9.0
| 9.0
|-
|-
Line 201: Line 335:
Nvidia GTX 980 GPU (2 per node)
Nvidia GTX 980 GPU (2 per node)
| -p "GPU!='NO'" -q production
| -p "GPU!='NO'" -q production
| 384.xx
| 396.44
| 9.0
| 9.0
|-
|-
Line 208: Line 342:
| Nvidia K40m GPU (2 per node)
| Nvidia K40m GPU (2 per node)
| -p "GPU!='NO'" -q production
| -p "GPU!='NO'" -q production
| 384.xx
| 396.44
| 9.0
| 9.0
|-
|-
Line 215: Line 349:
| Nvidia GTX 1080Ti (2 per node)
| Nvidia GTX 1080Ti (2 per node)
| -p "GPU!='NO'"
| -p "GPU!='NO'"
| 384.xx
| 396.44
| 9.0
| 9.0
|}
|}
Line 316: Line 450:
{{Note|text=Please note that with some old GPUs you might encounter errors when running the latest version of CUDA. This is the case with the orion cluster, for example.}}
{{Note|text=Please note that with some old GPUs you might encounter errors when running the latest version of CUDA. This is the case with the orion cluster, for example.}}
{{Warning|text= Please note that /tmp is erased when the node boots. If you want to use tgz-g5k to save your CUDA installation, make sure to copy everything you need from /tmp to another directory, or you will not be able to retrieve it after deploying your custom environment.}}
{{Warning|text= Please note that /tmp is erased when the node boots. If you want to use tgz-g5k to save your CUDA installation, make sure to copy everything you need from /tmp to another directory, or you will not be able to retrieve it after deploying your custom environment.}}
-->


= Intel Xeon Phi (MIC) on Grid'5000 =
= Intel Xeon Phi (MIC) on Grid'5000 =

Revision as of 14:49, 9 April 2019

Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Introduction

This tutorial presents how to use GPU accelerators and Intel Xeon Phi coprocessors on Grid'5000. You will learn to reserve these resources, set up the environment and run code on the accelerators. Please note that this page is not about GPU or Xeon Phi programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is especially useful for testing the latest version of the accelerator software stack (such as the NVIDIA CUDA libraries or the Intel Manycore Platform Software Stack (MPSS)).

In this tutorial, we provide code examples that use the Level-3 BLAS function DGEMM to compute the product of two matrices. BLAS libraries are available for a variety of computer architectures (including multicores and accelerators), and this code example is used in this tutorial as a toy benchmark to compare the performance of accelerators and/or available BLAS libraries.

This tutorial is divided into two distinct parts that can be done in any order: GPU accelerators on Grid'5000, and Intel Xeon Phi (MIC) on Grid'5000.

For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the Getting Started tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment). The Hardware page is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see Status).

GPU accelerators on Grid'5000

In this section, we first reserve a GPU node. We then compile and execute examples provided by the CUDA Toolkit on the default (production) environment. We also run our BLAS example to illustrate GPU performance for dense matrix multiply. Finally, we explain how to use the latest CUDA version with the module command and how to install the NVIDIA drivers and compilers, before validating the installation on the previous example set.

A note on NVIDIA drivers and CUDA support

NVIDIA drivers (see nvidia-smi) and CUDA 9.0 (nvcc --version) compilation tools are installed by default on nodes. NVIDIA does not support older generations of GPUs, but this environment works with the GPU accelerators available on Grid'5000, as those currently installed on chifflet (Lille) and graphique (in the production queue of Nancy) are quite recent. See this page to find which versions fit your GPU.

The following table summarizes the situation on Grid'5000 as of April 2019:

{| class="program" style="border:1px dotted black;"
! Site
! Cluster
! GPU
! OAR properties
! Latest Driver version
! Latest CUDA version
|-
| Lyon
| orion (4 nodes)
| Nvidia Tesla-M2075 (1 per node)
| -p "GPU!='NO'"
| 396.44
| 9.0
|-
| Nancy
| graphique (6 nodes)
| Nvidia Titan Black GPU (2 on graphique-1),<br />
Nvidia GTX 980 GPU (2 per node)
| -p "GPU!='NO'" -q production
| 396.44
| 9.0
|-
| Nancy
| grele (14 nodes)
| Nvidia GTX 1080Ti (2 per node)
| -p "GPU!='NO'" -q production
| 396.44
| 9.0
|-
| Nancy
| grimani (6 nodes)
| Nvidia K40m GPU (2 per node)
| -p "GPU!='NO'" -q production
| 396.44
| 9.0
|-
| Lille
| chifflet (8 nodes)
| Nvidia GTX 1080Ti (2 per node)
| -p "GPU!='NO'"
| 396.44
| 9.0
|-
| Lille
| chifflot (8 nodes)
| Nvidia Tesla P100 (2 on chifflot-[1-6]),<br />
Nvidia Tesla V100 (2 on chifflot-[7-8])
| -p "GPU!='NO'"
| 396.44
| 9.0
|}


Selection of GPU nodes

You can reserve a GPU node by simply requesting resources with the OAR "GPU" property:

Terminal.png frontend:
oarsub -I -p "GPU!='NO'"

At Nancy, you have to use the production queue to get resources from graphique. The exact usage policy for this machine is still to be determined.

Terminal.png nancy:
oarsub -I -p "GPU!='NO'" -q production
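
The two reservation commands above differ only by the queue option. As a minimal sketch (the helper name gpu_oarsub_cmd and the lowercase site argument are assumptions made for illustration, not part of the Grid'5000 tooling), the choice can be written as a small shell function that prints the command to run on the frontend:

```shell
# Hypothetical helper: print the oarsub command suited to a given site.
# Only the options shown in this tutorial are used.
gpu_oarsub_cmd() {
    site=$1
    cmd="oarsub -I -p \"GPU!='NO'\""
    case "$site" in
        # At Nancy, GPU nodes are reached through the production queue.
        nancy) cmd="$cmd -q production" ;;
    esac
    printf '%s\n' "$cmd"
}

gpu_oarsub_cmd nancy
```

The function only prints the command, so you can review it before pasting it on the frontend.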

Copy the CUDA Toolkit examples

We copy the CUDA 9.0 samples to /tmp/samples:

Terminal.png node:
cp -r /usr/local/cuda/samples /tmp


Note

These samples are part of the CUDA 9.0 Toolkit and can also be extracted from the toolkit installer using the --extract=/path option.

The CUDA examples are described in /tmp/samples/Samples.html. You might also want to have a look at the doc/ directory or the online documentation.

Compiling the CUDA Toolkit examples

You can compile all the examples at once. From the CUDA samples source directory (/tmp/samples), run make to compile examples:

Terminal.png node:
cd /tmp/samples
Terminal.png node:
make -j8

The compilation of all the examples is over when "Finished building CUDA samples" is printed. Each example can also be compiled separately from its own directory.

You can first try the Device Query example located in /tmp/samples/1_Utilities/deviceQuery/. It enumerates the properties of the CUDA devices present in the system.

Terminal.png node:
/tmp/samples/1_Utilities/deviceQuery/deviceQuery

Here is an example of the result on the graphique cluster at Nancy:

/tmp/samples/1_Utilities/deviceQuery/deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6082 MBytes (6377766912 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            980 MHz (0.98 GHz)
  Memory Clock rate:                             3500 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6082 MBytes (6377766912 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            980 MHz (0.98 GHz)
  Memory Clock rate:                             3500 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX TITAN Black (GPU0) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX TITAN Black (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS
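
The deviceQuery output is long. When you only need the GPU model and its compute capability, it can be filtered. The following is a minimal sketch that assumes the output layout shown above; the function name summarize_device_query is hypothetical:

```shell
# Hypothetical helper: read deviceQuery output on stdin and print one line
# per device with its name and compute capability.
summarize_device_query() {
    awk '
        # Remember the device name found between the double quotes.
        /^Device [0-9]+:/ {
            name = $0
            sub(/^Device [0-9]+: "/, "", name)
            sub(/"$/, "", name)
        }
        # The capability is the last field of this line.
        /CUDA Capability Major\/Minor version number:/ {
            print name " (compute capability " $NF ")"
        }
    '
}
```

On a node, it could be used as: /tmp/samples/1_Utilities/deviceQuery/deviceQuery | summarize_device_query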


BLAS examples

The toolkit provides the CUBLAS library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available here and several advanced examples using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...).

The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides NVBLAS, a library that automatically offloads compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program, as NVBLAS can be forcibly linked using the LD_PRELOAD environment variable.

To test NVBLAS, you can download and compile our matrix-matrix multiplication example:

Terminal.png node:
gcc -O3 -Wall -std=c99 matmatmul.c -o matmatmul -lblas

You can first check the performance of the BLAS library on the CPU. For small matrix sizes (<5000), the provided example compares the BLAS implementation to a naive jki-loop version of the matrix multiplication:

Terminal.png node:
./matmatmul 2000
 Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000)
 BLAS  - Time elapsed:  1.724E+00 sec.
 J,K,I - Time elapsed:  7.233E+00 sec.

To offload the BLAS computation to the GPU, use:

Terminal.png node:
echo "NVBLAS_CPU_BLAS_LIB /usr/lib/libblas/libblas.so" > nvblas.conf
Terminal.png node:
LD_PRELOAD=libnvblas.so ./matmatmul 2000
 [NVBLAS] Config parsed
 Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000)
 BLAS  - Time elapsed:  1.249E-01 sec.
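
The nvblas.conf file written above contains only the mandatory CPU BLAS setting. As a sketch of a slightly fuller configuration, the function below also selects all GPUs and names a log file. NVBLAS_GPU_LIST and NVBLAS_LOGFILE are keywords of the NVBLAS configuration file; the function name write_nvblas_conf, the default path and the log file name are assumptions made for illustration:

```shell
# Hypothetical helper: write an NVBLAS configuration file.
#   $1: path of the file to write (default: nvblas.conf in the current directory)
#   $2: CPU BLAS library used for the routines that are not offloaded
write_nvblas_conf() {
    conf=${1:-nvblas.conf}
    cpu_blas=${2:-/usr/lib/libblas/libblas.so}
    cat > "$conf" <<EOF
# CPU BLAS library used as a fallback (mandatory)
NVBLAS_CPU_BLAS_LIB $cpu_blas
# Use all the GPUs visible on the node
NVBLAS_GPU_LIST ALL
# File where NVBLAS writes its log messages
NVBLAS_LOGFILE nvblas.log
EOF
}
```

If the file is not in the current directory, its location can be given to NVBLAS through the NVBLAS_CONFIG_FILE environment variable.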

CPU/GPU comparisons become more meaningful with larger problems:

Terminal.png node:
./matmatmul 5000
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  2.673E+01 sec.
Terminal.png node:
LD_PRELOAD=libnvblas.so ./matmatmul 5000
 [NVBLAS] Config parsed
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  1.718E+00 sec.
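
To turn the timings above into a speedup figure, the elapsed times can be extracted and divided. This is a minimal sketch that assumes the "BLAS  - Time elapsed:" line format printed by matmatmul; the function names blas_time and speedup are hypothetical:

```shell
# Hypothetical helper: read matmatmul output on stdin and print the BLAS
# elapsed time (the field just before the trailing "sec.").
blas_time() {
    awk '/BLAS +- Time elapsed:/ { print $(NF-1); exit }'
}

# Hypothetical helper: print the ratio of two elapsed times (CPU time / GPU time).
speedup() {
    awk -v cpu="$1" -v gpu="$2" 'BEGIN { printf "%.1f\n", cpu / gpu }'
}

speedup 2.673E+01 1.718E+00   # prints 15.6 (timings of the 5000x5000 run above)
```

On a node, the two timings could be captured with, for example, cpu=$(./matmatmul 5000 | blas_time) and gpu=$(LD_PRELOAD=libnvblas.so ./matmatmul 5000 | blas_time).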

If you want to measure the time spent on data transfers to the GPU, you can use the simpleCUBLAS (/tmp/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS) example and instrument the code with timers.

Use the latest CUDA toolkit version

Deployment

First, reserve a GPU node and deploy the debian9-x64-nfs environment. This environment allows you to connect either as root (to be able to install new software) or with your normal Grid'5000 user account (with access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:

Terminal.png frontend:
oarsub -I -t deploy -p "GPU!='NO'" -l /nodes=1,walltime=2
Terminal.png frontend:
kadeploy3 -f $OAR_NODE_FILE -e debian9-x64-nfs -k

Once the deployment has finished, you should be able to connect to the node as root:

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE`

Install CUDA toolkit and drivers

Choose the CUDA toolkit version you want and load it with the module command:

Terminal.png node:
module av
----------------------- /grid5000/spack/share/spack/modules/linux-debian9-x86_64 -----------------------
cuda/10.0.130_gcc-6.4.0                        likwid/4.3.0_gcc-6.4.0
cuda/7.5.18_gcc-6.4.0                          likwid/4.3.2_gcc-6.4.0
cuda/8.0.61_gcc-6.4.0                          llvm/7.0.1_gcc-6.4.0
cuda/9.0.176_gcc-6.4.0                         memkind/1.7.0_gcc-6.4.0
cuda/9.1.85_gcc-6.4.0                          miniconda2/4.5.11_gcc-6.4.0
cuda/9.2.88_gcc-6.4.0                          miniconda3/4.5.11_gcc-6.4.0
...
Terminal.png node:
module load cuda/10.0.130_gcc-6.4.0
Terminal.png node:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

Next, consult the CUDA Toolkit and Compatible Driver Versions table and download the corresponding driver installer (CUDA 10.0.x toolkit: driver version >= 410.48).

On the node you can check which NVIDIA drivers are installed with the nvidia-smi tool:

Terminal.png node:
nvidia-smi

Here is an example of the result on the chifflet cluster:

chifflet-8:~$ nvidia-smi 
Tue Apr  9 15:11:28 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   21C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   20C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
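
In the output above, the installed driver (396.44) is older than the minimum required by the CUDA 10.0.x toolkit (410.48). As a sketch, this comparison can be scripted. The function names below are hypothetical, the parsing assumes the nvidia-smi banner layout shown above, and sort -V requires GNU coreutils:

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the driver version.
driver_version_from_smi() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed if version $1 is greater than or equal to version $2.
# sort -V orders version strings, so $2 must come first (or be equal).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if version_ge 396.44 410.48; then
    echo "driver is recent enough for CUDA 10.0"
else
    echo "driver is too old for CUDA 10.0"
fi
```

On a node, the installed version could be obtained with nvidia-smi | driver_version_from_smi, or directly with nvidia-smi --query-gpu=driver_version --format=csv,noheader.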





Intel Xeon Phi (MIC) on Grid'5000

Note

This tutorial concerns the Knights Corner generation of the MIC (Intel Many Integrated Core Architecture), which comes as coprocessor cards (PCIe). At the time of writing, Grid'5000 does not provide MICs of later generations (e.g. KNL).

Reserve a Xeon Phi at Nancy

Like NVIDIA GPUs, Xeon Phi coprocessor cards provide additional compute power and can be used to offload computations. As these extension cards run a modified Linux kernel, it is also possible to log in directly onto the Xeon Phi via ssh. It is also possible to compile an application for the Xeon Phi processor (which is based on x86 technology) and run it natively on the embedded Linux system of the Xeon Phi card.

Xeon Phi 7120P cards are available at Nancy.

Since the Xeon Phi cards currently available in Grid'5000 are already quite old, they are no longer supported by the default environment (OS) provided on nodes. As a result, you must deploy the previous version of the Grid'5000 environments (based on Debian 8 Jessie) on nodes in order to have the MIC software stack installed.

To reserve a Grid'5000 node that includes a Xeon Phi, you can use this command:

Terminal.png frontend:
oarsub -I -p "MIC='YES'" -t deploy

Then, nodes need to be deployed with the Jessie Big environment:

Terminal.png frontend:
kadeploy3 -e jessie-x64-big -f $OAR_NODE_FILE -k

You should then be able to ssh to the nodes and check the status of the MIC card using micinfo:

Terminal.png graphite:
micinfo
Warning

If you want to make sure no one else can ssh to your deployed node, connect to the node as root and run:

Terminal.png graphite:
echo '-:ALL EXCEPT your_g5k_username root:ALL' >> /etc/security/access.conf

Configuring the Intel compiler to use a license server

Intel compilers are the most appropriate compilers for the Intel Xeon Phi. Intel compilers are available in /grid5000/compilers/icc13.2/ at Nancy, but they require a commercial (or academic) license, and such licenses are not provided by Grid'5000. However, you may have access to a license server in your local laboratory.

For instance, if you are inside Inria's network (or if using the Inria VPN), you have access to the Inria license server (jetons.inria.fr). If so, you can use your workstation as a bridge between the license server of your local network and your Grid'5000 nodes by creating an SSH tunnel. This procedure is explained below.

Another option is to compile your programs on a machine where the Intel compiler is available, and then copy your executable binary (compiled code) to your Grid'5000 nodes (make sure the CPU architectures match, however).

Note that GCC and Clang also provide limited support for the newest Xeon Phi. See this page for more information.

Using a license server

In the following, we set up an SSH tunnel between a license server and a Grid'5000 node (graphite-X). The Intel compilers are configured to use localhost:28618 as the license server, and the SSH tunnel forwards connections from localhost:28618 to the license server (you can use any local port number for this). Below, we use the Inria license server named jetons.inria.fr, ports 29030 and 34430, and assume that your workstation is connected to Inria's network.

In your NFS home directory in Nancy, create a license configuration file for the Intel compilers:

Terminal.png frontend:
mkdir ~/intel
cat <<EOF >> ~/intel/licenses 
SERVER localhost ANY 28618
USE_SERVER
EOF

Then, start an SSH tunnel from your workstation (reminder: in this example your workstation must be connected to Inria network in order to have access to jetons.inria.fr):

Terminal.png workstation:
ssh -R 28618:jetons.inria.fr:29030 -R 34430:jetons.inria.fr:34430 graphite-X.nancy.g5k

The previous command opens a shell session that can be used directly. You should keep it open as long as you need the Intel compilers.

You can also add the tunnel setup to your configuration file (.ssh/config):

Host g5k
 Hostname access.grid5000.fr
 
 [...]
  
Host *.intel
 User your_g5k_username
 ForwardAgent no
 RemoteForward *:28618 jetons.inria.fr:29030
 RemoteForward *:34430 jetons.inria.fr:34430
 ProxyCommand ssh g5k -W "$(basename %h .intel):%p"

Then, to create the tunnel and connect to your node, you can simply use:

Terminal.png workstation:
ssh graphite-X.nancy.intel
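
The ProxyCommand line of the configuration above relies on basename to strip the .intel suffix from the host name typed on the command line (ssh substitutes %h with that name). As a small sketch of this step alone (the function name intel_alias_to_node is hypothetical):

```shell
# Hypothetical helper mirroring "$(basename %h .intel)" from the ProxyCommand:
# it turns the alias typed on the command line into the real node name.
intel_alias_to_node() {
    basename "$1" .intel
}

intel_alias_to_node graphite-X.nancy.intel   # prints graphite-X.nancy
```

The resulting name is the host that ssh g5k -W connects to, while the RemoteForward lines of the same Host block set up the license tunnel.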

To test the tunnel, you can do:

Terminal.png graphite:
source /opt/intel/composerxe/bin/compilervars.sh intel64
Terminal.png graphite:
icc -v

Using Intel compilers on Grid'5000 can be rather slow because of the license server.

Execution on Xeon Phi

Setup

Note

Since MICs are no longer supported in the standard environment, the setup is more complex and involves several root-level technical steps.

First, we have to make sure the MIC is able to access the Grid'5000 network. For that, we can set up NAT on the host. The following commands have to be run as root on the host.

Terminal.png graphite:
iptables -t nat -A POSTROUTING ! -o mic0 -j MASQUERADE
Terminal.png graphite:
echo 1 > /proc/sys/net/ipv4/ip_forward

Then, we have to make sure the MIC is booted. You may look at the output of the dmesg command, and if it is not booted, restart the MPSS service:

Terminal.png graphite:
service mpss restart
Warning

As of Mar. 19th 2018, there is a bug in /etc/hosts of the environments: a newline is missing (bug #9125), you must fix it.

Look again at the dmesg command output and wait for the following line:

 mic0: Transition from state booting to online

You should now be able to SSH to the MIC as root. Make sure NFS is mounted:

Terminal.png graphite-mic0:
mount -a

In order to connect to the Xeon Phi card with your Grid'5000 username using SSH, you have to set up your user. Run the following commands from the host as root:

Terminal.png graphite:
T=($(getent passwd your_g5k_username | tr ' ' _ | tr : ' '))

then

Terminal.png graphite:
micctrl --useradd=${T[0]} --uid=${T[2]} --gid=${T[3]} --sshkeys=${T[5]}/.ssh
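
The two commands above split the passwd entry into the array T, so that T[0] is the login, T[2] the UID, T[3] the GID and T[5] the home directory. Because the shell collapses consecutive spaces, an entry with an empty comment (GECOS) field would shift the last fields by one. The sketch below is an alternative that splits on ':' with read, which keeps empty fields in place; the function name micctrl_useradd_args and the sample entry in the comment are hypothetical:

```shell
# Hypothetical helper: read one passwd(5) entry on stdin and print the
# micctrl options used above. Example entry:
#   jdoe:x:1234:8000::/home/jdoe:/bin/bash
micctrl_useradd_args() {
    # Fields: login, password, UID, GID, comment (GECOS), home, shell.
    IFS=: read -r user _pw uid gid _gecos home _shell || return 1
    printf '%s\n' "--useradd=$user --uid=$uid --gid=$gid --sshkeys=$home/.ssh"
}
```

On the host, as root, it could be used as: micctrl $(getent passwd your_g5k_username | micctrl_useradd_args)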

Using the MIC

An introduction to the Xeon Phi programming environment is available on the Intel website. Other useful resources include:

Before using the Intel compilers or executing codes that dynamically link to Intel libraries, you have to set up your environment:

Terminal.png graphite:
source /opt/intel/composerxe/bin/compilervars.sh intel64

Offload mode

In offload mode, your program is executed on the host, but part of its execution is offloaded to the coprocessor card. Intel provides a code snippet in its tutorial that shows a sum reduction operation being run on a Xeon Phi processor. This example is available in the /grid5000/xeonphi/samples/ directory and can be compiled and executed as follows:

Terminal.png graphite:
icpc -openmp /grid5000/xeonphi/samples/reduction-offload.cpp -o reduction-offload
Terminal.png graphite:
./reduction-offload

Native mode

In native mode, your program is completely executed on the Xeon Phi. Your code must be compiled natively for the Xeon Phi architecture using the -mmic option:

Terminal.png graphite:
icpc -mmic /grid5000/xeonphi/samples/hello.cpp -o hello_mic

This program cannot be run on the host:

Terminal.png graphite:
./hello_mic
 -bash: ./hello_mic: cannot execute binary file: Exec format error

To run this program, you have to execute it directly on the MIC. You can SSH to the MIC and launch it there:

Terminal.png graphite:
ssh mic0
Terminal.png graphite-mic0:
source /grid5000/xeonphi/micenv
Terminal.png graphite-mic0:
./hello_mic
Note

Your home directory is available from the Xeon Phi (as well as the /grid5000 directory).

BLAS examples

Download our BLAS matrix-matrix multiplication example:

The Intel MKL (Intel Math Kernel Library) provides an implementation of the BLAS. Our BLAS example can be linked with the MKL:

Terminal.png graphite:
icpc matmatmul.c -o matmatmul_mkl_seq -DHAVE_MKL -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread

MKL also provides a threaded version of the BLAS that can be used as follows:

Terminal.png graphite:
icpc matmatmul.c -o matmatmul_mkl -DHAVE_MKL -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -fopenmp

More information on the compilation options can be found here.

You can compare the performance of the different flavors:

Terminal.png graphite:
gcc -O3 -Wall -std=c99 matmatmul.c -o matmatmul_defaultblas -lblas
Terminal.png graphite:
./matmatmul_defaultblas 5000
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  3.605E+01 sec.
Terminal.png graphite:
./matmatmul_mkl_seq 5000
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  1.676E+01 sec.
Terminal.png graphite:
./matmatmul_mkl 5000
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  1.855E+00 sec.
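
To compare these runs on a common scale, the elapsed times can be converted to a floating-point rate. The sketch below assumes the usual count of 2·N³ operations for an N×N matrix product; `gflops` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# gflops N SECONDS: rate of an N x N matrix product in GFLOP/s,
# counted as 2*N^3 floating-point operations.
gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 2 * n * n * n / t / 1e9 }'
}

gflops 5000 3.605E+01   # default BLAS   -> 6.9
gflops 5000 1.676E+01   # sequential MKL -> 14.9
gflops 5000 1.855E+00   # threaded MKL   -> 134.8
```

With the timings shown above, the threaded MKL build is roughly 19 times faster than the default BLAS.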

The MKL is also available natively on the Xeon Phi:

Terminal.png graphite:
icpc -mmic matmatmul.c -o matmatmul_mic -DHAVE_MKL -L${MKLROOT}/lib/mic -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -fopenmp
Terminal.png graphite:
ssh mic0
Terminal.png graphite-mic0:
source /grid5000/xeonphi/micenv
Terminal.png graphite-mic0:
source /grid5000/software/intel/mkl/bin/mklvars.sh mic
Terminal.png graphite-mic0:
./matmatmul_mic 5000
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  1.200E+00 sec.

MKL also provides an automatic offload mode, which is comparable to the NVBLAS library for NVIDIA GPUs:

Terminal.png graphite:
export MKL_MIC_ENABLE=1
Terminal.png graphite:
export OFFLOAD_REPORT=1
Terminal.png graphite:
export MKL_MIC_DISABLE_HOST_FALLBACK=1
Terminal.png graphite:
./matmatmul_mkl 5000
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - [MKL] [MIC --] [AO Function]        DGEMM
 [MKL] [MIC --] [AO DGEMM Workdivision]        0.35 0.65

 [MKL] [MIC 00] [AO DGEMM CPU Time]        3.050871 seconds
 [MKL] [MIC 00] [AO DGEMM MIC Time]        0.541124 seconds
 Time elapsed:  3.168E+00 sec.

The data transfer cost between the host and the accelerator is amortized for larger matrices:

Terminal.png graphite:
MKL_MIC_ENABLE=0 ./matmatmul_mkl 10000
 Multiplying Matrices: C(10000x10000) = A(10000x10000) x B(10000x10000)
 BLAS  - Time elapsed:  9.304E+00 sec.
Terminal.png graphite:
MKL_MIC_ENABLE=1 ./matmatmul_mkl 10000
 Multiplying Matrices: C(10000x10000) = A(10000x10000) x B(10000x10000)
 BLAS  - Time elapsed:  6.157E+00 sec.
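
The effect of the matrix size can be made explicit by computing the speedup of automatic offload over the host-only threaded run, using the timings reported on this page. `speedup` is a hypothetical helper name. A value below 1 means that offloading was slower.

```shell
#!/usr/bin/env bash
# speedup BASELINE_SECONDS NEW_SECONDS: ratio of the two elapsed times.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

speedup 1.855E+00 3.168E+00   # N = 5000:  host only vs. automatic offload -> 0.59
speedup 9.304E+00 6.157E+00   # N = 10000: host only vs. automatic offload -> 1.51
```

For N = 5000 the transfer cost dominates and offloading loses. For N = 10000 it pays off.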