GPUs on Grid5000

{{Maintainer|Simon Delamare}}
{{Author|Elodie Bertoncello}}
{{Author|Jérémie Gaidamour}}
{{Portal|User}}


= Introduction =


This tutorial presents how to use GPU accelerators and Intel Xeon Phi coprocessors on Grid'5000. You will learn to reserve these resources, set up the environment and execute code on the accelerators. The dense matrix-matrix multiplication example of this tutorial can be used as a toy benchmark to compare the performance of accelerators and/or [https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms BLAS] implementations. Please note that this page is not about GPU or Xeon Phi programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is useful for testing the latest version of a driver or for ensuring the reproducibility of your experiments (by freezing their context).


This tutorial is divided into two distinct parts that can be done in any order:
* [[#GPU accelerators on Grid'5000]]
* [[#Intel Xeon Phi (MIC) on Grid'5000]]


For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the [[Getting Started]] tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment).
[[Special:G5KHardware]] is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see [[Status]]).


= GPU accelerators on Grid'5000 =


In this section, we first reserve a GPU node. We then compile and execute examples provided by the CUDA Toolkit on the default (production) environment. We also run our [https://developer.nvidia.com/cublas CUBLAS] example to illustrate GPU performance for dense matrix multiplication (see also the [http://docs.nvidia.com/cuda/cuda-samples/index.html#matrix-multiplication--cublas- matrix multiplication CUBLAS sample]). Finally, we deploy a jessie-x64-base environment and install the NVIDIA drivers and compilers before validating the installation on the previous example set.


== Selection of GPU nodes ==


You can reserve a GPU node by simply requesting resources with the OAR "GPU" property:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "GPU='YES'"}}
 
At Lille (on chirloute), you have to use the GPU='SHARED' property instead:
{{Term|location=lille|cmd=<code class="command">oarsub</code> -I -p "GPU='SHARED'"}}
The reason is that those GPUs share enclosures by groups of four and can only be rebooted in groups. You may encounter some difficulties with those shared GPUs; if <code class="command">nvidia-smi</code> -q does not detect the GPU on your node, you can find troubleshooting information on [[Lille:GPU|this page]].


At Nancy, you have to use the production queue to get resources from graphique (and you also have to comply with the [[UserCharter|usage policy]] of production resources):
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "GPU='YES'" -q production}}


NVIDIA drivers 346.22 (see <code class="command">nvidia-smi</code>) and CUDA 7.0 (<code class="command">nvcc --version</code>) compilation tools are installed by default on the nodes. This version of the drivers only supports the most recent GPU accelerators, such as the GPUs installed on orion (Lyon) and graphique (in the [[Nancy:Production|production queue]] of Nancy).
You can use the GPU accelerators of adonis (Grenoble) and chirloute (Lille) with the NVIDIA 340.xx Legacy drivers on deployed environments. You can deploy the ready-to-use wheezy-x64-prod environment, or see [[#Installing the CUDA toolkit on a deployed environment]] for using those GPUs with Debian Jessie.


== Downloading the CUDA Toolkit examples ==


We download the CUDA 7.0 samples and extract them to <code class="file">/tmp/samples</code>:
{{Term|location=node|cmd=cd /tmp; <code class="command">wget</code> http://git.grid5000.fr/sources/cuda-samples-linux-7.0.28-19326674.run}}
{{Term|location=node|cmd=<code class="command">sh</code> cuda-samples-linux-7.0.28-19326674.run -noprompt -prefix=/tmp/samples}}


{{Note|text=These samples are part of the [https://developer.nvidia.com/cuda-toolkit-70 CUDA 7.0 Toolkit] and can also be extracted from the toolkit installer using the ''--extract=/path'' option.}}
{{Note|text= On adonis and chirloute, install the CUDA 5.0 samples (cuda-samples_5.0.35_linux.run).}}


The CUDA examples are described in <code class="file">/tmp/samples/Samples.html</code>. You might also want to have a look at the <code class="file">doc</code> directory or the [http://docs.nvidia.com/cuda/cuda-samples/index.html#getting-started-with-cuda-samples online documentation].


== Compiling the CUDA Toolkit examples ==


You can compile all the examples at once, but it will take a while. From the CUDA samples source directory (<code class="file">/tmp/samples</code>), run make to compile the examples:
{{Term|location=node|cmd=<code class="command">cd /tmp/samples</code>}}
{{Term|location=node|cmd=<code class="command">make -j8</code>}}
The compilation of all the examples is over when "Finished building CUDA samples" is printed. Alternatively, each example can also be compiled separately from its own directory.


You can first try the <code class="file">Device Query</code> example located in <code class="file">/tmp/samples/1_Utilities/deviceQuery/</code>. It enumerates the properties of the CUDA devices present in the system.
{{Term|location=node|cmd=<code class="command">/tmp/samples/1_Utilities/deviceQuery/deviceQuery</code>}}


Here is an example of the result on the orion cluster at Lyon:
<code>
orion-2:/tmp/samples/1_Utilities/deviceQuery/deviceQuery
/tmp/samples/1_Utilities/deviceQuery/deviceQuery Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla M2075"
   CUDA Driver Version / Runtime Version          7.0 / 7.0
   CUDA Capability Major/Minor version number:    2.0
   Total amount of global memory:                 5375 MBytes (5636554752 bytes)
   (14) Multiprocessors, ( 32) CUDA Cores/MP:     448 CUDA Cores
   GPU Max Clock rate:                            1147 MHz (1.15 GHz)
   Memory Clock rate:                             1566 Mhz
   Memory Bus Width:                              384-bit
   L2 Cache Size:                                 786432 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 32768
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  1536
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
   Run time limit on kernels:                     No
   Integrated GPU sharing Host Memory:            No
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Enabled
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Domain ID / Bus ID / location ID:   0 / 66 / 0
   Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Tesla M2075
Result = PASS
</code>
 
== BLAS examples ==
 
The toolkit provides the [https://developer.nvidia.com/cublas CUBLAS] library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available [http://docs.nvidia.com/cuda/cublas/index.html here], and several [http://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries advanced examples] using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...).
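As an illustration of this API (this snippet is not one of the toolkit samples; the sizes and file name are arbitrary), a minimal single-precision matrix multiply might look like the following sketch. Note that the operands are explicitly allocated on and copied to the GPU:
<source lang="cpp">
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;                         // square matrices of size n x n
    const size_t bytes = n * n * sizeof(float);
    std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n, 0.0f);

    // Allocate device memory and upload the input matrices
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // C = alpha * A x B + beta * C, computed on the GPU
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_a, n, d_b, n, &beta, d_c, n);
    cublasDestroy(handle);

    // Download the result and check one element (expected value: 2 * n)
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
</source>
Such a file would typically be compiled on a node with <code class="command">nvcc</code> and linked against CUBLAS (e.g. <code class="command">nvcc sgemm_sketch.cu -lcublas</code>, where the file name is hypothetical).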
 
The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides [http://docs.nvidia.com/cuda/nvblas/ NVBLAS], a library that automatically ''offloads'' compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program, as NVBLAS can be [http://www.manpages.info/linux/ld.so.8.html forcibly linked] using the LD_PRELOAD environment variable.
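For illustration (the matmatmul sources used below are not reproduced on this page), a plain host program calling BLAS directly might look like the following sketch. Run as-is it uses the CPU BLAS; run with LD_PRELOAD=libnvblas.so, as shown further down, the DGEMM call is offloaded to the GPU:
<source lang="cpp">
#include <cstdio>
#include <vector>

// Fortran-style DGEMM symbol provided by the host BLAS library (e.g. libblas);
// this is the kind of BLAS3 call that NVBLAS intercepts when preloaded.
extern "C" void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

int main() {
    const int n = 2000;
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
    const double alpha = 1.0, beta = 0.0;
    const char no = 'N';

    // C = A x B (column-major); every element of C should equal 2 * n
    dgemm_(&no, &no, &n, &n, &n, &alpha, a.data(), &n, b.data(), &n, &beta, c.data(), &n);
    printf("c[0] = %f\n", c[0]);
    return 0;
}
</source>
It can be compiled with, for instance, <code class="command">g++ dgemm_sketch.cpp -lblas</code> (hypothetical file name).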
 
 
To run the simpleCUBLAS example:
  /tmp/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS
 
To test NVBLAS, you can use the matrix-matrix multiplication example available in <code class="file">/grid5000/xeonphi/samples/matmatmul/</code>.
To compile:
    cp -r /grid5000/xeonphi/samples/matmatmul/ /tmp/
    cd /tmp/matmatmul
    make
 
To run on the CPU, use:
  orion-2: ./matmatmul
  Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
  BLAS  - Time elapsed:  2.672E+01 sec.
 
To offload the computation on the GPU, use:
  orion-2: cd /tmp/matmatmul
  orion-2: echo "NVBLAS_CPU_BLAS_LIB /usr/lib/libblas/libblas.so" > nvblas.conf
  orion-2: LD_PRELOAD=libnvblas.so ./matmatmulc
  [NVBLAS] Config parsed
  Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
  BLAS  - Time elapsed:  1.716E+00 sec.
 
If you want to measure the time spent on data transfers to the GPU, you can instrument the simpleCUBLAS example with timers and compare the results with the NVBLAS run above.
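Here is a minimal, self-contained sketch of this kind of instrumentation (not part of the samples; sizes and names are arbitrary). It times a host-to-device transfer with CUDA events; the same pattern can bracket the CUBLAS call itself:
<source lang="cpp">
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t n = 64 * 1024 * 1024 / sizeof(float);   // 64 MB of data
    std::vector<float> host(n, 1.0f);
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the host-to-device copy
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host-to-device copy of 64 MB: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return 0;
}
</source>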
 
=== Installing the CUDA toolkit on a deployed environment ===
 
GPU nodes at Lyon and Nancy are supported by the latest GPU drivers. For Lille and Grenoble, you have to install the NVIDIA 340.xx Legacy drivers. The following table summarizes the situation on Grid'5000 as of January 2016:
 
{| class="program" style="border:1px dotted black;"
! Site
! Cluster
! GPU
! OAR property
! Driver version
! CUDA toolkit version
|-
| Lyon
| orion (4 nodes)
| Nvidia Tesla-M2075 (1 per node)
| -p "GPU='YES'"
| 346.xx (jessie), 352.xx
| CUDA 7.0 (jessie), 7.5
|-
| Nancy
| graphique (6 nodes)
| Nvidia GTX 980 GPU (2 per node)
| -p "GPU='YES'" -q production
| 346.xx (jessie), 352.xx
| CUDA 7.0 (jessie), 7.5
|-
| Grenoble
| adonis (10 nodes)
| Nvidia Tesla-C1060 (2 per node)
| -p "GPU='YES'"
| 340.xx
| CUDA 6.5
|-
| Lille
| chirloute (8 nodes)
| Nvidia Tesla-S2050 (1 per node)
| -p "GPU='SHARED'"
| 340.xx
| CUDA 6.5
|}


==== Deployment ====


First, reserve a GPU node and deploy the <code class="file">jessie-x64-base</code> environment:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "GPU='<code class="replace">YES</code>'" -l /nodes=1,walltime=2}}
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODE_FILE -e jessie-x64-base -k}}


Once the deployment is terminated, you should be able to connect to the node as root:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE`}}


==== Downloading the NVIDIA toolkit ====

We will now install the NVIDIA drivers, compilers, libraries and examples. The complete CUDA distribution can be downloaded from [https://developer.nvidia.com/cuda-toolkit-archive the official website] or from git.grid5000.fr/sources/. Select a toolkit version compatible with your GPU hardware:
 cd /tmp/; wget git.grid5000.fr/sources/cuda_7.5.18_linux.run
 cd /tmp/; wget git.grid5000.fr/sources/cuda_7.0.28_linux.run
 cd /tmp/; wget git.grid5000.fr/sources/cuda_6.5.14_linux_64.run # chirloute, adonis


When the download is over, you can look at the installer options:
{{Term|location=node|cmd=<code class="command">sh</code> /tmp/cuda_<version>.run --help}}


There are actually three distinct installers (for the drivers, compilers and examples) embedded in this file, and you can extract them using:
{{Term|location=node|cmd=<code class="command">sh</code> /tmp/cuda_<version>.run -extract=/tmp/installers && <code class="command">cd</code> /tmp}}
It extracts 3 files. For example, with cuda_7.5.18_linux.run, you obtain:
* NVIDIA-Linux-x86_64-352.39.run: the drivers installer (version 352.39)
* cuda-linux64-rel-7.5.18-19867135.run: the CUDA toolkit installer (i.e. compilers, libraries)
* cuda-samples-linux-7.5.18-19867135.run: the CUDA samples installer
Each installer provides a --help option.


==== Driver installation ====

To install the Linux driver (i.e. the kernel module), we need the kernel header files and gcc 4.8 (as the module should be compiled with the same version of gcc that was used to compile the kernel in the first place):

<code>
apt-get -y update && apt-get -y upgrade
apt-get -y install make
apt-get -y install linux-headers-amd64 # it also installs gcc-4.8
</code>

To compile and install the kernel module, use:
  cd /tmp/installers
  CC=gcc-4.8 sh NVIDIA-Linux-x86_64-<version>.run --accept-license --silent --no-install-compat32-libs  # note: do not use --no-install-compat32-libs with CUDA 6.5
(warnings about X.Org can safely be ignored)

To install the CUDA toolkit, use:
{{Term|location=node|cmd=<code class="command">sh</code> cuda-linux64-rel-<version>.run -noprompt}}


You can add the CUDA toolkit to your current shell environment by using:
 export PATH=$PATH:/usr/local/cuda-<version>/bin
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-<version>/lib64


To make those environment variables permanent for future ssh sessions, you can add them to ~/.bashrc. Alternatively, you can edit the default PATH in <code class="file">/etc/profile</code> and add a configuration file for the dynamic linker under <code class="file">/etc/ld.so.conf.d/</code> as follows:
{{Term|location=node|cmd=<code class="command">sed</code> -e "s/:\/bin/:\/bin:\/usr\/local\/cuda-<version>\/bin/" -i /etc/profile}}
{{Term|location=node|cmd=<code class="command">echo</code> -e "/usr/local/cuda-<version>/lib\n/usr/local/cuda-<version>/lib64" > /etc/ld.so.conf.d/cuda.conf}}
You also need to run <code class="command">ldconfig</code> as root to update the linker configuration.


To check if the NVIDIA drivers are correctly installed, you can use the nvidia-smi tool:
{{Term|location=node|cmd=<code class="command">nvidia-smi</code>}}
Here is an example of the result on the adonis cluster:
  root@adonis-2:~# nvidia-smi
  Wed Dec  4 14:42:08 2013
  [...]
  | N/A  36C  N/A    N/A /  N/A |        3MB /  4095MB |    N/A      Default |
  +-------------------------------+----------------------+----------------------+
Then, you can compile and run the toolkit examples. You need the g++ compiler to do so:
 apt-get install g++
 sh cuda-samples-linux-7.5.18-19867135.run -noprompt -prefix=/tmp/samples -cudaprefix=/usr/local/cuda-7.5/
 cd /tmp/samples
 make -j8
See [[#Compiling the CUDA Toolkit examples]] for more information.

You can save your newly created environment with [[TGZ-G5K|tgz-g5k]]:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_FILE_NODE` tgz-g5k > <code class="replace">myimagewithcuda</code>.tgz}}


= Intel Xeon Phi (MIC) on Grid'5000 =


== Reserve a Xeon Phi at Nancy ==
 
Like NVIDIA GPUs, [https://en.wikipedia.org/wiki/Xeon_Phi Xeon Phi] coprocessor cards provide additional compute power and can be used to offload computations. As those extension cards run a modified Linux kernel, it is also possible to log in directly onto the Xeon Phi via ssh. It is also possible to compile applications for the Xeon Phi processor (which is based on x86 technology) and run them natively on the embedded Linux system of the Xeon Phi card. Xeon Phi [http://ark.intel.com/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core 7120P] cards are available at Nancy.


To reserve a Grid'5000 node that includes a Xeon Phi, you can use this command:
   
   
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "MIC='YES'" -t allow_classic_ssh -t mic}}


== Configuring the Intel compiler to use a license server ==
 
In order to compile programs for the Intel Xeon Phi, you have to use the Intel compilers. Intel compilers are available in /grid5000/compilers/icc13.2/ at Nancy, but they require a commercial (or academic) license, and such licenses are not provided by Grid'5000. You might have access to a license server at your local laboratory; for instance, you have access to the Inria license server if you are on the Inria network. You can then use your machine as a bridge between the license server of your local network and your Grid'5000 nodes by creating an SSH tunnel. The procedure is explained below. Alternatively, you can compile your programs elsewhere and copy your executables to Grid'5000.


Note that [https://gcc.gnu.org/wiki/Offloading GCC] and [http://openmp.llvm.org/ Clang] also provide limited support for the newest Xeon Phi. See [https://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm this page] for more information about third-party tools.


=== Using a license server ===


In the following, we will set up an SSH tunnel between a license server and a Grid'5000 node (graphite-X). The Intel compilers will be configured to use localhost:28618 as the license server, and the SSH tunnel will forward connections from localhost:28618 to the license server (you can use any local port number for this). Here, we use the Inria license server named '''jetons.inria.fr''', ports '''29030''' and '''34430'''.


On the Nancy frontend, create a license configuration file for the Intel compilers:
{{Term|location=frontend|cmd=<code class="command">mkdir</code> ~/intel}}


 cat <<EOF >> ~/intel/licenses
 SERVER localhost ANY 28618
 USE_SERVER
 EOF


Then, start an SSH tunnel:
{{Term|location=laptop|cmd=<code class="command">ssh</code> -R 28618:jetons.inria.fr:29030 -R 34430:jetons.inria.fr:34430 graphite-<code class="replace">X</code>.nancy.g5k}}
The previous command opens a shell session that can be used directly. You should keep it open as long as you need the Intel compilers.


You can also add the tunnel setup to your [https://www.grid5000.fr/mediawiki/index.php/SSH#Setting_up_a_user_config_file SSH configuration file] (.ssh/config):
 Host g5k
   Hostname access.grid5000.fr
   ...
 
 Host *.intel
   User <code class="replace">g5klogin</code>
   ForwardAgent no
   RemoteForward *:28618 jetons.inria.fr:29030
   RemoteForward *:34430 jetons.inria.fr:34430
   ProxyCommand ssh g5k -W "$(basename %h .intel):%p"

Then, to create the tunnel and connect to your node, you can simply use:
{{Term|location=laptop|cmd=<code class="command">ssh</code> graphite-<code class="replace">X</code>.nancy.intel}}

To test the tunnel, you can do:
{{Term|location=graphite|cmd=<code class="command">source</code> /opt/intel/composerxe/bin/compilervars.sh intel64}}
{{Term|location=graphite|cmd=<code class="command">icc</code> -v}}

Using Intel compilers on Grid'5000 can be rather slow due to the license server connection.


== Execution on Xeon Phi ==
An introduction to the Xeon Phi programming environment is available [http://software.intel.com/en-us/articles/intel-xeon-phi-programming-environment on the Intel website]. Other useful resources include:
* [http://spscicomp.org/wordpress/pages/the-intel-xeon-phi/ The IBM HPC Systems Scientific Computing User Group Tutorial]
* [http://www.hpc.cineca.it/content/quick-guide-intel-mic-usage CINECA/SCAI documentation].


You can check the status of the MIC card using micinfo:
{{Term|location=graphite|cmd=<code class="command">micinfo</code>}}
=== Offload mode ===


In offload mode, your program is executed on the host, but part of its execution is offloaded to the coprocessor card.
 
{{Note|text=This section uses a code snippet from the [http://software.intel.com/en-us/articles/intel-xeon-phi-programming-environment Intel tutorial]}}
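The reduction.cpp sample itself is not reproduced on this page. As a rough idea of what offloaded code looks like, here is a hypothetical sketch using the Intel compiler's offload pragma (it is not the actual sample):
<source lang="cpp">
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 24;
    std::vector<float> data(n, 1.0f);
    float *p = data.data();
    double sum = 0.0;

    // The block below runs on the coprocessor: the input array is copied in,
    // and the scalar "sum" is copied back to the host afterwards.
    #pragma offload target(mic) in(p:length(n)) inout(sum)
    {
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += p[i];
    }

    printf("sum = %.0f (expected %d)\n", sum, n);
    return 0;
}
</source>
Such a file is compiled exactly like the sample below, with <code class="command">icpc</code> and the -openmp option.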


Compile some source code:
{{Term|location=graphite|cmd=<code class="command">cd</code> /tmp}}


{{Term|location=graphite|cmd=<code class="command">source</code> /opt/intel/composerxe/bin/compilervars.sh intel64}}
{{Term|location=graphite|cmd=<code class="command">icpc</code> -openmp /grid5000/xeonphi/samples/reduction.cpp -o reduction-offload}}


And execute it:


  {{Term|location=graphite|cmd=<code class="command">./reduction-offload</code>}}


=== Native mode ===


In native mode, your program is completely executed on the Xeon Phi. Your code must be compiled natively for the Xeon Phi architecture using the -mmic option, as shown below.
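The hello.cpp sample itself is not reproduced on this page. As an illustration, a native-mode program is ordinary code (here a hypothetical OpenMP hello, not the actual sample); the only specificity is the -mmic compilation flag:
<source lang="cpp">
#include <cstdio>
#include <omp.h>

// Compiled with -mmic, this runs entirely on the Xeon Phi and should report
// the large number of hardware threads available on the coprocessor.
int main() {
    #pragma omp parallel
    {
        #pragma omp master
        printf("Hello from the Xeon Phi, running %d OpenMP threads\n",
               omp_get_num_threads());
    }
    return 0;
}
</source>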


{{Term|location=graphite|cmd=<code class="command">source</code> /grid5000/compilers/icc13.2/bin/compilervars.sh intel64}}
{{Term|location=graphite|cmd=<code class="command">icpc</code> /grid5000/xeonphi/samples/hello.cpp -openmp -mmic -o hello-native}}


You need to connect to the card using SSH first.
Login on the Phi:


{{Term|location=graphite|cmd=<code class="command">ssh</code> mic0}}


And execute:
{{Term|location=graphite-mic0|cmd=<code class="command"> source</code> /grid5000/xeonphi/micenv}}
{{Term|location=graphite-mic0|cmd=<code class="command"> ./hello-native</code>}}
 
== Use Xeon Phi from a "min" deployed environment ==
 
=== Reserve and deploy a node with MIC ===
 
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "MIC='YES'" -t deploy}}
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODEFILE -k -e wheezy-x64-min}}
 
You can now log in to the node:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE`}}
 
=== Install Xeon Phi drivers ===
 
MPSS drivers are available for wheezy in the Grid'5000 Debian repository; you just need to uncomment this line in /etc/apt/sources.list:
deb http://apt.grid5000.fr/debian sid main
 
Get the Grid'5000 keyring:
{{Term|location=node|cmd=<code class="command">apt-get</code> update && apt-get install grid5000-keyring -y --force-yes}}
Install the MPSS packages:
{{Term|location=node|cmd=<code class="command">apt-get</code> install mpss-modules-3.2.0-4-amd64 mpss-micmgmt mpss-miccheck mpss-coi mpss-mpm mpss-metadata mpss-miccheck-bin glibc2.12.2pkg-libsettings0 glibc2.12.2pkg-libmicmgmt0 libscif0 mpss-daemon mpss-boot-files mpss-sdk-k1om intel-composerxe-compat-k1om g++ -y --force-yes}}
 
=== Copy configuration files ===
 
Configuration files are available on the NFS server, so we mount a partition to retrieve them.
{{Term|location=node|cmd=<code class="command">apt-get</code> install nfs-common -y}}
{{Term|location=node|cmd=<code class="command">mount</code> nfs:/export/grid5000 /grid5000/}}
To mount <code class="file">/grid5000</code> automatically at boot, append the following line to <code class="file">/etc/fstab</code>:
  nfs:/export/grid5000/ /grid5000/ nfs defaults 0 0

After that, we can copy the configuration files:
{{Term|location=node|cmd=<code class="command">cp</code> /grid5000/xeonphi/conf/default.conf /etc/mpss/}}
{{Term|location=node|cmd=<code class="command">cp</code> /grid5000/xeonphi/conf/mic0.conf /etc/mpss/}}
{{Term|location=node|cmd=<code class="command">cp</code> /grid5000/xeonphi/conf/mpss /etc/init.d/}}
{{Term|location=node|cmd=<code class="command">cp</code> /grid5000/xeonphi/conf/interfaces /etc/network}}
 
Start mpss on boot:
{{Term|location=node|cmd=<code class="command">update-rc.d</code> mpss defaults}}
 
Load the mic module at boot:
{{Term|location=node|cmd=<code class="command">echo</code> mic >> /etc/modules}}
 
If you want to mount /home and /grid5000 on the MIC, you should:

Add a new file /var/mpss/mic0/etc/fstab containing:
 nfs:/export/home /home nfs rsize=8192,wsize=8192,nolock,intr 0 0
 nfs:/export/grid5000 /grid5000 nfs rsize=8192,wsize=8192,nolock,intr 0 0

And add these lines to /var/mpss/mic0.filelist:
file /etc/fstab etc/fstab 644 0 0
dir /grid5000 755 0 0
 
Now we can reboot the machine:
{{Term|location=node|cmd=<code class="command">reboot</code>}}
 
After the reboot, you can use the MIC as usual. See [[#Configuring the Intel compiler to use a license server|above]] for more details.


{{Note|text= On a freshly installed node, don't forget to recreate a <code class=file>~/intel/licenses</code> file, except if you manually mounted your frontend home on your node home}}
To compile the MKL version of the matrix-matrix multiplication example, both for the host and for the Xeon Phi (with -mmic):
{{Term|location=graphite-mic0|cmd=source /grid5000/software/intel/mkl/bin/mklvars.sh mic}}
{{Term|location=graphite-mic0|cmd=icpc matmatmul_mkl.c -openmp -o matmatmul_mkl -mkl}}
{{Term|location=graphite-mic0|cmd=icpc matmatmul_mkl.c -openmp -o matmatmul_mkl -mkl -mmic}}
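The matmatmul_mkl.c source is not reproduced on this page. A minimal MKL-based matrix multiply might look like the following sketch (hypothetical; the real sample may differ):
<source lang="cpp">
#include <cstdio>
#include <vector>
#include <mkl.h>

int main() {
    const int n = 5000;
    std::vector<double> a((size_t)n * n, 1.0), b((size_t)n * n, 2.0), c((size_t)n * n, 0.0);

    // C = A x B using MKL's DGEMM; the -mmic build runs the same call
    // natively on the coprocessor.
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);

    printf("c[0] = %f (expected %d)\n", c[0], 2 * n);
    return 0;
}
</source>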

Revision as of 11:15, 20 January 2016


Introduction

This tutorial presents how to use GPU Accelerators and Intel Xeon Phi Coprocessors on Grid'5000. You will learn to reserve these resources, setup the environnement and execute codes on the accelerators. The dense matrix-matrix multiplication example of this tutorial can be used as a toy benchmark to compare the performance of accelerators and/or BLAS implementations. Please note that this page is not about GPU or Xeon Phi programming and only focus on the specificities of the Grid'5000 plateform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is useful for either testing the latest version of a driver or ensuring the reproducibility of your experiments (by freezing its context).

This tutorial is divided into two distinct parts that can be done in any order:

* #GPU accelerators on Grid'5000
* #Intel Xeon Phi (MIC) on Grid'5000

For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the Getting Started tutorial first to get familiar with the plateform (connections to the plateform, resource reservations) and its basic concepts (job scheduling, environment deployment). Special:G5KHardware is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see [Status]).

GPU accelerators on Grid'5000

In this section, we first reserve a GPU node. We then compile and execute examples provided by the CUDA Toolkit on the default (production) environment. We also run our [1] example to illustrate GPU performance for dense matrix multiply. Finally, we deploy a jessie-x64-base environment and install the NVIDIA drivers and compilers before validating the installation on the previous example set.

http://docs.nvidia.com/cuda/cuda-samples/index.html#matrix-multiplication--cublas-

Selection of GPU nodes

You can reserve a GPU node by simply requesting resources with the OAR "GPU" property:

Terminal.png frontend:
oarsub -I -p "GPU='YES'"

At Lille (on chirloute), you have to use the GPU='SHARED" property instead:

Terminal.png lille:
oarsub -I -p "GPU='SHARED'"

The reason is that those GPUs shared enclosures by groups of four and can only be rebooted in groups. You may encounter some difficulties on those shared GPUs but if nvidia-smi -q does not detect the GPU on your node, you can find troubleshooting information on this page.

At Nancy, you have to use the production queue to get ressources from graphite (and you also have to comply with the [UserCharter|usage policiy] of production resources)

Terminal.png frontend:
oarsub -I -p "GPU='YES'" -q production

NVIDIA drivers 346.22 (see `nvidia-smi`) and CUDA 7.0 (`nvcc --version`) compilation tools are installed by default on nodes. This version of the drivers only support the most recent GPU accelerators such as the GPU installed on orion (Lyon) and graphite (in the [Nancy:Production|production queue]] of Nancy). You can use the GPU accelerators of adonis (Grenoble) and chirloute (Lille) with the NVIDIA 340.xx Legacy drivers on deployed environments. You can deploy the ready-to-use wheezy-x64-prod environment or see [#Installing the CUDA toolkit on a deployed environement] for using those GPUs with Debian Jessie.

Downloading the CUDA Toolkit examples

We download CUDA 7.0 samples and extract them on /tmp/samples:

Terminal.png node:
sh cuda-samples-linux-7.0.28-19326674.run -noprompt -prefix=/tmp/samples
Note.png Note

These samples are part of the CUDA 7.0 Toolkit and can also be extracted from the toolkit installer using the --extract=/path option.

Note.png Note

On adonis and chirloute, install the CUDA 5.0 samples (cuda-samples_5.0.35_linux.run).

The CUDA examples are described in /tmp/samples/Samples.html. You might also want to have a look at the doc directory or the online documentation.

Compiling the CUDA Toolkit examples

You can also compile all the examples at once but it will take a while. From the CUDA samples source directory (/tmp/samples), run make to compile examples:

Terminal.png node:
cd /tmp/samples
Terminal.png node:
make -j8

The compilation of all the examples is over when "Finished building CUDA samples" is printed. Alternatively, each example can also be compiled separately from its own directory.

You can first try the Device Query example located in /tmp/samples/1_Utilities/deviceQuery/. It enumerates the properties of the CUDA devices present in the system.

Terminal.png node:
/tmp/samples/1_Utilities/deviceQuery/deviceQuery

Here is an example of the result on the orion cluster at Lyon: orion-2:/tmp/samples/1_Utilities/deviceQuery/deviceQuery /tmp/samples/1_Utilities/deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla M2075"

 CUDA Driver Version / Runtime Version          7.0 / 7.0
 CUDA Capability Major/Minor version number:    2.0
 Total amount of global memory:                 5375 MBytes (5636554752 bytes)
 (14) Multiprocessors, ( 32) CUDA Cores/MP:     448 CUDA Cores
 GPU Max Clock rate:                            1147 MHz (1.15 GHz)
 Memory Clock rate:                             1566 Mhz
 Memory Bus Width:                              384-bit
 L2 Cache Size:                                 786432 bytes
 Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
 Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
 Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       49152 bytes
 Total number of registers available per block: 32768
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  1536
 Maximum number of threads per block:           1024
 Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
 Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             512 bytes
 Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Enabled
 Device supports Unified Addressing (UVA):      Yes
 Device PCI Domain ID / Bus ID / location ID:   0 / 66 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Tesla M2075 Result = PASS

BLAS examples

The toolkit provides the [2] library which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available [3] and several examples using CUBLAS are also available (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...) in the toolkit distribution.

The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays but the toolkit also provides [4], a library that automatically *offload* compute-intensive BLAS3 routines (ie. matrix-matrix operations) to the GPU. It turns any application that call BLAS routines on the Host to a GPU-accelerated program. In addition, there is no need to recompile the program as NVBLAS can be linked using the LD_PRELOAD environment variable.


 /tmp/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS

To test NVBLAS, you can use the matrix-matrix multiplication example available in grid5000/xeonphi/samples/matmatmul/. To compile:

   cp -r /grid5000/xeonphi/samples/matmatmul/ /tmp/
   cd /tmp/matmatmul
   make

To run on the CPU, use:

 orion-2: ./matmatmul 
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  2.672E+01 sec.

To offload the computation on the GPU, use:

 orion-2: cd /tmp/matmatmul
 orion-2: echo "NVBLAS_CPU_BLAS_LIB /usr/lib/libblas/libblas.so" > nvblas.conf
 orion-2: LD_PRELOAD=libnvblas.so ./matmatmul
 [NVBLAS] Config parsed
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  1.716E+00 sec.
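
For reference, NVBLAS only requires the program to call a standard BLAS routine on host arrays. Below is a minimal sketch of such a host-side dgemm call (illustrative only: it is not the actual matmatmul source, and the matrix size and file name are arbitrary). Built with, for instance, g++ dgemm_sketch.cpp -lblas -o dgemm_sketch, it runs on the CPU by default and is offloaded to the GPU when launched with LD_PRELOAD=libnvblas.so:

 // Host-side BLAS call that NVBLAS can intercept at run time (sketch only).
 #include <cstdio>
 #include <vector>
 
 // Fortran-style BLAS interface, resolved in libblas (or libnvblas when preloaded)
 extern "C" void dgemm_(const char *transa, const char *transb,
                        const int *m, const int *n, const int *k,
                        const double *alpha, const double *a, const int *lda,
                        const double *b, const int *ldb,
                        const double *beta, double *c, const int *ldc);
 
 int main() {
     const int n = 2000;                     // arbitrary size for illustration
     std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
     const double alpha = 1.0, beta = 0.0;
     const char no = 'N';
 
     // C = alpha * A * B + beta * C ; this is the BLAS3 call NVBLAS offloads
     dgemm_(&no, &no, &n, &n, &n, &alpha, A.data(), &n, B.data(), &n,
            &beta, C.data(), &n);
 
     std::printf("C[0] = %f\n", C[0]);
     return 0;
 }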

If you want to measure the time spent on data transfers to and from the GPU, you can instrument the simpleCUBLAS example with timers and compare the results with the NVBLAS run above.
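
As a starting point, the sketch below uses cudaEvent_t timers to separate the host-to-device transfer, the SGEMM compute and the device-to-host transfer times. It is a hedged illustration, not an excerpt of the simpleCUBLAS source: the matrix size, file name and absence of error checking are arbitrary. It can be compiled with, e.g., nvcc timing_sketch.cpp -lcublas:

 // Timing GPU data transfers vs. compute with CUDA events (illustrative sketch).
 #include <cstdio>
 #include <vector>
 #include <cuda_runtime.h>
 #include <cublas_v2.h>
 
 int main() {
     const int n = 4096;                               // arbitrary size
     const size_t bytes = (size_t)n * n * sizeof(float);
     std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);
 
     float *dA, *dB, *dC;
     cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
 
     cublasHandle_t handle;
     cublasCreate(&handle);
 
     cudaEvent_t t0, t1, t2, t3;
     cudaEventCreate(&t0); cudaEventCreate(&t1);
     cudaEventCreate(&t2); cudaEventCreate(&t3);
 
     cudaEventRecord(t0);                              // start of host->device copies
     cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
     cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
     cudaEventRecord(t1);                              // start of compute
 
     const float alpha = 1.0f, beta = 0.0f;
     cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, dA, n, dB, n, &beta, dC, n);
     cudaEventRecord(t2);                              // start of device->host copy
 
     cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
     cudaEventRecord(t3);
     cudaEventSynchronize(t3);
 
     float toGpu, gemm, toHost;
     cudaEventElapsedTime(&toGpu, t0, t1);
     cudaEventElapsedTime(&gemm, t1, t2);
     cudaEventElapsedTime(&toHost, t2, t3);
     printf("H2D: %.1f ms, SGEMM: %.1f ms, D2H: %.1f ms\n", toGpu, gemm, toHost);
 
     cublasDestroy(handle);
     cudaFree(dA); cudaFree(dB); cudaFree(dC);
     return 0;
 }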

Installing the CUDA toolkit on a deployed environment

GPU nodes at Lyon and Nancy are supported by the latest GPU drivers. For Lille and Grenoble, you have to install the NVIDIA 340.xx legacy drivers. The following table summarizes the situation on Grid'5000 as of January 2016:

{| class="wikitable"
! Site !! Cluster !! GPU !! OAR property !! Driver version !! CUDA toolkit version
|-
| Lyon || orion (4 nodes) || Nvidia Tesla-M2075 (1 per node) || -p "GPU='YES'" || 346.xx (jessie), 352.xx || CUDA 7.0 (jessie), 7.5
|-
| Nancy || graphique (6 nodes) || Nvidia GTX 980 (2 per node) || -p "GPU='YES'" -q production || 346.xx (jessie), 352.xx || CUDA 7.0 (jessie), 7.5
|-
| Grenoble || adonis (10 nodes) || Nvidia Tesla-C1060 (2 per node) || -p "GPU='YES'" || 340.xx || CUDA 6.5
|-
| Lille || chirloute (8 nodes) || Nvidia Tesla-S2050 (1 per node) || -p "GPU='SHARED'" || 340.xx || CUDA 6.5
|}

Deployment

First, reserve a GPU node and deploy the jessie-x64-base environment:

Terminal.png frontend:
oarsub -I -t deploy -p "GPU='YES'" -l /nodes=1,walltime=2
Terminal.png frontend:
kadeploy3 -f $OAR_NODE_FILE -e jessie-x64-base -k

Once the deployment is terminated, you should be able to connect to the node as root:

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE`

Downloading the NVIDIA toolkit

We will now install the NVIDIA drivers, compilers, libraries and examples. The complete CUDA distribution can be downloaded from the official website (https://developer.nvidia.com/cuda-toolkit-archive) or from git.grid5000.fr/sources/. Select a toolkit version compatible with your GPU hardware:

 cd /tmp/; wget git.grid5000.fr/sources/cuda_7.5.18_linux.run
 cd /tmp/; wget git.grid5000.fr/sources/cuda_7.0.28_linux.run
 cd /tmp/; wget git.grid5000.fr/sources/cuda_6.5.14_linux_64.run  # chirloute, adonis


When the download is complete, you can look at the installer options:

Terminal.png node:
sh /tmp/cuda_<version>.run --help

There are actually three distinct installers (for the drivers, the compilers and the examples) embedded in this file, and you can extract them using:

Terminal.png node:
sh /tmp/cuda_<version>.run -extract=/tmp/installers && cd /tmp

It extracts 3 files. For example, with cuda_7.5.18_linux.run, you obtain:

  • NVIDIA-Linux-x86_64-352.39.run: the drivers installer (version 352.39)
  • cuda-linux64-rel-7.5.18-19867135.run: the CUDA toolkit installer (i.e. compilers, libraries)
  • cuda-samples-linux-7.5.18-19867135.run: the CUDA samples installer

Each installer provides a --help option.

Driver installation

To install the Linux driver (i.e. the kernel module), we need the kernel header files and gcc 4.8 (the module must be compiled with the same gcc version as the one used to compile the kernel in the first place):

 apt-get -y update && apt-get -y upgrade
 apt-get -y install make
 apt-get -y install linux-headers-amd64  # it also installs gcc-4.8

To compile and install the kernel module, use:

 cd /tmp/installers
 CC=gcc-4.8 sh NVIDIA-Linux-x86_64-<version>.run --accept-license --silent --no-install-compat32-libs  # note: do not use --no-install-compat32-libs with CUDA 6.5

(warnings about X.Org can safely be ignored)

To install the CUDA toolkit, use:

Terminal.png node:
sh cuda-linux64-rel-<version>.run -noprompt

You can add the CUDA toolkit to your current shell environment by using:

 export PATH=$PATH:/usr/local/cuda-<version>/bin
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-<version>/lib64

To make those environment variables permanent for future SSH sessions, you can add them to ~/.bashrc. Alternatively, you can edit the default PATH in /etc/profile and add a configuration file for the dynamic linker under /etc/ld.so.conf.d/ as follows:

Terminal.png node:
sed -e "s/:\/bin/:\/bin:\/usr\/local\/cuda-<version>\/bin/" -i /etc/profile
Terminal.png node:
echo -e "/usr/local/cuda-<version>/lib\n/usr/local/cuda-<version>/lib64" > /etc/ld.so.conf.d/cuda.conf

You also need to run ldconfig as root to update the linker configuration.

To check if NVIDIA drivers are correctly installed, you can use the nvidia-smi tool:

Terminal.png node:
nvidia-smi

Here is an example of the result on the adonis cluster:

root@adonis-2:~# nvidia-smi 
Wed Dec  4 14:42:08 2013       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T10 Proce...  Off  | 0000:0A:00.0     N/A |                  N/A |
| N/A   36C  N/A     N/A /  N/A |        3MB /  4095MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T10 Proce...  Off  | 0000:0C:00.0     N/A |                  N/A |
| N/A   36C  N/A     N/A /  N/A |        3MB /  4095MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

Then, you can compile and run the toolkit examples. You need the g++ compiler to do so:

 apt-get install g++
 sh cuda-samples-linux-7.5.18-19867135.run -noprompt -prefix=/tmp/samples -cudaprefix=/usr/local/cuda-7.5/
 cd /tmp/samples
 make -j8

See the "Compiling the CUDA Toolkit examples" section above for more information.

You can save your newly created environment with tgz-g5k:

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE` tgz-g5k > myimagewithcuda.tgz

Intel Xeon Phi (MIC) on Grid'5000

Reserve a Xeon Phi at Nancy

Like NVIDIA GPUs, Xeon Phi coprocessor cards provide additional compute power and can be used to offload computations. As these extension cards run a modified Linux kernel, it is also possible to log in directly to the Xeon Phi via SSH. It is also possible to compile applications for the Xeon Phi processor (which is based on x86 technology) and run them natively on the embedded Linux system of the Xeon Phi card. Xeon Phi cards are available at Nancy (graphite cluster).

To reserve a Grid'5000 node that includes a Xeon Phi, you can use this command:

Terminal.png frontend:
oarsub -I -p "MIC='YES'" -t allow_classic_ssh -t mic

Configuring the Intel compiler to use a license server

In order to compile programs for the Intel Xeon Phi, you have to use the Intel compilers. Intel compilers are available in /grid5000/compilers/icc13.2/ at Nancy, but they require a commercial (or academic) license, and such licenses are not provided by Grid'5000. You might have access to a license server in your local laboratory; for instance, you have access to the Inria license server if you are on the Inria network. You can then use your machine as a bridge between the license server of your local network and your Grid'5000 nodes by creating an SSH tunnel. The procedure is explained below. Alternatively, you can compile your programs elsewhere and copy the executables to Grid'5000.

Note that other compilers, such as Clang (http://openmp.llvm.org/), also provide limited support for the newest Xeon Phi. See the Third Party Tools page for more information.

Using a license server

In the following, we will set up an SSH tunnel between a license server and a Grid'5000 node (graphite-X). The Intel compilers will be configured to use localhost:28618 as the license server, and the SSH tunnel will forward connections from localhost:28618 to the actual license server (any free local port number can be used). In this example, we use the Inria license server named jetons.inria.fr, with ports 29030 and 34430.

On the Nancy frontend, create a license configuration file for the Intel compilers:

Terminal.png frontend:
mkdir ~/intel
cat <<EOF >> ~/intel/licenses 
SERVER localhost ANY 28618
USE_SERVER
EOF

Then, start an SSH tunnel:

Terminal.png laptop:
ssh -R 28618:jetons.inria.fr:29030 -R 34430:jetons.inria.fr:34430 graphite-X.nancy.g5k

The previous command opens a shell session that can be used directly. Keep it open as long as you need the Intel compilers.

You can also add the tunnel setup to your configuration file (.ssh/config):

Host g5k
 Hostname access.grid5000.fr
 [...]

Host *.intel
 User g5klogin
 ForwardAgent no
 RemoteForward *:28618 jetons.inria.fr:29030
 RemoteForward *:34430 jetons.inria.fr:34430
 ProxyCommand ssh g5k -W "$(basename %h .intel):%p"

Then, to create the tunnel and connect to your node, you can simply use:

Terminal.png laptop:
ssh graphite-X.nancy.intel

To test the tunnel, you can do:

Terminal.png graphite:
source /opt/intel/composerxe/bin/compilervars.sh intel64
Terminal.png graphite:
icc -v

Using Intel compilers on Grid'5000 can be rather slow due to the license server connection.

Execution on Xeon Phi

An introduction to the Xeon Phi programming environment, along with other useful resources, is available on the Intel website.

You can check the status of the MIC card using micinfo:

Terminal.png graphite:
micinfo

Offload mode

In offload mode, your program is executed on the Host, but part of its execution is offloaded to the co-processor card.

Note.png Note

This section uses a code snippet from the Intel tutorial.

Compile some source code:

cd /tmp

Terminal.png graphite:
source /opt/intel/composerxe/bin/compilervars.sh intel64
Terminal.png graphite:
icpc -openmp /grid5000/xeonphi/samples/reduction.cpp -o reduction-offload

And execute it:

Terminal.png graphite:
./reduction-offload
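
To give an idea of what offload-mode code looks like, here is a minimal sketch using the Intel offload pragma together with an OpenMP reduction. It is illustrative only (not the actual reduction.cpp sample; array size and names are arbitrary) and compiles with icpc -openmp as shown above:

 // Offload-mode sketch: the marked region runs on the Xeon Phi (mic:0),
 // the rest of the program runs on the host (illustrative only).
 #include <cstdio>
 
 int main() {
     const int n = 1000000;
     double *a = new double[n];
     for (int i = 0; i < n; i++) a[i] = 1.0;
 
     double sum = 0.0;
     // Copy 'a' to the coprocessor, run the loop there, copy 'sum' back.
     #pragma offload target(mic:0) in(a : length(n)) inout(sum)
     {
         #pragma omp parallel for reduction(+:sum)
         for (int i = 0; i < n; i++)
             sum += a[i];
     }
 
     printf("sum = %f\n", sum);
     delete[] a;
     return 0;
 }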

Native mode

In native mode, your program is completely executed on the Xeon Phi. Your code must be compiled natively for the Xeon Phi architecture using the -mmic option:

Terminal.png graphite:
icpc /grid5000/xeonphi/samples/hello.cpp -openmp -mmic -o hello-native
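
For reference, a native-mode program is just ordinary C++/OpenMP code; the following minimal sketch is in the spirit of the hello.cpp sample (not its actual content) and builds with the -mmic command shown above:

 // Native-mode sketch: the whole program runs on the Xeon Phi when built
 // with -mmic (illustrative only).
 #include <cstdio>
 #include <omp.h>
 
 int main() {
     #pragma omp parallel
     {
         // On a Xeon Phi, this typically reports a few hundred threads.
         printf("Hello from thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads());
     }
     return 0;
 }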

To run the resulting binary, you first need to connect to the card using SSH. Log in to the Phi:

Terminal.png graphite:
ssh mic0
Note.png Note

Your home directory is available from inside the MIC (as is the /grid5000 directory).

And execute:

Terminal.png graphite-mic0:
source /grid5000/xeonphi/micenv
Terminal.png graphite-mic0:
./hello-native

The Intel MKL can also be used with the Xeon Phi. For example, to build the matmatmul_mkl.c matrix-matrix multiplication example, first set up the MKL environment, then compile it with -mkl (host version) or with -mkl -mmic (native Xeon Phi version):

Terminal.png graphite-mic0:
source /grid5000/software/intel/mkl/bin/mklvars.sh mic
Terminal.png graphite-mic0:
icpc matmatmul_mkl.c -openmp -o matmatmul_mkl -mkl
Terminal.png graphite-mic0:
icpc matmatmul_mkl.c -openmp -o matmatmul_mkl -mkl -mmic