GPUs on Grid5000



Purpose

This page presents the usage of computing accelerators, such as GPUs and Xeon Phi, on Grid'5000. You will learn how to reserve these resources and execute code on them. If you prefer to deploy your own environment (for instance, to install the latest drivers), some guidelines are also provided.

Please note that this tutorial is not about GPU or Xeon Phi programming. Many documents are available on the Web on this subject.


Pre-requisite

  • A basic knowledge of Grid'5000 is required; we suggest you read the Getting Started tutorial first.
  • Information about hardware accelerator availability can be found on Special:G5KHardware.

GPU on Grid'5000

In this section, we will first compile and run the CUDA examples; in the second part, we will install the NVIDIA drivers and CUDA 5.5 on a plain wheezy-x64-base environment.


Using CUDA

Download and compile examples

GPUs are available at:

  • Grenoble (adonis)
  • Lyon (orion)
  • Lille (chirloute)
  • Nancy (in the production queue)

NVIDIA driver 346.22 (see `nvidia-smi`) and the CUDA 7.0 compilation tools (see `nvcc --version`) are installed by default on the nodes.
You can reserve a node with a GPU using the OAR GPU property. For Grenoble and Lyon:

Terminal.png frontend:
oarsub -I -p "GPU='YES'"

or for Lille:

Terminal.png lille:
oarsub -I -p "GPU='SHARED'"
Warning.png Warning

Please note that Lille GPUs are shared between nodes, so the oarsub command differs from the one used at Grenoble or Lyon. You may encounter some trouble when using GPUs at Lille; you can read more about them on Lille:GPU.
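If you prefer batch mode over an interactive job, you can also submit a script; a minimal sketch, assuming my_gpu_job.sh is a hypothetical script of yours:

Terminal.png frontend:
oarsub -p "GPU='YES'" -l /nodes=1,walltime=2 ./my_gpu_job.sh

At Nancy, GPU nodes belong to the production queue, which is selected with the -q production option of oarsub.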

We will then download CUDA 7.0 samples and install them.

Terminal.png node:
sh cuda-samples-linux-7.0.28-19326674.run -noprompt -prefix=/tmp/samples
Note.png Note

These samples are part of the CUDA Toolkit and can be extracted from the toolkit installer using the --extract=/path option.



Then you can go to the installation path, in our case /tmp/samples/. If you list the directory, you will see several folders whose names start with a digit; these contain the CUDA examples. The examples are described in the document named Samples.html. You might also want to have a look at the doc directory or at the online documentation.

We will now compile the examples; this will take some time. From the CUDA samples installation directory (/tmp/samples), run make:

Terminal.png node:
cd /tmp/samples
Terminal.png node:
make -j8
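If you are only interested in a single example, you can also build it from its own directory; for instance, for deviceQuery (assuming the installation path used above):

Terminal.png node:
cd /tmp/samples/1_Utilities/deviceQuery && make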

The process is complete when "Finished building CUDA samples" is printed. You should then be able to run the CUDA examples. We will try the one named deviceQuery, located in /tmp/samples/1_Utilities/deviceQuery/. This sample enumerates the properties of the CUDA devices present in the system.

Terminal.png node:
/tmp/samples/1_Utilities/deviceQuery/deviceQuery

This is an example of the result on the adonis cluster at Grenoble:

ebertoncello@adonis-2:/tmp/samples$ ./1_Utilities/deviceQuery/deviceQuery 
1_Utilities/deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla T10 Processor"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    1.3
 Total amount of global memory:                 4096 MBytes (4294770688 bytes)
 (30) Multiprocessors x (  8) CUDA Cores/MP:    240 CUDA Cores
 GPU Clock rate:                                1296 MHz (1.30 GHz)
 Memory Clock rate:                             800 Mhz
 Memory Bus Width:                              512-bit
 Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
 Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       16384 bytes
 Total number of registers available per block: 16384
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  1024
 Maximum number of threads per block:           512
 Maximum sizes of each dimension of a block:    512 x 512 x 64
 Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             256 bytes
 Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Disabled
 Device supports Unified Addressing (UVA):      No
 Device PCI Bus ID / PCI location ID:           12 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla T10 Processor"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    1.3
 Total amount of global memory:                 4096 MBytes (4294770688 bytes)
 (30) Multiprocessors x (  8) CUDA Cores/MP:    240 CUDA Cores
 GPU Clock rate:                                1296 MHz (1.30 GHz)
 Memory Clock rate:                             800 Mhz
 Memory Bus Width:                              512-bit
 Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
 Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       16384 bytes
 Total number of registers available per block: 16384
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  1024
 Maximum number of threads per block:           512
 Maximum sizes of each dimension of a block:    512 x 512 x 64
 Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             256 bytes
 Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Disabled
 Device supports Unified Addressing (UVA):      No
 Device PCI Bus ID / PCI location ID:           10 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 2, Device0 = Tesla T10 Processor, Device1 = Tesla T10 Processor

Install CUDA from a base environment

Reservation and deployment

We will now install up-to-date NVIDIA drivers and CUDA 5.5. First reserve a node, then deploy wheezy-x64-base.

Terminal.png frontend:
oarsub -I -t deploy -p "GPU='YES'" -l /nodes=1,walltime=2
Terminal.png frontend:
kadeploy3 -f $OAR_NODE_FILE -e wheezy-x64-base -k


Once the deployment has finished, you should be able to connect to the node and download the CUDA 5.5 installer to /tmp.

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE`
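For instance, you can fetch the installer from the node with wget (the URL below is only a placeholder; use the actual download link from NVIDIA's CUDA archive):

Terminal.png node:
cd /tmp && wget <URL-of-cuda_5.5.22_linux_64.run>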


When the download is over, you can display the installation options of the CUDA installer:

Terminal.png node:
sh /tmp/cuda_5.5.22_linux_64.run --help
root@adonis-2:/tmp# sh cuda_5.5.22_linux_64.run --help
Options:
   -help                      : Print help message
   -driver                    : Install NVIDIA Display Driver
   -uninstall                 : Uninstall NVIDIA Display Driver
   -toolkit                   : Install CUDA 5.5 Toolkit (default: /usr/local/cuda-5.5)
   -toolkitpath=<PATH>        : Specify a custom path for CUDA location
   -samples                   : Install CUDA 5.5 Samples (default: /usr/local/cuda-5.5/samples)
   -samplespath=<PATH>        : Specify a custom path for Samples location
   -silent                    : Run in silent mode. Implies acceptance of the EULA
   -verbose                   : Run in verbose mode
   -extract=<PATH>            : Extract individual installers from the .run file to PATH
   -optimus                   : Install driver support for Optimus
   -override                  : Overrides the installation checks (compiler, lib, etc)
   -kernel-source-path=<PATH> : Points to a non-default kernel source location
   -tmpdir <PATH>             : Use <PATH> as temporary directory - useful when /tmp is noexec

We can extract the individual installers from this single file:

Terminal.png node:
sh /tmp/cuda_5.5.22_linux_64.run -extract=/tmp && cd /tmp


This will extract 3 installers:

  • NVIDIA-Linux-x86_64-319.37.run: NVIDIA drivers
  • cuda-linux64-rel-5.5.22-16488124.run: CUDA installer
  • cuda-samples-linux-5.5.22-16488124.run: CUDA samples installer


Each of them also has its own installation options:

Terminal.png node:
sh NVIDIA-Linux-x86_64-319.37.run --help
Terminal.png node:
sh cuda-linux64-rel-5.5.22-16488124.run --help
Terminal.png node:
sh cuda-samples-linux-5.5.22-16488124.run --help

Installation

We are now ready to install the NVIDIA drivers and CUDA 5.5. We will use gcc-4.6, which is required for this installation.

Terminal.png node:
CC=/usr/bin/gcc-4.6 sh NVIDIA-Linux-x86_64-319.37.run --accept-license --silent --disable-nouveau -X --kernel-name=`uname -r`
Terminal.png node:
CC=/usr/bin/gcc-4.6 sh cuda-linux64-rel-5.5.22-16488124.run -noprompt


NVIDIA drivers and CUDA 5.5 are now installed, but some more configuration is needed: add /usr/local/cuda-5.5/bin to $PATH, and add /usr/local/cuda-5.5/lib64 and /usr/local/cuda-5.5/lib to $LD_LIBRARY_PATH.

Terminal.png node:
export PATH=$PATH:/usr/local/cuda-5.5/bin
Terminal.png node:
export LD_LIBRARY_PATH=/usr/local/cuda-5.5/lib64:/usr/local/cuda-5.5/lib:$LD_LIBRARY_PATH
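You can quickly check that the CUDA compiler is now found in your PATH:

Terminal.png node:
nvcc --version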

However, these changes are not permanent: they will be lost when you disconnect from your SSH session. To make them permanent, edit /etc/profile and create a file under the /etc/ld.so.conf.d/ directory:

Terminal.png node:
sed -e "s/:\/bin/:\/bin:\/usr\/local\/cuda-5.5\/bin/" -i /etc/profile
Terminal.png node:
echo -e "/usr/local/cuda-5.5/lib\n/usr/local/cuda-5.5/lib64" > /etc/ld.so.conf.d/cuda.conf && ldconfig
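Optionally, verify that the CUDA libraries are now known to the dynamic linker:

Terminal.png node:
ldconfig -p | grep cuda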


The installation is almost done. The last step is to add a script that enables the GPUs, and to execute it at start-up.

  • Create a file named /usr/local/bin/enable-gpu and copy the following script into it.
#!/bin/bash

# Load the NVIDIA kernel module.
/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  # Create one device file per GPU, plus the control device.
  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
  • Give execution rights to the script file:
Terminal.png node:
chmod +x /usr/local/bin/enable-gpu
  • Finally, add a line in the /etc/rc.local file to execute your script at boot time.
Terminal.png node:
sed -e "s/exit 0//" -i /etc/rc.local; echo -e "#Enable GPUs at boot time\nsh /usr/local/bin/enable-gpu\n\nexit 0" >> /etc/rc.local
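To enable the GPUs immediately without rebooting, you can run the script once by hand and check that the device files were created:

Terminal.png node:
sh /usr/local/bin/enable-gpu && ls -l /dev/nvidia*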


Check installation

To check that the NVIDIA drivers are correctly installed, you can use the nvidia-smi tool.

Terminal.png node:
nvidia-smi

This is an example of the result on the adonis cluster:

root@adonis-2:~# nvidia-smi 
Wed Dec  4 14:42:08 2013       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T10 Proce...  Off  | 0000:0A:00.0     N/A |                  N/A |
| N/A   36C  N/A     N/A /  N/A |        3MB /  4095MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T10 Proce...  Off  | 0000:0C:00.0     N/A |                  N/A |
| N/A   36C  N/A     N/A /  N/A |        3MB /  4095MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+


To check the CUDA installation, we will compile the CUDA samples.

  • Install dependencies
Terminal.png node:
apt-get update && apt-get -y install freeglut3-dev libxmu-dev libxi-dev
  • Uncompress samples and compile them
Terminal.png node:
sh cuda-samples-linux-5.5.22-16488124.run -cudaprefix=/usr/local/cuda-5.5

You will be prompted to accept the EULA and to choose an installation path. In this example, we will install the samples in /usr/local/cuda-5.5/samples.
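Alternatively, the samples installer can be run non-interactively; a sketch, assuming it accepts the same -noprompt and -prefix options as the CUDA 7.0 samples installer used earlier:

Terminal.png node:
sh cuda-samples-linux-5.5.22-16488124.run -noprompt -cudaprefix=/usr/local/cuda-5.5 -prefix=/usr/local/cuda-5.5/samples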

  • To compile the CUDA samples, go to the installation directory and then type make.
  • Once compilation is over, you should be able to run some samples:
Terminal.png node:
/usr/local/cuda-5.5/samples/1_Utilities/deviceQuery/deviceQuery
root@adonis-7:/tmp# /usr/local/cuda-5.5/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-5.5/samples/1_Utilities/deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla T10 Processor"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  (30) Multiprocessors, (  8) CUDA Cores/MP:     240 CUDA Cores
  GPU Clock rate:                                1296 MHz (1.30 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              512-bit
  Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Max dimension size of a thread block (x,y,z): (512, 512, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           12 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla T10 Processor"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  (30) Multiprocessors, (  8) CUDA Cores/MP:     240 CUDA Cores
  GPU Clock rate:                                1296 MHz (1.30 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              512-bit
  Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Max dimension size of a thread block (x,y,z): (512, 512, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           10 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 2, Device0 = Tesla T10 Processor, Device1 = Tesla T10 Processor
Result = PASS

Please note that the CUDA samples are described in the document named Samples.html located in the samples installation directory.
Do not forget to back up your environment:

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE` tgz-g5k > myimagewithcuda.tgz

Intel Xeon Phi (MIC) on Grid'5000

Reserve a Phi on Nancy site

Xeon Phi cards are extension boards embedded in regular compute nodes. To reserve a Grid'5000 node that includes a Xeon Phi, use this command:

Terminal.png frontend:
oarsub -I -p "MIC='YES'" -t allow_classic_ssh -t mic

Configuring the Intel compiler to use a licenses server

In order to run programs on the Intel Xeon Phi, you should compile them using the Intel compilers. An installation is available in /grid5000/compilers/icc13.2/ on the Nancy site.

The Intel compilers require a commercial license that is not provided by Grid'5000. To use them, you need access to a token (license) server that provides this license. You will then use a machine (your laptop, for instance) as a bridge between the license server and your Grid'5000 nodes.

Using licenses server

Create your licenses file (you may need a different port number):

Terminal.png frontend:
mkdir ~/intel
cat <<EOF >> ~/intel/licenses 
SERVER localhost ANY 28618
USE_SERVER
EOF

Then, create an SSH tunnel:

from the command line:

Terminal.png laptop:
ssh -R 28618:LICENSE_SERVER:28618 -R 28619:LICENSE_SERVER:28619 graphite-X.nancy.g5k

or from the SSH configuration (see [1]):

Host g5k
 Hostname access.grid5000.fr
 ...

Host *.intel
 User g5kuser
 ForwardAgent no
 RemoteForward *:28618 LICENSE_SERVER:28618
 RemoteForward *:28619 LICENSE_SERVER:28619
 ProxyCommand ssh g5k "nc -q 0 `basename %h .intel` %p"


Then connect to your node:

Terminal.png laptop:
ssh graphite-X.nancy.intel
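Once connected, you can optionally check that the forwarded license port is reachable from the node (assuming nc is installed there):

Terminal.png graphite:
nc -z localhost 28618 && echo "license server reachable"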

Using the Inria licenses server (if you are on an Inria network)

The Inria license server is jetons.inria.fr, ports 29030 and 34430.

Create your licenses file:

Terminal.png frontend:
mkdir ~/intel
cat <<EOF >> ~/intel/licenses 
SERVER localhost ANY 28618
USE_SERVER
EOF

Then, create an SSH tunnel:

from the command line:

Terminal.png laptop:
ssh -R 28618:jetons.inria.fr:29030 -R 34430:jetons.inria.fr:34430 graphite-X.nancy.g5k

or from the SSH configuration (see [2]):

Host g5k
 Hostname access.grid5000.fr
 ...

Host *.intel
 User g5kuser
 ForwardAgent no
 RemoteForward *:28618 jetons.inria.fr:29030
 RemoteForward *:34430 jetons.inria.fr:34430
 ProxyCommand ssh g5k "nc -q 0 `basename %h .intel` %p"


Then connect to your node:

Terminal.png laptop:
ssh graphite-X.nancy.intel

Execution on Xeon Phi

An introduction to the Phi programming environment is available on the Intel website.


Offload mode

In offload mode, your program is started on the node, but part of its execution is offloaded to the Phi.

Compile some source code:

Terminal.png graphite:
source /opt/intel/composerxe/bin/compilervars.sh intel64
Terminal.png graphite:
icpc -openmp /grid5000/xeonphi/samples/reduction.cpp -o reduction-offload

And execute it:

Terminal.png graphite:
./reduction-offload
Note.png Note

This section uses a code snippet from the Intel tutorial
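The number of OpenMP threads used on the Phi during the offloaded regions can usually be tuned through environment variables; a sketch, assuming the Intel offload runtime's MIC_ prefix convention (check the Intel documentation for your compiler version):

Terminal.png graphite:
export MIC_ENV_PREFIX=MIC
Terminal.png graphite:
MIC_OMP_NUM_THREADS=240 ./reduction-offload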

Native mode

In native mode, your program is executed directly on the Phi. You first need to connect to the Phi using SSH.

Compile some source code:

Terminal.png graphite:
source /grid5000/compilers/icc13.2/bin/compilervars.sh intel64
Terminal.png graphite:
icpc /grid5000/xeonphi/samples/hello.cpp -openmp -mmic -o hello-native

Log in to the Phi:

Terminal.png graphite:
ssh mic0
Note.png Note

Your home directory is available from inside the MIC (the /grid5000 directory as well).

And execute:

Terminal.png graphite-mic0:
source /grid5000/xeonphi/micenv
Terminal.png graphite-mic0:
./hello-native
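Once logged in, you can check how many logical cores the Phi exposes, using standard Linux tools:

Terminal.png graphite-mic0:
grep -c processor /proc/cpuinfo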

Use Xeon Phi from a "min" deployed environment

Reserve and deploy a node with MIC

Terminal.png frontend:
oarsub -I -p "MIC='YES'" -t deploy
Terminal.png frontend:
kadeploy3 -f $OAR_NODEFILE -k -e wheezy-x64-min

You can now log in to the node:

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE`

Install Xeon Phi drivers

MPSS drivers are available for wheezy in the Grid'5000 Debian repository; you just need to uncomment this line in /etc/apt/sources.list:

deb http://apt.grid5000.fr/debian sid main

Get the Grid'5000 keyring:

Terminal.png node:
apt-get update && apt-get install grid5000-keyring -y --force-yes

Install the MPSS packages:

Terminal.png node:
apt-get install mpss-modules-3.2.0-4-amd64 mpss-micmgmt mpss-miccheck mpss-coi mpss-mpm mpss-metadata mpss-miccheck-bin glibc2.12.2pkg-libsettings0 glibc2.12.2pkg-libmicmgmt0 libscif0 mpss-daemon mpss-boot-files mpss-sdk-k1om intel-composerxe-compat-k1om g++ -y --force-yes

Copy configuration files

Configuration files are available on the NFS server, so we must mount a partition to retrieve them.

Terminal.png node:
apt-get install nfs-common -y
Terminal.png node:
mkdir -p /grid5000 && mount nfs:/export/grid5000 /grid5000/

To mount /grid5000 automatically at boot, append the following line to /etc/fstab:

 nfs:/export/grid5000/ /grid5000/ nfs defaults 0 0

After that, we can copy the configuration files:

Terminal.png node:
cp /grid5000/xeonphi/conf/default.conf /etc/mpss/
Terminal.png node:
cp /grid5000/xeonphi/conf/mic0.conf /etc/mpss/
Terminal.png node:
cp /grid5000/xeonphi/conf/mpss /etc/init.d/
Terminal.png node:
cp /grid5000/xeonphi/conf/interfaces /etc/network

Start mpss on boot:

Terminal.png node:
update-rc.d mpss defaults

Load the mic module at boot:

Terminal.png node:
echo mic >> /etc/modules

If you want to mount /home and /grid5000 on the MIC, you should:

Add a new file /var/mpss/mic0/etc/fstab containing:

nfs:/export/home	/home	nfs		rsize=8192,wsize=8192,nolock,intr	0	0
nfs:/export/grid5000	/grid5000	nfs		rsize=8192,wsize=8192,nolock,intr	0	0

And add these lines to /var/mpss/mic0.filelist:

file /etc/fstab etc/fstab 644 0 0
dir /grid5000 755 0 0

Now we can reboot the machine:

Terminal.png node:
reboot

After the reboot, you can use the MIC as usual. See the sections above for more details.
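For instance, you can check that the card came up correctly with the MPSS management tools installed earlier (micinfo is provided by the mpss-micmgmt package; this assumes it is in the default PATH):

Terminal.png node:
micinfo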

Note.png Note

On a freshly deployed node, don't forget to recreate the ~/intel/licenses file, unless you manually mounted your frontend home directory on the node.