GPUs on Grid5000

Purpose

This page explains how to use GPUs on Grid'5000 and how to install your own NVIDIA drivers and CUDA toolkit.
In this tutorial, we will first compile and run the CUDA examples; in the second part, we will install the NVIDIA drivers and CUDA 5.5 starting from a plain wheezy-x64-base environment.

Prerequisites

  • A basic knowledge of Grid'5000 is required; we suggest you read the Getting Started tutorial first.
  • Information about the hardware and GPU availability can be found on Special:G5KHardware.

GPU on Grid'5000

Using CUDA

Download and compile examples

GPUs are available at the following sites:

  • Grenoble (adonis)
  • Lyon (orion)
  • Lille (chirloute)

NVIDIA driver 304.54 and CUDA 5.0 are installed by default on these nodes.
You can reserve a node with a GPU using the OAR GPU property. For Grenoble and Lyon:

Terminal.png frontend:
oarsub -I -p "GPU='YES'"

or for Lille:

Terminal.png lille:
oarsub -I -p "GPU='SHARED'"
Warning.png Warning

Please note that the Lille GPUs are shared between nodes, so the oarsub command differs from the one used at Grenoble or Lyon. You may also encounter some issues when using GPUs at Lille; you can read more about them on Lille:GPU.
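If you prefer a batch job to an interactive one, the same OAR property can be combined with a command to run; a minimal sketch (my_gpu_job.sh is a placeholder for your own script):

Terminal.png frontend:
oarsub -p "GPU='YES'" -l /nodes=1,walltime=2 "./my_gpu_job.sh"  # my_gpu_job.sh is hypothetical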

We will then download the CUDA 5.0 samples installer from NVIDIA and install it.

Terminal.png node:
sh cuda-samples_5.0.35_linux.run -cudaprefix=/usr/local/cuda-5.0/

You will be prompted to accept the EULA and to choose an installation path; we suggest /tmp/samples/.

You can then go to the installation path, in our case /tmp/samples/. If you list the directory, you will see many folders whose names start with a number: these are the CUDA examples. They are described in the document named Samples.html. You might also want to have a look at the doc directory.

We will now compile the examples; this takes a little while. From the CUDA samples installation directory (/tmp/samples), run make:

Terminal.png node:
make
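If you only need one example, you can also build it alone from its own directory (each sample ships its own Makefile), which is much faster than building them all:

Terminal.png node:
cd /tmp/samples/1_Utilities/deviceQuery && make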

The process is complete when "Finished building CUDA samples" is printed. You should then be able to run the CUDA examples. We will try the one named deviceQuery, located in /tmp/samples/1_Utilities/deviceQuery/, which enumerates the properties of the CUDA devices present in the system.

Terminal.png node:
/tmp/samples/1_Utilities/deviceQuery/deviceQuery

This is an example of the result on the adonis cluster at Grenoble:

ebertoncello@adonis-2:/tmp/samples$ ./1_Utilities/deviceQuery/deviceQuery 
1_Utilities/deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla T10 Processor"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    1.3
 Total amount of global memory:                 4096 MBytes (4294770688 bytes)
 (30) Multiprocessors x (  8) CUDA Cores/MP:    240 CUDA Cores
 GPU Clock rate:                                1296 MHz (1.30 GHz)
 Memory Clock rate:                             800 Mhz
 Memory Bus Width:                              512-bit
 Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
 Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       16384 bytes
 Total number of registers available per block: 16384
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  1024
 Maximum number of threads per block:           512
 Maximum sizes of each dimension of a block:    512 x 512 x 64
 Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             256 bytes
 Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Disabled
 Device supports Unified Addressing (UVA):      No
 Device PCI Bus ID / PCI location ID:           12 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla T10 Processor"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    1.3
 Total amount of global memory:                 4096 MBytes (4294770688 bytes)
 (30) Multiprocessors x (  8) CUDA Cores/MP:    240 CUDA Cores
 GPU Clock rate:                                1296 MHz (1.30 GHz)
 Memory Clock rate:                             800 Mhz
 Memory Bus Width:                              512-bit
 Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
 Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       16384 bytes
 Total number of registers available per block: 16384
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  1024
 Maximum number of threads per block:           512
 Maximum sizes of each dimension of a block:    512 x 512 x 64
 Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             256 bytes
 Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Disabled
 Device supports Unified Addressing (UVA):      No
 Device PCI Bus ID / PCI location ID:           10 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 2, Device0 = Tesla T10 Processor, Device1 = Tesla T10 Processor
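As an extra end-to-end check, independent of the bundled samples, you can compile a trivial kernel yourself with nvcc. The following is a minimal sketch, assuming nvcc sits in the default CUDA 5.0 location; -arch=sm_13 matches the compute capability 1.3 of the Tesla T10, so adjust it for other GPUs:

Terminal.png node:
cat > /tmp/inc.cu <<'EOF'
#include <cstdio>
#include <cuda_runtime.h>

// Each thread increments one array element.
__global__ void inc(int *a) { a[threadIdx.x] += 1; }

int main() {
    const int n = 32;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = i;
    int *d;
    cudaMalloc((void **)&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    inc<<<1, n>>>(d);  // one block of n threads
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[0]=%d, h[%d]=%d (expected 1 and %d)\n", h[0], n - 1, h[n - 1], n);
    return 0;
}
EOF
/usr/local/cuda-5.0/bin/nvcc -arch=sm_13 /tmp/inc.cu -o /tmp/inc && /tmp/inc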

Install CUDA from a base environment

Reservation and deployment

We will now install an up-to-date NVIDIA driver and an up-to-date CUDA release (5.5). First reserve a node, then deploy wheezy-x64-base on it.

Terminal.png frontend:
oarsub -I -t deploy -p "GPU='YES'" -l /nodes=1,walltime=2
Terminal.png frontend:
kadeploy3 -f $OAR_NODE_FILE -e wheezy-x64-base -k


Once the deployment has finished, you should be able to connect to the node and download the CUDA 5.5 installer to /tmp.

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE`
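For example, with wget through the site HTTP proxy (the NVIDIA download URL below is the one the 5.5 installer was published under; adjust it if the file has moved):

Terminal.png node:
cd /tmp && http_proxy=http://proxy.site.grid5000.fr:3128 wget http://developer.download.nvidia.com/compute/cuda/5_5/rel/installers/cuda_5.5.22_linux_64.run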


Once the download is complete, you can list the installation options of the CUDA installer:

Terminal.png node:
sh /tmp/cuda_5.5.22_linux_64.run --help
root@adonis-2:/tmp# sh cuda_5.5.22_linux_64.run --help
Options:
   -help                      : Print help message
   -driver                    : Install NVIDIA Display Driver
   -uninstall                 : Uninstall NVIDIA Display Driver
   -toolkit                   : Install CUDA 5.5 Toolkit (default: /usr/local/cuda-5.5)
   -toolkitpath=<PATH>        : Specify a custom path for CUDA location
   -samples                   : Install CUDA 5.5 Samples (default: /usr/local/cuda-5.5/samples)
   -samplespath=<PATH>        : Specify a custom path for Samples location
   -silent                    : Run in silent mode. Implies acceptance of the EULA
   -verbose                   : Run in verbose mode
   -extract=<PATH>            : Extract individual installers from the .run file to PATH
   -optimus                   : Install driver support for Optimus
   -override                  : Overrides the installation checks (compiler, lib, etc)
   -kernel-source-path=<PATH> : Points to a non-default kernel source location
   -tmpdir <PATH>             : Use <PATH> as temporary directory - useful when /tmp is noexec

We can extract the individual installers from this bundle:

Terminal.png node:
sh /tmp/cuda_5.5.22_linux_64.run -extract=/tmp && cd /tmp


This extracts three installers:

  • NVIDIA-Linux-x86_64-319.37.run: NVIDIA drivers
  • cuda-linux64-rel-5.5.22-16488124.run: CUDA installer
  • cuda-samples-linux-5.5.22-16488124.run: CUDA samples installer


Each of them also has its own installation options:

Terminal.png node:
sh NVIDIA-Linux-x86_64-319.37.run --help
Terminal.png node:
sh cuda-linux64-rel-5.5.22-16488124.run --help
Terminal.png node:
sh cuda-samples-linux-5.5.22-16488124.run --help


Installation

We are now ready to install the NVIDIA driver and CUDA 5.5. We use gcc-4.6, which is required for this installation.

Terminal.png node:
CC=/usr/bin/gcc-4.6 sh NVIDIA-Linux-x86_64-319.37.run --accept-license --silent --disable-nouveau -X --kernel-name=`uname -r`
Terminal.png node:
CC=/usr/bin/gcc-4.6 sh cuda-linux64-rel-5.5.22-16488124.run -noprompt


The NVIDIA driver and CUDA 5.5 are now installed, but some further configuration is needed: add /usr/local/cuda-5.5/bin to $PATH, and add /usr/local/cuda-5.5/lib64 and /usr/local/cuda-5.5/lib to $LD_LIBRARY_PATH.

Terminal.png node:
export PATH=$PATH:/usr/local/cuda-5.5/bin
Terminal.png node:
export LD_LIBRARY_PATH=/usr/local/cuda-5.5/lib64:/usr/local/cuda-5.5/lib:$LD_LIBRARY_PATH

These changes are not permanent: they will be lost as soon as you disconnect from your SSH session. To make them permanent, edit /etc/profile and create a file under the /etc/ld.so.conf.d/ directory:

Terminal.png node:
sed -e "s/:\/bin/:\/bin:\/usr\/local\/cuda-5.5\/bin/" -i /etc/profile
Terminal.png node:
echo -e "/usr/local/cuda-5.5/lib\n/usr/local/cuda-5.5/lib64" > /etc/ld.so.conf.d/cuda.conf && ldconfig
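You can quickly verify that both settings are in effect (in a fresh login shell, so that /etc/profile has been re-read):

Terminal.png node:
which nvcc && ldconfig -p | grep libcudart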


We have almost finished the installation. The last step is to add a script that creates the GPU device files, and to run it at start-up.

  • Create a file named /usr/local/bin/enable-gpu and copy the following script into it.
#!/bin/bash
# Load the NVIDIA kernel module and create the device files
# (/dev/nvidia0..N and /dev/nvidiactl) that CUDA applications need.

if /sbin/modprobe nvidia; then
  # Count the NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  # Create one device file per GPU, plus the control device.
  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
  • Give execution rights to the script file:
Terminal.png node:
chmod +x /usr/local/bin/enable-gpu
  • Finally, add a line in the /etc/rc.local file to execute your script at boot time.
Terminal.png node:
sed -e "s/exit 0//" -i /etc/rc.local; echo -e "#Enable GPUs at boot time\nsh /usr/local/bin/enable-gpu\n\nexit 0" >> /etc/rc.local


Check installation

To check that the NVIDIA driver is correctly installed, you can use the nvidia-smi tool.

Terminal.png node:
nvidia-smi

This is an example of the result on the adonis cluster:

root@adonis-2:~# nvidia-smi 
Wed Dec  4 14:42:08 2013       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T10 Proce...  Off  | 0000:0A:00.0     N/A |                  N/A |
| N/A   36C  N/A     N/A /  N/A |        3MB /  4095MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T10 Proce...  Off  | 0000:0C:00.0     N/A |                  N/A |
| N/A   36C  N/A     N/A /  N/A |        3MB /  4095MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+


To check the CUDA installation, we will compile the CUDA examples.

  • Install the dependencies:
Terminal.png node:
http_proxy=http://proxy.site.grid5000.fr:3128 apt-get update && apt-get -y install freeglut3-dev libxmu-dev libxi-dev
  • Unpack the samples and compile them:
Terminal.png node:
sh cuda-samples-linux-5.5.22-16488124.run -cudaprefix=/usr/local/cuda-5.5

You will be prompted to accept the EULA and to choose an installation path. In this example we install the samples in /usr/local/cuda-5.5/samples.

  • To compile the CUDA samples, go to the installation directory and type make.
  • Once the compilation is over, you should be able to run the samples:
Terminal.png node:
/usr/local/cuda-5.5/samples/1_Utilities/deviceQuery/deviceQuery
root@adonis-7:/tmp# /usr/local/cuda-5.5/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-5.5/samples/1_Utilities/deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla T10 Processor"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  (30) Multiprocessors, (  8) CUDA Cores/MP:     240 CUDA Cores
  GPU Clock rate:                                1296 MHz (1.30 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              512-bit
  Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Max dimension size of a thread block (x,y,z): (512, 512, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           12 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla T10 Processor"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  (30) Multiprocessors, (  8) CUDA Cores/MP:     240 CUDA Cores
  GPU Clock rate:                                1296 MHz (1.30 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              512-bit
  Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Max dimension size of a thread block (x,y,z): (512, 512, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           10 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 2, Device0 = Tesla T10 Processor, Device1 = Tesla T10 Processor
Result = PASS

Please note that the CUDA samples are described in the document named Samples.html, located in the samples installation directory.
Do not forget to back up your environment:

Terminal.png node:
ssh root@`head -1 $OAR_NODE_FILE` tgz-g5k > myimagewithcuda.tgz

Using OpenCL

Intel MIC on Grid'5000

Reserve a MIC on the Nancy site

Terminal.png frontend:
oarsub -I -p "MIC='YES'"

Configure Intel compiler

In order to run programs on the Intel Xeon Phi, you should compile them with the Intel compiler. An installation is available in /grid5000/compilers/icc13.2/ on the Nancy site. You need a machine that can reach both the license token server and the Grid'5000 nodes. If you have one, create a licenses file (adapt it to your server):

Terminal.png frontend:
mkdir ~/intel
cat <<EOF >> ~/intel/licenses 
 SERVER license_server ANY 28618
 USE_SERVER
EOF

Then you should set up reverse SSH tunnels to the license server.

With two reverse port-forwarding commands:

Terminal.png laptop:
ssh -R 28618:license_server:28618 graphite-X.nancy.g5k
Terminal.png laptop:
ssh -R 28619:license_server:28619 graphite-X.nancy.g5k

Or via an SSH configuration:

Host accessuser
 Hostname access.grid5000.fr
 User g5kuser
 ForwardAgent no
 Port 22
 IdentityFile ~/.ssh/id_rsa

Host *.intel
 User g5kuser
 ForwardAgent no
 RemoteForward *:28618 license_server:28618
 RemoteForward *:28619 license_server:28619
 ProxyCommand ssh accessuser "nc -q 0 `basename %h .intel` %p"
 IdentityFile ~/.ssh/id_rsa

Then connect to the node:

Terminal.png laptop:
ssh graphite-X.nancy.intel
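From the node, you can quickly check that the forwarded license port answers (this assumes nc is installed on the node):

Terminal.png graphite:
nc -z localhost 28618 && echo "license port reachable"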

Note that the ports may differ: for example, the Inria license server is jetons.inria.fr and its ports are 29030 and 34430.

Compile on host

Tutorial from: http://www.hpc.cineca.it/content/quick-guide-intel-mic-usage

With offload

A simple example can be found here: http://www.hpc.cineca.it/content/hellooffloadcpp
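In case that link becomes unavailable, a minimal offload program of the same shape might look like this (a sketch using Intel's offload pragma, not the exact CINECA file):

Terminal.png graphite:
cat > hello_offload.cpp <<'EOF'
#include <cstdio>
#include <omp.h>

int main() {
    printf("Hello from the host CPU\n");
    // The Intel compiler ships this block to the MIC card at run time.
    #pragma offload target(mic)
    {
#ifdef __MIC__
        printf("Hello from the MIC, %d threads available\n", omp_get_max_threads());
#else
        printf("Offload section ran on the host (no MIC found)\n");
#endif
    }
    return 0;
}
EOF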

Terminal.png graphite:
source /grid5000/compilers/icc13.2/bin/compilervars.sh intel64

You can compile with:

Terminal.png graphite:
icpc -openmp hello_offload.cpp -o exe-offload

And execute it (an offload binary runs on the host; the marked sections are shipped to the MIC at run time):

Terminal.png graphite:
./exe-offload

Native

A simple example can be found here: http://www.hpc.cineca.it/content/hellonativecpp
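If that link is unavailable, a minimal native OpenMP program might look like this (again a sketch, not the exact CINECA file):

Terminal.png graphite:
cat > hello_native.cpp <<'EOF'
#include <cstdio>
#include <omp.h>

int main() {
    // Built with -mmic, this runs entirely on the MIC.
    #pragma omp parallel
    {
        #pragma omp critical
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
EOF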

Terminal.png graphite:
source /grid5000/compilers/icc13.2/bin/compilervars.sh intel64

You can compile with:

Terminal.png graphite:
icpc hello_native.cpp -openmp -mmic -o exe-native

Log in to the MIC:

Terminal.png graphite:
ssh mic0

Set LD_LIBRARY_PATH:

Terminal.png graphite-mic0:
export LD_LIBRARY_PATH=/grid5000/compilers/icc13.2/composer_xe_2013/mkl/lib/mic:${LD_LIBRARY_PATH}
Terminal.png graphite-mic0:
export LD_LIBRARY_PATH=/grid5000/compilers/icc13.2/composer_xe_2013/tbb/lib/mic:${LD_LIBRARY_PATH}
Terminal.png graphite-mic0:
export LD_LIBRARY_PATH=/grid5000/compilers/icc13.2/composer_xe_2013/lib/mic:${LD_LIBRARY_PATH}

And execute:

Terminal.png graphite-mic0:
./exe-native

Install MIC from a min environment