GPUs on Grid5000
{{Maintainer|Simon Delamare}}
{{Author|Elodie Bertoncello}}
{{Author|Emile Morel}}
{{Author|Jérémie Gaidamour}}
{{Portal|User}}
= Introduction =
This tutorial presents how to use GPU accelerators and Intel Xeon Phi coprocessors on Grid'5000. You will learn to reserve these resources, set up the environment and execute code on the accelerators. The dense matrix-matrix multiplication example of this tutorial can be used as a toy benchmark to compare the performance of accelerators and/or [https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms BLAS] implementations. Please note that this page is not about GPU or Xeon Phi programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is useful either for testing the latest version of a driver or for ensuring the reproducibility of your experiments (by freezing its context).
This tutorial is divided into two distinct parts that can be done in any order:
* [[#GPU accelerators on Grid'5000]]
* [[#Intel Xeon Phi (MIC) on Grid'5000]]
For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the [[Getting Started]] tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment).
[[Special:G5KHardware]] is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see [[Status]]).
= GPU accelerators on Grid'5000 =
In this section, we first reserve a GPU node. We then compile and execute examples provided by the CUDA Toolkit on the default (production) environment. We also run our [https://developer.nvidia.com/cublas CUBLAS] example to illustrate GPU performance for dense matrix multiply (see also the [http://docs.nvidia.com/cuda/cuda-samples/index.html#matrix-multiplication--cublas- matrixMulCUBLAS] sample of the toolkit). Finally, we deploy a jessie-x64-base environment and install the NVIDIA drivers and compilers before validating the installation on the previous example set.
== Selection of GPU nodes ==
You can reserve a GPU node by simply requesting resources with the OAR "GPU" property:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "GPU='YES'"}} | |||
At Lille (on chirloute), you have to use the GPU='SHARED' property instead:
{{Term|location=lille|cmd=<code class="command">oarsub</code> -I -p "GPU='SHARED'"}} | |||
The reason is that those GPUs share enclosures in groups of four and can only be rebooted in groups. You may encounter some difficulties on those shared GPUs: if <code class="command">nvidia-smi</code> -q does not detect the GPU on your node, you can find troubleshooting information on [[Lille:GPU|this page]].
At Nancy, you have to use the production queue to get resources from graphique (and you also have to comply with the [[UserCharter|usage policy]] of production resources):
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "GPU='YES'" -q production}} | |||
NVIDIA drivers 346.22 (see <code class="command">nvidia-smi</code>) and CUDA 7.0 (<code class="command">nvcc --version</code>) compilation tools are installed by default on the nodes. This version of the drivers only supports the most recent GPU accelerators, such as the GPUs installed on orion (Lyon) and graphique (in the [[Nancy:Production|production queue]] of Nancy).
You can use the GPU accelerators of adonis (Grenoble) and chirloute (Lille) with the NVIDIA 340.xx legacy drivers on deployed environments. You can deploy the ready-to-use wheezy-x64-prod environment or see [[#Installing the CUDA toolkit on a deployed environment]] for using those GPUs with Debian Jessie.
== Downloading the CUDA Toolkit examples ==
We download the CUDA 7.0 samples and extract them to /tmp/samples:
{{Term|location=node|cmd=cd /tmp; <code class="command">wget</code> http://git.grid5000.fr/sources/cuda-samples-linux-7.0.28-19326674.run}} | |||
{{Term|location=node|cmd=<code class="command">sh</code> cuda-samples-linux-7.0.28-19326674.run -noprompt -prefix=/tmp/samples}} | |||
{{Note|text=These samples are part of the [https://developer.nvidia.com/cuda-toolkit-70 CUDA 7.0 Toolkit] and can also be extracted from the toolkit installer using the ''--extract=/path'' option.}} | |||
{{Note|text=On adonis and chirloute, install the CUDA 5.0 samples (cuda-samples_5.0.35_linux.run).}}
The CUDA examples are described in <code class="file">/tmp/samples/Samples.html</code>. You might also want to have a look at the <code class="file">doc</code> directory or the [http://docs.nvidia.com/cuda/cuda-samples/index.html#getting-started-with-cuda-samples online documentation].
== Compiling the CUDA Toolkit examples ==
You can compile all the examples at once, but it will take a while. From the CUDA samples source directory (<code class="file">/tmp/samples</code>), run make to compile the examples:
{{Term|location=node|cmd=<code class="command">cd /tmp/samples</code>}} | {{Term|location=node|cmd=<code class="command">cd /tmp/samples</code>}} | ||
{{Term|location=node|cmd=<code class="command">make -j8</code>}} | {{Term|location=node|cmd=<code class="command">make -j8</code>}} | ||
The compilation of all the examples is over when "Finished building CUDA samples" is printed. Alternatively, each example can also be compiled separately from its own directory.
You can first try the <code class="file">Device Query</code> example located in <code class="file">/tmp/samples/1_Utilities/deviceQuery/</code>. It enumerates the properties of the CUDA devices present in the system.
{{Term|location=node|cmd=<code class="command">/tmp/samples/1_Utilities/deviceQuery/deviceQuery</code>}} | {{Term|location=node|cmd=<code class="command">/tmp/samples/1_Utilities/deviceQuery/deviceQuery</code>}} | ||
Here is an example of the result on the orion cluster at Lyon:
<code>
 orion-2:/tmp/samples/1_Utilities/deviceQuery/deviceQuery
 /tmp/samples/1_Utilities/deviceQuery/deviceQuery Starting...
 
  CUDA Device Query (Runtime API) version (CUDART static linking)
 
 Detected 1 CUDA Capable device(s)
 
 Device 0: "Tesla M2075"
   CUDA Driver Version / Runtime Version          7.0 / 7.0
   CUDA Capability Major/Minor version number:    2.0
   Total amount of global memory:                 5375 MBytes (5636554752 bytes)
   (14) Multiprocessors, ( 32) CUDA Cores/MP:     448 CUDA Cores
   GPU Max Clock rate:                            1147 MHz (1.15 GHz)
   Memory Clock rate:                             1566 Mhz
   Memory Bus Width:                              384-bit
   L2 Cache Size:                                 786432 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 32768
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  1536
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z):  (65535, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
   Run time limit on kernels:                     No
   Integrated GPU sharing Host Memory:            No
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Enabled
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Domain ID / Bus ID / location ID:   0 / 66 / 0
   Compute Mode:
      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
 
 deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Tesla M2075
 Result = PASS
</code>
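The deviceQuery example is built on the CUDA runtime device-management API. As an illustration, here is a minimal sketch (not part of the toolkit samples; the file name is arbitrary) that prints a few of the same properties. It can be compiled on the node with, e.g., <code class="command">nvcc query.c -o query</code>:
<code>
 #include <stdio.h>
 #include <cuda_runtime.h>
 
 int main(void)
 {
     int count = 0, dev;
     if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
         fprintf(stderr, "No CUDA capable device detected\n");
         return 1;
     }
     for (dev = 0; dev < count; dev++) {
         struct cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, dev); /* fills a struct describing the device */
         printf("Device %d: \"%s\"\n", dev, prop.name);
         printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
         printf("  Global memory:      %lu MBytes\n",
                (unsigned long)(prop.totalGlobalMem >> 20));
         printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
     }
     return 0;
 }
</code>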
== BLAS examples ==
The toolkit provides the [https://developer.nvidia.com/cublas CUBLAS] library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available [http://docs.nvidia.com/cuda/cublas/index.html here] and several [http://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries advanced examples] using CUBLAS are also available (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...) in the toolkit distribution.
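For reference, the general pattern of a CUBLAS program such as simpleCUBLAS is: allocate matrices on the GPU, copy the inputs, call the BLAS routine, copy the result back. The following is an illustrative sketch using the CUBLAS v2 API (it is not the actual simpleCUBLAS source); it can be compiled with <code class="command">nvcc sgemm.c -lcublas</code> (the file name is arbitrary):
<code>
 #include <stdio.h>
 #include <stdlib.h>
 #include <cuda_runtime.h>
 #include <cublas_v2.h>
 
 int main(void)
 {
     const int n = 1024; /* C = A x B with n x n single-precision matrices */
     const float alpha = 1.0f, beta = 0.0f;
     size_t bytes = (size_t)n * n * sizeof(float);
     float *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
     float *dA, *dB, *dC;
     cublasHandle_t handle;
     int i;
 
     for (i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; }
 
     cudaMalloc((void **)&dA, bytes); /* GPU-allocated arrays */
     cudaMalloc((void **)&dB, bytes);
     cudaMalloc((void **)&dC, bytes);
 
     cublasCreate(&handle);
     cublasSetMatrix(n, n, sizeof(float), A, n, dA, n); /* host -> GPU */
     cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);
     cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, dA, n, dB, n, &beta, dC, n);   /* runs on the GPU */
     cublasGetMatrix(n, n, sizeof(float), dC, n, C, n); /* GPU -> host */
 
     printf("C[0][0] = %g (expected %d)\n", C[0], 2 * n);
     cublasDestroy(handle);
     cudaFree(dA); cudaFree(dB); cudaFree(dC);
     free(A); free(B); free(C);
     return 0;
 }
</code>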
The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides [http://docs.nvidia.com/cuda/nvblas/ NVBLAS], a library that automatically ''offloads'' compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program, as NVBLAS can be [http://www.manpages.info/linux/ld.so.8.html forcibly linked] using the LD_PRELOAD environment variable.
You can run the precompiled simpleCUBLAS example with:
{{Term|location=node|cmd=<code class="command">/tmp/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS</code>}}
To test NVBLAS, you can use the matrix-matrix multiplication example available in <code class="file">/grid5000/xeonphi/samples/matmatmul/</code>. To compile it:
<code>
 cp -r /grid5000/xeonphi/samples/matmatmul/ /tmp/
 cd /tmp/matmatmul
 make
</code>
To run on the CPU, use:
<code>
 orion-2: ./matmatmul
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS - Time elapsed: 2.672E+01 sec.
</code>
To offload the computation to the GPU, use:
<code>
 orion-2: cd /tmp/matmatmul
 orion-2: echo "NVBLAS_CPU_BLAS_LIB /usr/lib/libblas/libblas.so" > nvblas.conf
 orion-2: LD_PRELOAD=libnvblas.so ./matmatmul
 [NVBLAS] Config parsed
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS - Time elapsed: 1.716E+00 sec.
</code>
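NVBLAS works here because matmatmul calls a standard BLAS routine through the regular host interface, which libnvblas.so intercepts. A minimal stand-alone program in the same spirit (a hypothetical sketch, not the actual matmatmul source) would be the following; link it against the reference BLAS, e.g. <code class="command">gcc gemm.c -o gemm -lblas</code>:
<code>
 #include <stdio.h>
 #include <stdlib.h>
 
 /* Standard Fortran BLAS symbol; libnvblas.so replaces it when preloaded */
 extern void dgemm_(const char *transa, const char *transb,
                    const int *m, const int *n, const int *k,
                    const double *alpha, const double *A, const int *lda,
                    const double *B, const int *ldb,
                    const double *beta, double *C, const int *ldc);
 
 int main(void)
 {
     const int n = 5000;
     const double alpha = 1.0, beta = 0.0;
     double *A = malloc((size_t)n * n * sizeof(double));
     double *B = malloc((size_t)n * n * sizeof(double));
     double *C = malloc((size_t)n * n * sizeof(double));
     size_t i;
 
     for (i = 0; i < (size_t)n * n; i++) { A[i] = 1.0; B[i] = 1.0; }
 
     /* C = A x B; with LD_PRELOAD=libnvblas.so this call is offloaded to the GPU */
     dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
 
     printf("C[0][0] = %g (expected %d)\n", C[0], n);
     free(A); free(B); free(C);
     return 0;
 }
</code>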
If you want to measure the time spent on data transfers to the GPU, you can instrument the simpleCUBLAS example with timers and compare its results with those above.
=== Installing the CUDA toolkit on a deployed environment ===
GPU nodes at Lyon and Nancy are supported by the latest GPU drivers. For Lille and Grenoble, you have to install the NVIDIA 340.xx legacy drivers. The following table summarizes the situation on Grid'5000 as of January 2016:
{| class="program" style="border:1px dotted black;" | |||
! Site | |||
! Cluster | |||
! GPU | |||
! OAR property | |||
! Driver version | |||
! CUDA toolkit version | |||
|- | |||
| Lyon | |||
| orion (4 nodes) | |||
| Nvidia Tesla-M2075 (1 per node) | |||
| -p "GPU='YES'" | |||
| 346.xx (jessie), 352.xx | |||
| CUDA 7.0 (jessie), 7.5 | |||
|- | |||
| Nancy | |||
| graphique (6 nodes) | |||
| Nvidia GTX 980 GPU (2 per node) | |||
| -p "GPU='YES'" -q production | |||
| 346.xx (jessie), 352.xx | |||
| CUDA 7.0 (jessie), 7.5 | |||
|- | |||
| Grenoble | |||
| adonis (10 nodes) | |||
| Nvidia Tesla-C1060 (2 per node) | |||
| -p "GPU='YES'" | |||
| 340.xx | |||
| CUDA 6.5 | |||
|- | |||
| Lille | |||
| chirloute (8 nodes) | |||
| Nvidia Tesla-S2050 (1 per node) | |||
| -p "GPU='SHARED'" | |||
| 340.xx | |||
| CUDA 6.5 | |||
|} | |||
==== Deployment ====
First, reserve a GPU node and deploy the <code class="file">jessie-x64-base</code> environment:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "GPU='<code class="replace">YES</code>'" -l /nodes=1,walltime=2}} | {{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "GPU='<code class="replace">YES</code>'" -l /nodes=1,walltime=2}} | ||
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODE_FILE -e | {{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODE_FILE -e jessie-x64-base -k}} | ||
Once the deployment is terminated, you should be able to connect to the node as root:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE`}}
==== Downloading the NVIDIA toolkit ====
We will now install the NVIDIA drivers, compilers, libraries and examples. The complete CUDA distribution can be downloaded from [https://developer.nvidia.com/cuda-toolkit-archive the official website] or from git.grid5000.fr/sources/. Select a toolkit version compatible with your GPU hardware:
<code>
 cd /tmp/; wget http://git.grid5000.fr/sources/cuda_7.5.18_linux.run
 cd /tmp/; wget http://git.grid5000.fr/sources/cuda_7.0.28_linux.run
 cd /tmp/; wget http://git.grid5000.fr/sources/cuda_6.5.14_linux_64.run # chirloute, adonis
</code>
When the download is over, you can look at the installer options:
{{Term|location=node|cmd=<code class="command">sh</code> /tmp/cuda_<version>.run --help}} | |||
There are actually three distinct installers (for the drivers, the compilers and the examples) embedded in this file, and you can extract them using:
{{Term|location=node|cmd=<code class="command">sh</code> /tmp/cuda_<version>.run -extract=/tmp/installers && <code class="command">cd</code> /tmp}}
It extracts 3 files. For example, with cuda_7.5.18_linux.run, you obtain:
* NVIDIA-Linux-x86_64-352.39.run: the drivers installer (version 352.39)
* cuda-linux64-rel-7.5.18-19867135.run: the CUDA toolkit installer (i.e. compilers and libraries)
* cuda-samples-linux-7.5.18-19867135.run: the CUDA samples installer
Each installer provides a --help option.
==== Driver installation ====
To install the Linux driver (i.e. the kernel module), we need the kernel header files and gcc 4.8 (the module should be compiled with the same version of gcc that was used to compile the kernel in the first place):
<code>
 apt-get -y update && apt-get -y upgrade
 apt-get -y install make
 apt-get -y install linux-headers-amd64 # it also installs gcc-4.8
</code>
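If in doubt, you can check which gcc version was used to build the running kernel (it is reported in /proc/version) and match it in the CC variable below:
{{Term|location=node|cmd=<code class="command">cat</code> /proc/version}}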
To compile and install the kernel module, use:
<code>
 cd /tmp/installers
 CC=gcc-4.8 sh NVIDIA-Linux-x86_64-<version>.run --accept-license --silent --no-install-compat32-libs # note: do not use --no-install-compat32-libs with CUDA 6.5
</code>
(Warnings about X.Org can safely be ignored.)
To install the CUDA toolkit, use:
{{Term|location=node|cmd=<code class="command">sh</code> cuda-linux64-rel-<version>.run -noprompt}} | |||
You can add the CUDA toolkit to your current shell environment by using:
<code>
 export PATH=$PATH:/usr/local/cuda-<version>/bin
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-<version>/lib64
</code>
To make those environment variables permanent for future SSH sessions, you can add them to ~/.bashrc. Alternatively, you can edit the default PATH in <code class="file">/etc/profile</code> and add a configuration file for the dynamic linker under <code class="file">/etc/ld.so.conf.d/</code> as follows:
{{Term|location=node|cmd=<code class="command">sed</code> -e "s/:\/bin/:\/bin:\/usr\/local\/cuda-<version>\/bin/" -i /etc/profile}} | |||
{{Term|location=node|cmd=<code class="command">echo</code> -e "/usr/local/cuda-5.5/lib\n/usr/local/cuda-<version>/lib64" > /etc/ld.so.conf.d/cuda.conf}} | |||
You also need to run <code class="command">ldconfig</code> as root to update the linker configuration.
To check if the NVIDIA drivers are correctly installed, you can use the nvidia-smi tool:
{{Term|location=node|cmd=<code class="command">nvidia-smi</code>}}
Here is an example of the result on the adonis cluster:
<code>
 root@adonis-2:~# nvidia-smi
 Wed Dec  4 14:42:08 2013
 +------------------------------------------------------+
 | NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |===============================+======================+======================|
 |   0  Tesla T10 Proce...  Off  | 0000:0A:00.0     N/A |                  N/A |
 | N/A   36C  N/A     N/A /  N/A |     3MB /  4095MB    |     N/A      Default |
 +-------------------------------+----------------------+----------------------+
 |   1  Tesla T10 Proce...  Off  | 0000:0C:00.0     N/A |                  N/A |
 | N/A   36C  N/A     N/A /  N/A |     3MB /  4095MB    |     N/A      Default |
 +-------------------------------+----------------------+----------------------+
</code>
Then, you can compile and run the toolkit examples. You need the g++ compiler to do so:
<code>
 apt-get install g++
 sh cuda-samples-linux-7.5.18-19867135.run -noprompt -prefix=/tmp/samples -cudaprefix=/usr/local/cuda-7.5/
 cd /tmp/samples
 make -j8
</code>
See [[#Compiling the CUDA Toolkit examples]] for more information.
You can save your newly created environment with [[TGZ-G5K|tgz-g5k]]:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_FILE_NODE` tgz-g5k > <code class="replace">myimagewithcuda</code>.tgz}} | {{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_FILE_NODE` tgz-g5k > <code class="replace">myimagewithcuda</code>.tgz}} | ||
= Intel Xeon Phi (MIC) on Grid'5000 =
== Reserve a Xeon Phi at Nancy ==
Like NVIDIA GPUs, [https://en.wikipedia.org/wiki/Xeon_Phi Xeon Phi] coprocessor cards provide additional compute power and can be used to offload computations. As those extension cards run a modified Linux kernel, it is also possible to log in directly to the Xeon Phi via SSH. It is also possible to compile applications for the Xeon Phi processor (which is based on x86 technology) and run them natively on the embedded Linux system of the Xeon Phi card. Xeon Phi [http://ark.intel.com/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core 7120P] cards are available at Nancy.
To reserve a Grid'5000 node that includes a Xeon Phi, you can use this command:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "MIC='YES'" -t allow_classic_ssh -t mic}} | {{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -p "MIC='YES'" -t allow_classic_ssh -t mic}} | ||
== Configuring the Intel compiler to use a license server ==
In order to compile programs for the Intel Xeon Phi, you have to use the Intel compilers. Intel compilers are available in /grid5000/compilers/icc13.2/ at Nancy, but they require a commercial (or academic) license, and such licenses are not provided by Grid'5000. You might have access to a license server at your local laboratory. For instance, you have access to the Inria license server if you are on the Inria network. You can then use your machine as a bridge between the license server of your local network and your Grid'5000 nodes by creating an SSH tunnel. The procedure is explained below. Alternatively, you can compile your programs elsewhere and copy your executables to Grid'5000.
Note that [https://gcc.gnu.org/wiki/Offloading GCC] and [http://openmp.llvm.org/ Clang] also provide limited support for the newest Xeon Phi. See [https://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm this page] for more information about third-party tools.
=== Using a license server ===
In the following, we will set up an SSH tunnel between a license server and a Grid'5000 node (graphite-X). The Intel compilers will be configured to use localhost:28618 as the license server, and the SSH tunnel will forward connections from localhost:28618 to the license server (you can use any local port number for this). In the following, we use the Inria license server named '''jetons.inria.fr''', ports '''29030''' and '''34430'''.
On the Nancy frontend, create a license configuration file for the Intel compilers:
{{Term|location=frontend|cmd=<code class="command">mkdir</code> ~/intel}} | {{Term|location=frontend|cmd=<code class="command">mkdir</code> ~/intel}} | ||
<code>
 cat <<EOF >> ~/intel/licenses
 SERVER localhost ANY 28618
 USE_SERVER
 EOF
</code>
Then, start an SSH tunnel:
{{Term|location=laptop|cmd=<code class="command">ssh</code> -R 28618:jetons.inria.fr:29030 -R 34430:jetons.inria.fr:34430 graphite-<code class="replace">X</code>.nancy.g5k}}
The previous command opens a shell session that can be used directly. You should keep it open as long as you need the Intel compilers.
You can also add the tunnel setup to your [https://www.grid5000.fr/mediawiki/index.php/SSH#Setting_up_a_user_config_file SSH configuration file] (.ssh/config):
 Host g5k
   Hostname access.grid5000.fr
   [...]
 Host *.intel
   User <code class="replace">g5klogin</code>
   ForwardAgent no
   RemoteForward *:28618 jetons.inria.fr:29030
   RemoteForward *:34430 jetons.inria.fr:34430
   ProxyCommand ssh g5k -W "$(basename %h .intel):%p"
Then, to create the tunnel and connect to your node, you can simply use:
{{Term|location=laptop|cmd=<code class="command">ssh</code> graphite-<code class="replace">X</code>.nancy.intel}}
To test the tunnel, you can do:
{{Term|location=graphite|cmd=<code class="command">source</code> /opt/intel/composerxe/bin/compilervars.sh intel64}} | |||
{{Term|location=graphite|cmd=<code class="command">icc</code> -v}} | |||
Using Intel compilers on Grid'5000 can be rather slow due to the license server connection.
== Execution on Xeon Phi ==
An introduction to the Xeon Phi programming environment is available [http://software.intel.com/en-us/articles/intel-xeon-phi-programming-environment on the Intel website]. Other useful resources include:
* [http://spscicomp.org/wordpress/pages/the-intel-xeon-phi/ The IBM HPC Systems Scientific Computing User Group Tutorial]
* [http://www.hpc.cineca.it/content/quick-guide-intel-mic-usage CINECA/SCAI documentation]
You can check the status of the MIC card using micinfo:
{{Term|location=graphite|cmd=<code class="command">micinfo</code>}} | |||
=== Offload mode ===
In offload mode, your program is executed on the Host, but part of its execution is offloaded to the co-processor card.
{{Note|text=This section uses a code snippet from the [http://software.intel.com/en-us/articles/intel-xeon-phi-programming-environment Intel tutorial].}}
Compile some source code:
{{Term|location=graphite|cmd=<code class="command">cd</code> /tmp}}
{{Term|location=graphite|cmd=<code class="command">source</code> /opt/intel/composerxe/bin/compilervars.sh intel64}} | {{Term|location=graphite|cmd=<code class="command">source</code> /opt/intel/composerxe/bin/compilervars.sh intel64}} | ||
{{Term|location=graphite|cmd=<code class="command">icpc</code> -openmp /grid5000/xeonphi/samples/reduction.cpp -o reduction-offload}} | {{Term|location=graphite|cmd=<code class="command">icpc</code> -openmp /grid5000/xeonphi/samples/reduction.cpp -o reduction-offload}} | ||
And execute it: | And execute it: | ||
{{Term|location=graphite|cmd=<code class="command">./reduction-offload</code>}} | {{Term|location=graphite|cmd=<code class="command">./reduction-offload</code>}} | ||
=== Native mode ===
In native mode, your program is completely executed on the Xeon Phi. Your code must be compiled natively for the Xeon Phi architecture using the -mmic option:
{{Term|location=graphite|cmd=<code class="command">icpc</code> /grid5000/xeonphi/samples/hello.cpp -openmp -mmic -o hello-native}}
You need to connect to the card using SSH first. Login on the Phi:
{{Term|location=graphite|cmd=<code class="command">ssh</code> mic0}}
And execute:
{{Term|location=graphite-mic0|cmd=<code class="command">source</code> /grid5000/xeonphi/micenv}}
{{Term|location=graphite-mic0|cmd=<code class="command">./hello-native</code>}}
Finally, you can also use the Intel MKL BLAS. For instance, to set up the MKL environment and build the matmatmul_mkl example, with the -mmic variant targeting native execution on the MIC:
{{Term|location=graphite-mic0|cmd=<code class="command">source</code> /grid5000/software/intel/mkl/bin/mklvars.sh mic}}
{{Term|location=graphite-mic0|cmd=<code class="command">icpc</code> matmatmul_mkl.c -openmp -o matmatmul_mkl -mkl}}
{{Term|location=graphite-mic0|cmd=<code class="command">icpc</code> matmatmul_mkl.c -openmp -o matmatmul_mkl -mkl -mmic}}