GPUs on Grid5000

{{Maintainer|Simon Delamare}}
{{Portal|User}}
{{Portal|HPC}}
{{Portal|Tutorial}}
{{TutorialHeader}}


= Introduction =


This tutorial presents how to use GPU accelerators. You will learn to reserve these resources, set up the environment and execute code on the accelerators. Please note that this page is not about GPU programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is especially useful for testing the latest version of the accelerator software stack (such as the NVIDIA CUDA libraries).


In this tutorial, we provide code examples that use the Level-3 [https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms BLAS] function DGEMM (which computes C ← αAB + βC) to compute the product of two matrices. BLAS libraries are available for a variety of computer architectures (including multicores and accelerators), and this code example is used in this tutorial as a toy benchmark to compare the performance of accelerators and/or of the available BLAS libraries.


For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the [[Getting Started]] tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment).
The [[Hardware#Accelerators (GPU, Xeon Phi)|Hardware page]] is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see [[Status]]).


Note that the Intel Xeon Phi KNC (MICs) formerly available in Nancy are no longer supported (the [[Unmaintained:Intel Xeon Phi|documentation]] remains available).


= Nvidia GPU on Grid'5000 =


Note that NVIDIA drivers (see <code class="command">nvidia-smi</code>) and CUDA (<code class="command">nvcc --version</code>) compilation tools are installed by default on nodes.  


== Choosing a GPU ==


Have a look at the per-site detailed hardware pages (for instance, [[Lyon:Hardware#gemini|Lyon]]); there you will find useful information about GPUs:
* the card model name (see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units to know more about each model)
* the GPU memory size available for computations
* for NVidia GPUs, their compute capability
* the hosting node characteristics (number of CPUs, amount of memory, number of GPUs, reservable local disk availability, ...)
* the job access conditions (i.e. default or production queue, maximum walltime partition for clusters in the production queues)


=== About NVidia and CUDA compatibility with older GPUs ===


Most of the GPUs available in Grid'5000 are supported by the Nvidia driver and CUDA version delivered in Grid'5000 environments. As of October 2021, there are two exceptions:


* K40m GPUs available in the ''grimani'' cluster in Nancy require the <code>nvcc</code> option <code>--gpu-architecture=sm_35</code> (35 for compute capability <code>3.5</code>) to be used with CUDA starting from version 11, which is the version shipped with our debian11 environment.


* M2075 GPUs (compute capability 2.0) of the ''orion'' cluster in Lyon are not supported by the driver shipped in our environments. GPUs in this cluster are no longer usable from our environments, and the ''gpu'' property used to select a GPU node using oarsub (see below) is disabled. Note that it is still possible to [[#Custom_Nvidia_driver_using_deployment|build an environment with a custom driver]] to use these cards.


See https://en.wikipedia.org/wiki/CUDA#GPUs_supported to know more about the relationship between Cuda versions and compute capability.
== Reserving GPUs ==


=== Single GPU ===
If you only need a single GPU in the standard environment, reservation is as simple as:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-l "gpu=1"</code>}}


{{Note|text=On a multi-GPU node, this will give you only part of the memory and CPU resources. For instance, on a dual-GPU node, reserving a single GPU will give you access to half of the system memory and half of the CPU cores. This ensures that another user can reserve the other GPU and still have access to enough system memory and CPU cores.}}


In Nancy, you have to use the production queue for most of the GPU clusters, for instance:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-q production</code> <code class="command">-l "gpu=1"</code>}}


If you require several GPUs for the same experiment (e.g. for inter-GPU communication or to distribute computation), you can reserve multiple GPUs of a single node:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-l</code> <code class="replace">host=1/gpu=2</code>}}
{{Note|text=When you run <code class="command">nvidia-smi</code>, you will only see the GPU(s) you reserved, even if the node has more GPUs. This is the expected behaviour.}}
To select a specific model of GPU, there are two possibilities:
'''use gpu model aliases, as described in [[OAR Syntax simplification#GPUs]], e.g.'''
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-l gpu=1</code> <code class="command">-p </code><code class="replace">gpu_alias</code>}}
'''use the "gpu_model" property, e.g.'''
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-l gpu=1</code> <code class="command">-p "gpu_model =</code> '<code class="replace">GPU model</code>'"}}
The exact list of GPU models is available on the [[OAR_Properties#gpu_model|OAR properties page]], and you can use [[Hardware#Accelerators_.28GPU.2C_Xeon_Phi.29|Hardware page]] to have an overview of available GPUs on each site.
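For example, to request a single GPU of a specific model (the model name below is only an illustration; use one of the exact names listed on the [[OAR_Properties#gpu_model|OAR properties page]]):
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-l gpu=1</code> <code class="command">-p "gpu_model = 'GeForce GTX 1080 Ti'"</code>}}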
=== Reserving full nodes with GPUs ===
In some cases, you may want to reserve a complete node with all its GPUs. This allows you to customize the software environment with [[Sudo-g5k]] or even to [[Getting_Started|deploy another operating system]].
To make sure you obtain a node with a GPU, you can use the "gpu_count" property:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-p "gpu_count > 0"</code>}}
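Once connected to the node, you can then, for instance, get root access to customize the installed software with [[Sudo-g5k]]:
{{Term|location=node|cmd=<code class="command">sudo-g5k</code>}}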
In Nancy, you have to use the production queue for most GPU clusters:
{{Term|location=nancy|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-q production</code> <code class="command">-p "gpu_count > 0"</code> }}


To select a specific model of GPU, you can also use the "gpu_model" property, e.g.
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-p "gpu_model =</code> '<code class="replace">GPU model</code>'"}}


If you want to deploy an environment on the node, you should add the <code class="command">-t deploy</code> option.
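For instance, the following reservation (the same one used in the [[#Custom Nvidia driver using deployment|deployment section below]]) requests a full GPU node for two hours in deploy mode:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "gpu_count > 0" -l /nodes=1,walltime=2}}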


=== Note about AMD GPU ===
As of October 2021, AMD GPUs are available in a single Grid'5000 cluster, [[Lyon:Hardware#neowise|neowise]], in Lyon. <code class="command">oarsub</code> commands shown above could give you either NVidia or AMD GPUs. The ''gpu_model'' property may be used to filter between GPU vendors. For instance:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-p "gpu_count > 0 AND gpu_model NOT LIKE 'Radeon%'"</code>}}


will filter out Radeon GPUs (i.e. AMD GPUs). See [[#AMD GPU on Grid'5000|below]] for more information about AMD GPUs.
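Conversely, the same property can be used to target only AMD GPUs (a sketch using the same LIKE filter):
{{Term|location=frontend|cmd=<code class="command">oarsub</code> <code class="command">-I</code> <code class="command">-p "gpu_model LIKE 'Radeon%'"</code>}}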
== GPU usage tutorial ==


In this section, we will give an example of GPU usage under Grid'5000.
Every step of this tutorial must be performed on an Nvidia GPU node.
=== Run the CUDA Toolkit examples ===


In this part, we are going to compile and execute CUDA examples provided by Nvidia, using the CUDA Toolkit available in the default (standard) environment.


First, we retrieve the version of CUDA installed on the node:


<pre>
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
</pre>


The version is 11.2. We are going to download the corresponding CUDA samples:


<pre>
cd /tmp
git clone --depth 1 --branch v11.2 https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
</pre>


You can compile all the examples at once by running make:


<pre>
make -j8
</pre>


The compilation of all the examples is over when "Finished building CUDA samples" is printed.   


Each example is available in its own directory, under the <code class="file">Samples</code> root directory (it can also be compiled separately from there).


You can first try the <code class="file">Device Query</code> example located in <code class="file">Samples/deviceQuery/</code>. It enumerates the properties of the CUDA devices present in the system.


<pre>
/tmp/cuda-samples/Samples/deviceQuery/deviceQuery
</pre>


Here is an example of the result on the chifflet cluster at Lille:  


<pre>
/tmp/cuda-samples/Samples/deviceQuery/deviceQuery Starting...


  CUDA Device Query (Runtime API) version (CUDART static linking)


Detected 2 CUDA Capable device(s)


Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                            5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                2883584 bytes
  Maximum Texture Dimension Size (x,y,z)        1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:              65536 bytes
  Total amount of shared memory per block:      49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                    32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:          1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                            512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                    No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:      Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:  0 / 4 / 0
  Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >


Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                            5505 Mhz
  Memory Bus Width:                             352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)        1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:              65536 bytes
  Total amount of shared memory per block:      49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                    32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:          1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                            512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                    No
  Integrated GPU sharing Host Memory:           No
  Support host page-locked memory mapping:      Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):     Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:  0 / 130 / 0
  Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : No
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : No


deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.2, NumDevs = 2
Result = PASS
</pre>


=== BLAS examples ===


We now run our BLAS example to illustrate GPU performance for dense matrix multiply.  


The toolkit provides the [https://developer.nvidia.com/cublas CUBLAS] library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available [http://docs.nvidia.com/cuda/cublas/index.html here] and several [http://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries advanced examples] using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...).
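To give an idea of what the CUBLAS API looks like, here is a minimal DGEMM sketch (this is not one of the toolkit samples; the file name <code class="file">simple_dgemm.c</code> and the matrix contents are illustrative):

<source lang="c">
/* Minimal CUBLAS DGEMM sketch: C = alpha*A*B + beta*C on the GPU.
   Illustrative only: matrices are filled with constants so the
   result is easy to check. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1000;
    const double alpha = 1.0, beta = 0.0;
    size_t bytes = (size_t)n * n * sizeof(double);

    double *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Copy the input matrices to the GPU (column-major layout) */
    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);

    /* Run DGEMM on the GPU; with beta == 0, dC needs no initialization */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    /* Copy the result back and check one entry: each C[i,j] = 2*n */
    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);
    printf("C[0] = %g (expected %g)\n", C[0], 2.0 * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}
</source>

On a GPU node, it can be compiled with, for instance, <code class="command">nvcc -O3 simple_dgemm.c -lcublas -o simple_dgemm</code>.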


The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides [http://docs.nvidia.com/cuda/nvblas/ NVBLAS], a library that automatically ''offloads'' compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program, as NVBLAS can be [https://man7.org/linux/man-pages/man8/ld.so.8.html forcibly linked] using the LD_PRELOAD environment variable.


To test NVBLAS, you can download and compile our matrix-matrix multiplication example:
{{Term|location=node|cmd=<code class="command">wget</code> http://apt.grid5000.fr/tutorial/gpu/matmatmul.c}}
{{Term|location=node|cmd=gcc -O3 -Wall -std=c99 matmatmul.c -o matmatmul -lblas}}


You can first check the performance of the BLAS library on the CPU. For small matrix sizes (<5000), the provided example will compare the BLAS implementation to a naive jki-loop version of the matrix multiplication:
{{Term|location=node|cmd=./matmatmul 2000}}
  Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000)
  BLAS  - Time elapsed:  1.724E+00 sec.
  J,K,I - Time elapsed:  7.233E+00 sec.
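For reference, a naive jki-loop multiplication looks roughly like the following self-contained sketch (the actual code in <code class="file">matmatmul.c</code> may differ; column-major storage is assumed):

<source lang="c">
#include <stdio.h>
#include <stdlib.h>

/* Naive jki-loop matrix multiply, C = A*B, column-major storage:
   a rough sketch of the reference loop timed against BLAS. */
int main(void) {
    const int n = 500;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    /* j iterates over columns of C, k over the inner dimension,
       i over rows: the innermost loop walks contiguous memory. */
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                C[i + j * n] += A[i + k * n] * B[k + j * n];

    printf("C[0] = %g (expected %g)\n", C[0], 2.0 * n);
    free(A); free(B); free(C);
    return 0;
}
</source>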


To offload the BLAS computation on the GPU, use:
{{Term|location=node|cmd=echo "NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libblas.so" > nvblas.conf}}
{{Term|location=node|cmd=LD_PRELOAD=libnvblas.so ./matmatmul 2000}}
  [NVBLAS] Config parsed
  Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000)
  BLAS - Time elapsed: 1.249E-01 sec.


Depending on the node hardware, the GPU might perform better on larger problems:
{{Term|location=node|cmd=./matmatmul 5000}}
  Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
  BLAS  - Time elapsed:  2.673E+01 sec.


{{Term|location=node|cmd=LD_PRELOAD=libnvblas.so ./matmatmul 5000}} 
  [NVBLAS] Config parsed
  Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
  BLAS  - Time elapsed:  1.718E+00 sec.


If you want to measure the time spent on data transfers to the GPU, have a look at the simpleCUBLAS (<code class="file">/tmp/cuda-samples/Samples/simpleCUBLAS</code>) example and instrument the code with timers.


== Custom CUDA version or Nvidia drivers ==


Here, we explain how to use other CUDA versions with [[Modules]], how to use Nvidia Docker images, and how to install the NVIDIA drivers and compilers, before validating the installation on the example set used above.


=== Older or newer CUDA version using modules ===


Different CUDA versions can be loaded using the "module" command. You should first choose the CUDA toolkit version that you will load with the module tool:


{{Term|location=node|cmd= module av cuda}}
<pre>


------------- /grid5000/spack/v1/share/spack/modules/linux-debian11-x86_64_v2 ----------------
  cuda/11.4.0_gcc-10.4.0    cuda/11.6.2_gcc-10.4.0    cuda/11.7.1_gcc-10.4.0 (D)


</pre>


{{Term|location=node|cmd= module load cuda/11.6.2_gcc-10.4.0}}


{{Term|location=node|cmd= nvcc --version}}
<pre>
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
</pre>


You should consult [https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver CUDA Toolkit and Compatible Driver Versions] to ensure compatibility with a specific Cuda version and the Nvidia GPU driver (for instance, Cuda 11.x toolkit requires a driver version >= 450.80.02)
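For instance, you can print the driver version currently installed on a node with <code class="command">nvidia-smi</code> (the query option below is one way to do it; the plain <code class="command">nvidia-smi</code> output also shows the driver version in its header):
{{Term|location=node|cmd=<code class="command">nvidia-smi</code> --query-gpu=driver_version --format=csv,noheader}}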


=== Copy and compile the sample examples ===


You now have everything installed. For instance, you can compile and run the toolkit examples (see [[#Run the CUDA Toolkit examples]] for more information).


You will need to override the CUDA path variable, and also load the matching compiler version from modules:


{{Term|location=node|cmd=which nvcc}}
<pre>/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e/bin/nvcc</pre>


{{Term|location=node|cmd=export CUDA_PATH=/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e}}
{{Term|location=node|cmd=module load gcc/10.4.0_gcc-10.4.0}}


And then you can build and run the examples:


{{Term|location=node|cmd=git clone --depth 1 --branch v11.6 https://github.com/NVIDIA/cuda-samples.git /tmp/cuda-samples}}
{{Term|location=node|cmd=cd /tmp/cuda-samples}}
{{Term|location=node|cmd=make -j32}}
{{Term|location=node|cmd=./Samples/0_Introduction/matrixMul/matrixMul}}


{{Note|text=Please note that with some old GPUs you might encounter errors when running the latest version of CUDA. This is the case with the orion cluster, for example.}}


=== Nvidia-docker ===
A script to install nvidia-docker is available if you want to use Nvidia's images built for Docker and GPU nodes. This provides an alternative way of making CUDA and Nvidia libraries available to the node. See the [[Docker#Nvidia-docker|Nvidia Docker page]].


=== Custom Nvidia driver using deployment ===


A custom Nvidia driver may be installed on a node if needed. As ''root'' privileges are required, we will use kadeploy to deploy a <code class="file">debian11-x64-nfs</code> environment on the GPU node you reserved.


This environment allows you to connect either as ''root'' (to be able to install new software) or using your normal Grid'5000 user account (including access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:


{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -p "gpu_count > 0" -l /nodes=1,walltime=2}}  
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODE_FILE -e debian11-x64-nfs -k}}  


Once the deployment is terminated, you should be able to connect to the node as root:
{{Term|location=frontend|cmd=<code class="command">ssh</code> root@`head -1 $OAR_NODE_FILE`}}


You can then perform the NVIDIA driver installation:

{{Term|location=node|cmd=apt-get -y install linux-headers-amd64 make g++}}

{{Term|location=node|cmd=<code class="command">wget https://download.nvidia.com/XFree86/Linux-x86_64/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run</code>}}

{{Term|location=node|cmd=<code class="command">rmmod nouveau</code>}}

{{Term|location=node|cmd=<code class="command">sh NVIDIA-Linux-x86_64-470.82.01.run -s --no-install-compat32-libs</code>}}
(warnings about X.Org can safely be ignored)

On the node you can check which NVIDIA drivers are installed with the <code class="command">nvidia-smi</code> tool:

{{Term|location=node|cmd=<code class="command">nvidia-smi</code>}}


Here is an example of the result on the graphique cluster:


<pre>
root@graphique-4:~# nvidia-smi
Tue Jun 27 19:37:15 2023     
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util Compute M. |
|                              |                      |              MIG M. |
|===============================+======================+======================|
|  0  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 26%  28C    P0    46W / 180W |      0MiB /  4043MiB |      0%      Default |
|                              |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|  1  NVIDIA GeForce ...  Off  | 00000000:82:00.0 Off |                  N/A |
| 28%  27C    P0    43W / 180W |      0MiB /  4043MiB |      2%      Default |
|                              |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                             
+-----------------------------------------------------------------------------+
| Processes:                                                                 |
| GPU  GI  CI        PID  Type  Process name                  GPU Memory |
|        ID  ID                                                  Usage      |
|=============================================================================|
|  No running processes found                                                |
+-----------------------------------------------------------------------------+


</pre>


If you want to record your environment with the custom NVidia driver, see [[Advanced_Kadeploy#Create_a_new_environment_from_a_customized_environment]].


= AMD GPU on Grid'5000 =


As of October 2021, Grid'5000 has one cluster with AMD GPUs: the [[Lyon:Hardware#neowise|neowise cluster in Lyon]].


A neowise GPU may be reserved using:


{{Term|location=flyon|cmd=<code class="command">oarsub</code> -t exotic -p neowise -l gpu=1 -I}}


A full neowise node may be reserved using:


{{Term|location=flyon|cmd=<code class="command">oarsub</code> -t exotic -p neowise -I}}


The default environment on neowise includes part of AMD's [https://rocmdocs.amd.com/en/latest/index.html ''ROCm''] stack, with the AMD GPU driver and basic tools and libraries such as:
* <code class=command>rocm-smi</code> : get information about GPUs
* <code class=command>hipcc</code> : HIP compiler
* <code class=command>hipify-perl</code> : CUDA to HIP code converter
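As a quick sanity check on a neowise node, you can, for instance, list the GPUs and compile a HIP source file (the file names below are illustrative):
{{Term|location=node|cmd=<code class="command">rocm-smi</code>}}
{{Term|location=node|cmd=<code class="command">hipcc</code> <code class="replace">my_kernel.cpp</code> -o <code class="replace">my_kernel</code>}}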


In addition, most libraries and development tools from ROCm and HIP (available at https://rocmdocs.amd.com/en/latest/Installation_Guide/Software-Stack-for-AMD-GPU.html) are available as [[Modules|modules]]. The deep learning frameworks PyTorch and TensorFlow are also [[Deep Learning Frameworks#Deep_learning_with_AMD_GPUs|known to work]].


{{Pages|HPC}}

Latest revision as of 09:46, 29 June 2023

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Introduction

This tutorial presents how to use GPU Accelerators. You will learn to reserve these resources, setup the environment and execute codes on the accelerators. Please note that this page is not about GPU programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is especially useful for testing the latest version of the accelerator software stack (such as the NVIDIA CUDA libraries).

In this tutorial, we provide code examples that use the Level-3 BLAS function DGEMM to compute the product of the two matrices. BLAS libraries are available for a variety of computer architectures (including multicores and accelerators) and this code example is used on this tutorial as a toy benchmark to compare the performance of accelerators and/or available BLAS libraries.

For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the Getting Started tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment). The Hardware page is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see Status).

Note that Intel Xeon Phi KNC (MICs) available in Nancy are no longer supported (documentation remains available)

Nvidia GPU on Grid'5000

Note that NVIDIA drivers (see nvidia-smi) and CUDA (nvcc --version) compilation tools are installed by default on nodes.

Choosing a GPU

Have a look at per-site, detailed hardware pages (for instance, at Lyon), you will find here useful informations about GPUs:

  • the card model name (see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units to know more about each model)
  • the GPU memory size available for computations
  • for NVidia GPU, their compute capability
  • the hosting node characteristics (#cpu, qty of memory available, #gpus, reservable local disk availability, ...)
  • the job access conditions (ie: default or production queue, max walltime partition for clusters in the production queues)

About NVidia and CUDA compatibility with older GPUs

Most of GPU available in Grid'5000 are supported by Nvidia driver and CUDA delivered in Grid'5000 environments. As of October 2021, there are two exceptions:

  • K40m GPUs available in grimani cluster in Nancy requires the nvcc option ---gpu-architecture=sm_35 (35 for compute capability 3.5) to be used with CUDA starting from version 11, which is the version shipped with our debian11 environment.
  • M2075 GPUs (compute capability 2.0) of the orion cluster in Lyon is not supported by the driver shipped in our environments. GPUs in this cluster are no more usable from our environments and the gpu property used to select a GPU node using oarsub (see below) is disabled. Not that it is still possible for to build an environment with custom driver to use these cards.

See https://en.wikipedia.org/wiki/CUDA#GPUs_supported to know more about the relationship between Cuda versions and compute capability.

Reserving GPUs

Single GPU

If you only need a single GPU in the standard environment, reservation is as simple as:

Terminal.png frontend:
oarsub -I -l "gpu=1"
Note.png Note

On a multi-GPU node, this will give you only part of the memory and CPU resources. For instance, on a dual-GPU node, reserving a single GPU will give you access to half of the system memory and half of the CPU cores. This ensures that another user can reserve the other GPU and still have access to enough system memory and CPU cores.

In Nancy, you have to use the production queue for most of the GPU clusters, for instance:

Terminal.png frontend:
oarsub -I -q production -l "gpu=1"

If you require several GPUs for the same experiment (e.g. for inter-GPU communication or to distribute computation), you can reserve multiple GPUs of a single node:

Terminal.png frontend:
oarsub -I -l host=1/gpu=2
Note.png Note

When you run nvidia-smi, you will only see the GPU(s) you reserved, even if the node has more GPUs. This is the expected behaviour.

To select a specific model of GPU, there are two possibilities:

use a GPU model alias, as described in OAR Syntax simplification#GPUs, e.g.

Terminal.png frontend:
oarsub -I -l gpu=1 -p gpu_alias

use the "gpu_model" property, e.g.

Terminal.png frontend:
oarsub -I -l gpu=1 -p "gpu_model = 'GPU model'"

The exact list of GPU models is available on the OAR properties page, and you can use the Hardware page to get an overview of the GPUs available on each site.

Reserving full nodes with GPUs

In some cases, you may want to reserve a complete node with all its GPUs. This allows you to customize the software environment with Sudo-g5k or even to deploy another operating system.

To make sure you obtain a node with a GPU, you can use the "gpu_count" property:

Terminal.png frontend:
oarsub -I -p "gpu_count > 0"

In Nancy, you have to use the production queue for most GPU clusters:

Terminal.png nancy:
oarsub -I -q production -p "gpu_count > 0"

To select a specific model of GPU, you can also use the "gpu_model" property, e.g.

Terminal.png frontend:
oarsub -I -p "gpu_model = 'GPU model'"

If you want to deploy an environment on the node, you should add the -t deploy option.

Note about AMD GPU

As of October 2021, AMD GPUs are available in a single Grid'5000 cluster: neowise, in Lyon. The oarsub commands shown above could give you either NVidia or AMD GPUs. The gpu_model property may be used to filter between GPU vendors. For instance:

Terminal.png frontend:
oarsub -I -p "gpu_count > 0 AND gpu_model NOT LIKE 'Radeon%'"

will filter out Radeon (i.e., AMD) GPUs. See below for more information about AMD GPUs.

GPU usage tutorial

In this section, we will give an example of GPU usage under Grid'5000.

Every step of this tutorial must be performed on an NVIDIA GPU node.

Run the CUDA Toolkit examples

In this part, we are going to compile and execute the CUDA examples provided by NVIDIA, using the CUDA Toolkit available in the default (standard) environment.

First, we retrieve the version of CUDA installed on the node:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

The version is 11.2. We are going to download the corresponding CUDA samples:

cd /tmp
git clone --depth 1 --branch v11.2 https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples

You can compile all the examples at once by running make:

make -j8

The compilation of all the examples is over when "Finished building CUDA samples" is printed.

Each example is available from its own directory, under the Samples root directory (it can also be compiled separately from there).

You can first try the Device Query example located in Samples/deviceQuery/. It enumerates the properties of the CUDA devices present in the system.

/tmp/cuda-samples/Samples/deviceQuery/deviceQuery

Here is an example of the result on the chifflet cluster at Lille:

/tmp/cuda-samples/Samples/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : No
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.2, NumDevs = 2
Result = PASS

BLAS examples

We now run our BLAS example to illustrate GPU performance for dense matrix multiply.

The toolkit provides the CUBLAS library, which is a GPU-accelerated implementation of the BLAS. CUBLAS is documented in NVIDIA's CUDA documentation, and several advanced examples using it are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...).
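
To give an idea of what the regular CUBLAS API looks like, here is a minimal sketch of ours (not one of the toolkit samples) of a DGEMM call: arrays are explicitly allocated on, and transferred to, the GPU before the computation. It can typically be compiled on a GPU node with nvcc cublas_dgemm.c -lcublas -o cublas_dgemm (the file name is hypothetical):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1024;                       /* square matrices of size n x n */
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = (size_t)n * n * sizeof(double);

    /* Host matrices, filled with constants so the result is easy to check */
    double *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    /* Device (GPU) matrices */
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Copy inputs to the GPU, compute C = alpha*A*B + beta*C, copy the result back */
    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);

    printf("C[0] = %g (expected %g)\n", C[0], 2.0 * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}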

The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides NVBLAS, a library that automatically offloads compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that calls BLAS routines on the host into a GPU-accelerated program. In addition, there is no need to recompile the program, as NVBLAS can be forcibly linked using the LD_PRELOAD environment variable.

To test NVBLAS, you can download and compile our matrix-matrix multiplication example:

Terminal.png node:
gcc -O3 -Wall -std=c99 matmatmul.c -o matmatmul -lblas
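
If the original matmatmul.c is not at hand, here is a minimal stand-in (our sketch, not the exact tutorial file) that only times the dgemm_ call, without the naive-loop comparison, and builds with the same gcc command shown above:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fortran-style BLAS DGEMM: C = alpha*A*B + beta*C (column-major) */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 2000;   /* matrix size from the command line */
    double alpha = 1.0, beta = 0.0;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    printf(" Multiplying Matrices: C(%dx%d) = A(%dx%d) x B(%dx%d)\n", n, n, n, n, n, n);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf(" BLAS  - Time elapsed: %10.3E sec.\n",
           (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9);

    free(A); free(B); free(C);
    return 0;
}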

You can first check the performance of the BLAS library on the CPU. For small matrix sizes (<5000), the provided example will compare the BLAS implementation to a naive jki-loop version of the matrix multiplication:

Terminal.png node:
./matmatmul 2000
 Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000)
 BLAS  - Time elapsed:  1.724E+00 sec.
 J,K,I - Time elapsed:  7.233E+00 sec.

To offload the BLAS computation on the GPU, use:

Terminal.png node:
echo "NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libblas.so" > nvblas.conf
Terminal.png node:
LD_PRELOAD=libnvblas.so ./matmatmul 2000
 [NVBLAS] Config parsed
 Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000)
 BLAS  - Time elapsed:  1.249E-01 sec.

Depending on the node hardware, the GPU may perform comparatively better on larger problems:

Terminal.png node:
./matmatmul 5000
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  2.673E+01 sec.
Terminal.png node:
LD_PRELOAD=libnvblas.so ./matmatmul 5000
 [NVBLAS] Config parsed
 Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000)
 BLAS  - Time elapsed:  1.718E+00 sec.

If you want to measure the time spent on data transfers to the GPU, have a look at the simpleCUBLAS (/tmp/cuda-samples/Samples/simpleCUBLAS) example and instrument the code with timers.

Custom CUDA version or Nvidia drivers

Here, we explain how to use other CUDA versions with modules, how to use NVIDIA Docker images, and how to install the NVIDIA drivers and compilers, before validating the installation on the previous set of examples.

Older or newer CUDA version using modules

Different CUDA versions can be loaded using the module command. You should first choose the CUDA toolkit version that you will load with the module tool:

Terminal.png node:
module av cuda

------------- /grid5000/spack/v1/share/spack/modules/linux-debian11-x86_64_v2 ----------------
   cuda/11.4.0_gcc-10.4.0    cuda/11.6.2_gcc-10.4.0    cuda/11.7.1_gcc-10.4.0 (D)

Terminal.png node:
module load cuda/11.6.2_gcc-10.4.0
Terminal.png node:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

You should consult CUDA Toolkit and Compatible Driver Versions to ensure compatibility between a specific CUDA version and the NVIDIA GPU driver (for instance, the CUDA 11.x toolkit requires a driver version >= 450.80.02).

Copy and compile the sample examples

You now have everything installed. For instance, you can compile and run the toolkit examples (see the Run the CUDA Toolkit examples section above for more information).

You will need to override the CUDA path variable, and also load the matching compiler version from modules:

Terminal.png node:
which nvcc
/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e/bin/nvcc
Terminal.png node:
export CUDA_PATH=/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e
Terminal.png node:
module load gcc/10.4.0_gcc-10.4.0

And then you can build and run the examples:

Terminal.png node:
git clone --depth 1 --branch v11.6 https://github.com/NVIDIA/cuda-samples.git /tmp/cuda-samples
Terminal.png node:
cd /tmp/cuda-samples
Terminal.png node:
make -j32
Terminal.png node:
./Samples/0_Introduction/matrixMul/matrixMul
Note.png Note

Please note that with some older GPUs, you might encounter errors when running the latest versions of CUDA. This is the case with the orion cluster, for example.

Nvidia-docker

A script to install nvidia-docker is available if you want to use NVIDIA's images built for Docker on GPU nodes. This provides an alternative way of making CUDA and NVIDIA libraries available to the node. See the Nvidia Docker page.
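
Once nvidia-docker is installed, a quick sanity check could look like the following (the image tag is only an indication; check Docker Hub for the currently available nvidia/cuda tags):

Terminal.png node:
docker run --rm --gpus all nvidia/cuda:11.2.2-base-ubuntu20.04 nvidia-smi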

Custom Nvidia driver using deployment

A custom NVIDIA driver may be installed on a node if needed. As root privileges are required, we will use kadeploy to deploy a debian11-x64-nfs environment on the GPU node you reserved.

This environment allows you to connect either as root (to be able to install new software) or with your normal Grid'5000 user account (including access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install it:

Terminal.png frontend:
oarsub -I -t deploy -p "gpu_count > 0" -l /nodes=1,walltime=2
Terminal.png frontend:
kadeploy3 -f $OAR_NODE_FILE -e debian11-x64-nfs -k

Once the deployment has completed, you should be able to connect to the node as root:

Terminal.png frontend:
ssh root@`head -1 $OAR_NODE_FILE`

You can then perform the NVIDIA driver installation:

Terminal.png node:
apt-get -y install linux-headers-amd64 make g++
Terminal.png node:
rmmod nouveau
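
The driver installer itself is not shipped with the environment; it can be downloaded from NVIDIA's servers (the URL below follows NVIDIA's usual naming scheme for driver version 470.82.01 and is given as an indication):

Terminal.png node:
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run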
Terminal.png node:
sh NVIDIA-Linux-x86_64-470.82.01.run -s --no-install-compat32-libs

(warnings about X.Org can safely be ignored)

On the node you can check which NVIDIA drivers are installed with the nvidia-smi tool:

Terminal.png node:
nvidia-smi

Here is an example of the result on the graphique cluster:

root@graphique-4:~# nvidia-smi
Tue Jun 27 19:37:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 26%   28C    P0    46W / 180W |      0MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:82:00.0 Off |                  N/A |
| 28%   27C    P0    43W / 180W |      0MiB /  4043MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If you want to record your environment with the custom NVIDIA driver, see Advanced_Kadeploy#Create_a_new_environment_from_a_customized_environment.

AMD GPU on Grid'5000

As of October 2021, Grid'5000 has one cluster with AMD GPUs: the neowise cluster in Lyon.

A neowise GPU may be reserved using:

Terminal.png flyon:
oarsub -t exotic -p neowise -l gpu=1 -I

A full neowise node may be reserved using:

Terminal.png flyon:
oarsub -t exotic -p neowise -I

The default environment on neowise includes part of AMD's ROCm stack, with the AMD GPU driver and basic tools and libraries such as:

  • rocm-smi : get information about GPUs
  • hipcc : the HIP compiler
  • hipify-perl : a CUDA to HIP code converter (see the example below)

In addition, most libraries and development tools from ROCm and HIP (available at https://rocmdocs.amd.com/en/latest/Installation_Guide/Software-Stack-for-AMD-GPU.html) are available as modules. The deep learning frameworks PyTorch and TensorFlow are also known to work.
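
As an illustration of the CUDA-to-HIP conversion workflow mentioned above (the file names are only an illustration):

Terminal.png node:
hipify-perl my_kernel.cu > my_kernel_hip.cpp
Terminal.png node:
hipcc my_kernel_hip.cpp -o my_kernel_hip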