Deep Learning Frameworks: Difference between revisions

From Grid5000
(66 intermediate revisions by 6 users not shown)
Line 6: Line 6:
This page describes installation steps of common Deep Learning frameworks.

= Deep learning on x86_64 nodes (common case) =
= Deep learning with Nvidia GPUs on x86_64 nodes (common case) =

''pip'' will be used to install the frameworks (''conda'' could be used much the same way). Installation is performed under your home directory.
''conda'' will be used to install the frameworks (''pip'' could be used much the same way). Installation is performed under your home directory.
Please refer to [[Conda|Conda's documentation on Grid'5000]].

== Reserve some GPU nodes with OAR ==
Line 15: Line 17:

For instance, to reserve one GPU using OAR:
{{Term|location=frontal| cmd=<code class="command">oarsub</code> -I <code class="replace">-l gpu=1</code>}}
$ oarsub -I -l gpu=1
Remember to add the <code class="replace">-q production</code> option if you want to reserve a GPU from the Nancy or Rennes "production" resources.
Remember to add the '-q production' option if you want to reserve a GPU from the Nancy "production" resources.

Please avoid reserving a single GPU on nodes with many GPUs (e.g. ≥ 4) if your code only runs on one GPU. For instance, using the gemini cluster with only one GPU at a time is discouraged.

To reserve the full node (with all its GPUs):
<pre>$ oarsub -I -l host=1</pre>
{{Term|location=frontal| cmd=<code class="command">oarsub</code> -I <code class="replace">-l host=1</code>}}
To reserve a gpu or a full node on a specific cluster, add to the oarsub command:
To reserve a gpu or a full node on a specific cluster, add to the oarsub command: <code class="replace">-p cluster=&lt;clustername&gt;</code>
<pre>-p cluster=&lt;clustername&gt;</pre>
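The reservation options above can be combined. As a rough illustration, the Python sketch below assembles an <code>oarsub</code> command line from them. The helper name <code>build_oarsub</code> and its parameters are hypothetical; only the flags shown on this page are used, and the cluster name is just an example.

```python
import shlex


def build_oarsub(gpus=None, hosts=None, production=False, cluster=None):
    """Return an interactive oarsub command line as a string.

    Hypothetical helper: it only combines the options shown on this page.
    """
    if (gpus is None) == (hosts is None):
        raise ValueError("request either GPUs or full hosts (exactly one of them)")
    resource = f"gpu={gpus}" if gpus is not None else f"host={hosts}"
    args = ["oarsub", "-I", "-l", resource]
    if production:
        # Queue option for GPUs that belong to the "production" resources.
        args += ["-q", "production"]
    if cluster is not None:
        # Restrict the reservation to one cluster.
        args += ["-p", f"cluster={cluster}"]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_oarsub(gpus=1, production=True))
    print(build_oarsub(hosts=1, cluster="gemini"))
```

The function only builds the string; running it on a frontend is left to the user.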

*Once connected to the node, check GPU presence and the available CUDA version:
<pre>$ nvidia-smi
| NVIDIA-SMI 418.67      Driver Version: 418.67      CUDA Version: 10.1     |</pre>
{{Term|location=node| cmd=<code class="command">nvidia-smi</code>}}<pre>
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |</pre>
== Which machine should be used to create Conda environment? ==
Installing Conda packages can be time- and resource-consuming. You should preferably use a node (instead of a frontend) for this operation, as frontends might not have enough RAM for conda.
== NVIDIA Tools ==
NVIDIA libraries are available via Conda. This lets you manage project-specific versions of the NVIDIA CUDA Toolkit, NCCL, and cuDNN. NVIDIA maintains its own Conda channel. The versions of CUDA Toolkit available from the default channels are the same as those you will find on the NVIDIA channel.
* Create and activate a dedicated conda environment
{{Term|location=node|cmd=<code class="command">module load conda</code><br>
<code class="command">conda create --name NvidiaTools</code><br>
<code class="command">conda activate NvidiaTools</code>}}
* To compare the build numbers available from the default and nvidia channels
{{Term|location=node|cmd=<code class="command">conda search --channel nvidia cudatoolkit</code>}}
* [ Nvidia doc: Installing CUDA Using Conda]
* [ “Best practices” Managing CUDA dependencies with Conda ]
==== Cudatoolkit ====
* Install ''cudatoolkit'' from '''nvidia''' channel.
{{Term|location=node|cmd=<code class="command">conda install cudatoolkit -c nvidia</code>}}
Note: do not forget to create a dedicated environment before.
==== Cuda ====
''cuda'' is available in both the '''conda-forge''' and '''nvidia''' channels.
* Install ''cuda'' from '''nvidia''' channel:
{{Term|location=node|cmd=<code class="command">conda install cuda -c nvidia</code>}}
Note: do not forget to create a dedicated environment before.
* Installing Previous CUDA Releases
All Conda packages released under a specific CUDA version are labeled with that release version. To install a previous version, include that label in the install command to ensure that all cuda dependencies come from the wanted CUDA version. For instance, if you want to install cuda 11.3.0:
{{Term|location=node|cmd=<code class="command">conda install cuda -c nvidia/label/cuda-11.3.0</code>}}
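The label naming described above can be sketched in Python. This is only an illustration: the function names are hypothetical, and it assumes that every release label follows the three-part pattern of the <code>cuda-11.3.0</code> example.

```python
import re


def cuda_label_channel(version):
    """Return the nvidia channel label for a CUDA release such as 11.3.0."""
    # Assumption: labels always use a three-part version number.
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"expected a version like 11.3.0, got {version!r}")
    return f"nvidia/label/cuda-{version}"


def conda_install_cuda(version):
    """Build the install command that pins all cuda packages to one release."""
    return f"conda install cuda -c {cuda_label_channel(version)}"


if __name__ == "__main__":
    print(conda_install_cuda("11.3.0"))
```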
* To display the version of Nvidia cuda compiler installed:
{{Term|location=node|cmd=<code class="command">nvcc --version</code>}}

== PyTorch ==

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. It can automatically detect GPU availability at run-time.

*Go on [ PyTorch website] to see the installation command that suits you.
For instance (as of May 2020), selecting “Stable”, “Linux”, “Pip”, “Python”, “Cuda 10.1” gives this command to execute:
<pre>$ pip3 install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f</pre>
*Check if PyTorch is correctly installed to work with GPU:
<pre>$ python3 -c "import torch; print('Num GPUs Available:', torch.cuda.device_count())"</pre>
; Installation
* Load conda and activate your PyTorch environment
{{Term|location=node|cmd=<code class="command">module load conda</code><br>
<code class="command">conda create --name PyTorch</code><br>
<code class="command">conda activate PyTorch</code>}}
* Simple PyTorch installation from nvidia channel
{{Term|location=node|cmd=<code class="command">conda install pytorch -c nvidia</code>}}
* Custom PyTorch installation: go to the [ PyTorch website] to see the installation command that suits you.
For instance (as of April 2023), a full installation of the stable PyTorch release on Linux, for Python and a specific CUDA version (e.g., 11.7), can be done with this command:
{{Term|location=node|cmd=<code class="command">conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia</code>}}
{{Warning|text=You must adapt the version number of pytorch-cuda to the version of CUDA installed on your system. PyTorch will not detect the GPU if the two versions do not match.}}
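To make the version check concrete, here is a minimal Python sketch. It reads the "CUDA Version" field from the header line printed by <code>nvidia-smi</code>, which is the highest CUDA version the installed driver supports, and checks that a requested <code>pytorch-cuda</code> version does not exceed it. The function names are hypothetical and the check is deliberately conservative: it ignores CUDA's minor-version compatibility, which may let some newer runtimes work anyway.

```python
import re


def driver_cuda_version(smi_header):
    """Extract (major, minor) from the header line printed by nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def driver_supports(requested, smi_header):
    """True if the requested CUDA version (e.g. "11.7") is not newer than
    the version reported by the driver."""
    major, minor = (int(part) for part in requested.split(".")[:2])
    return (major, minor) <= driver_cuda_version(smi_header)


if __name__ == "__main__":
    # Sample header line taken from the nvidia-smi output on this page.
    header = "| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |"
    print(driver_cuda_version(header))
    print(driver_supports("11.7", header))
```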
; Verify your installation
* Check which Python binary is used:
{{Term|location=node|cmd=<code class="command">which python</code>}}
<code>/home/</code><code class="replace">login</code><code>/.conda/envs/</code><code class="replace">env_name</code><code>/bin/python</code>
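The same check can be done from Python itself. The sketch below extracts the environment name from an interpreter path of the form shown above; the function name is hypothetical and the path layout (<code>.../envs/env_name/bin/python</code>) is assumed from this page.

```python
import sys
from pathlib import PurePosixPath


def conda_env_name(python_path):
    """Return the Conda environment name embedded in an interpreter path,
    or None if the path does not look like .../envs/<env_name>/bin/python."""
    parts = PurePosixPath(python_path).parts
    for i in range(len(parts) - 3):
        if parts[i] == "envs" and parts[i + 2] == "bin":
            return parts[i + 1]
    return None


if __name__ == "__main__":
    # sys.executable is the path of the interpreter currently running.
    print(sys.executable, "->", conda_env_name(sys.executable))
```

A result of <code>None</code> suggests that the system Python is being used instead of the one from the activated environment.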
* Construct a randomly initialized tensor.
{{Term|location=node|cmd=<code class="command">python</code>}}
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.3485, 0.6268, 0.8004],
        [0.3265, 0.9763, 0.5085],
        [0.6087, 0.6940, 0.8929],
        [0.2143, 0.6307, 0.5182],
        [0.0076, 0.6455, 0.5223]])
* Print the Cuda version
{{Term|location=node|cmd=<code class="command">python</code>}}
>>> import torch
>>> print("Pytorch CUDA Version is ", torch.version.cuda)
Pytorch CUDA Version is  11.7
; Verify your installation on a GPU node
* Reserve only one GPU (with the associated CPU cores and share of memory) in interactive mode:
{{Term|location=frontal|cmd=<code class="command">oarsub -l gpu=1 -I</code>}}

* Load conda and activate your PyTorch environment on the node
{{Term|location=gpunode|cmd=<code class="command">module load conda</code><br>
<code class="command">conda activate PyTorch</code>}}

* Launch python and execute the following code:
{{Term|location=gpunode|cmd=<code class="command">python</code>}}
>>> import torch
>>> print("Whether CUDA is supported by our system: ", torch.cuda.is_available())
Whether CUDA is supported by our system:  True

* To know the CUDA device ID and name of the device, you can run:
{{Term|location=gpunode|cmd=<code class="command">python</code>}}
>>> import torch
>>> Cuda_id = torch.cuda.current_device()
>>> print("CUDA Device ID: ", torch.cuda.current_device())
CUDA Device ID:  0
>>> print("Name of the current CUDA Device: ", torch.cuda.get_device_name(Cuda_id))
Name of the current CUDA Device:  GeForce GTX 1080 Ti

== Tensorflow (with Keras) ==

*Go on [ Tensorflow website] to see the installation commands. As of May 2020 (tensorflow v2.2.0), it is:
<pre>$ pip3 install --upgrade pip
$ pip3 install tensorflow</pre>
*To use GPUs, TensorFlow requires the cuDNN library. We provide it as a module to load:
<pre>$ module load cudnn</pre>
*Now check if TensorFlow is correctly installed to work with GPU:
<pre>$ python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"</pre>

Note: This installs TensorFlow v2. If you need TensorFlow v1, see
== Tensorflow ==
TensorFlow offers multiple levels of abstraction so you can choose the right one for your needs. Build and train models by using the high-level Keras API, which makes getting started with TensorFlow and machine learning easy.
; Installation
{{Warning|text=By default, conda installs the current release of CPU-only TensorFlow. To install TensorFlow with GPU support, use the <code class="command">tensorflow-gpu</code> package name.<br>
For using TensorFlow with a GPU, refer to the [ TensorFlow documentation] on the topic, specifically the section on [ device placement].}}
{{Warning|text=The '''tensorflow-gpu''' installation consumes too much memory on Grid'5000 front-ends (frontal) and will systematically fail (killed with "out of memory"). Install it only on a GPU node, using mamba instead of conda.}}
; on a GPU node
* Reserve only one GPU (with the associated CPU cores and share of memory) in interactive mode:
{{Term|location=frontal|cmd=<code class="command">oarsub -l gpu=1 -I</code>}}
* Load conda and activate a specific TensorFlow environment (see [[Conda|Conda Documentation]])
{{Term|location=gpunode|cmd=<code class="command">module load conda</code><br>
<code class="command">conda create --name TensorFlow mamba python==3.9 -c conda-forge</code><br>
<code class="command">conda activate TensorFlow</code>}}
* Install TensorFlow from '''conda-forge''' channel (takes a long time!) using mamba
{{Term|location=gpunode|cmd=<code class="command">mamba install -c conda-forge tensorflow-gpu</code>}}
* Test the installation : print tf version
{{Term|location=gpunode|cmd=<code class="command">python</code>}}
>>> import tensorflow as tf
>>> print('tensorflow version', tf.__version__)
tensorflow version 2.12.0
* Test the installation : list GPU devices
>>> import tensorflow as tf
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
incarnation: 13861454427122602632
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 40231960576
locality {
  bus_id: 2
  numa_node: 1
  links {
incarnation: 5318792213783102490
physical_device_desc: "device: 0, name: A100-PCIE-40GB, pci bus id: 0000:81:00.0, compute capability: 8.0"
xla_global_id: 416903419
* Test the installation : multiplication
>>> import tensorflow as tf
>>> x = [[2.]]
>>> print('hello, {}'.format(tf.matmul(x, x)))
hello, [[4.]]
To go further :
If you need TensorFlow v1, see
{{Note|text=As an alternative to the conda installation, and as indicated on the official [ Tensorflow website], you can install tensorflow-gpu inside a conda environment using pip.
{{Term|location=frontend|cmd=<code class="command">module load conda cuda cudnn</code><br>
<code class="command">conda create --name TensorFlow python==3.9</code><br>
<code class="command">conda activate TensorFlow</code>}}
{{Term|location=frontend|cmd=<code class="command">pip install --upgrade pip</code><br><code class="command">pip install tensorflow</code>}}}}
== Keras ==
Keras is a high-level neural networks API, written in Python, which is used as a wrapper around TensorFlow. It was developed with a focus on enabling fast experimentation. It is the recommended tool for beginners, and even for advanced users who do not want to spend too much time dealing with the complexity of low-level libraries such as TensorFlow.
; Installation
* Since version 2.4, Keras has refocused exclusively on the TensorFlow implementation of Keras. Therefore, to use Keras, you need to have the TensorFlow package installed:
{{Term|location=node|cmd=<code class="command">conda install -c conda-forge tensorflow-gpu</code>}}
Note: do not forget to create a dedicated environment before.
; Verify the installation
* Check which Python binary is used:
{{Term|location=node|cmd=<code class="command">which python</code>}}
<code>/home/</code><code class="replace">login</code><code>/.conda/envs/</code><code class="replace">env_name</code><code>/bin/python</code>
* Print the Keras version
{{Term|location=node|cmd=<code class="command">python</code>}}
>>> from tensorflow import keras
>>> print(keras.__version__)
To go further:
* [ Keras examples]
* [ Keras blog]
== Scikit-learn ==
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
; Installation
{{Term|location=node|cmd=<code class="command">conda install -c conda-forge scikit-learn</code>}}
Note: do not forget to create a dedicated environment before.
; Verify your installation
{{Term|location=node|cmd=<code class="command">python</code>}}
>>> import sklearn
>>> sklearn.show_versions()
    python: 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0]
executable: /home/xxxx/.conda/envs/test/bin/python
  machine: Linux-5.10.0-21-amd64-x86_64-with-glibc2.31
Python dependencies:
          pip: 22.3.1
  setuptools: 65.6.3
      sklearn: 1.0.2
        numpy: 1.23.5
        scipy: 1.8.1
      Cython: None
      pandas: None
  matplotlib: None
      joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
To go further:
* [ scikit-learn Installation]
* [ Tutorials]
* [ Dataquest Scikit-learn Tutorial]
* [ Another Python SciKit Learn Tutorial]

== MXNet ==

*Go on [ MXNet website] to see the installation command that suits you.
For instance (as of May 2020), selecting “Linux”, “Python”, “GPU” and “Pip”, the command to execute (in order to use Cuda 10.1) is:
<pre>$ pip3 install mxnet-cu101</pre>
*Check if MXNet is correctly installed to work with GPU:
<pre>$ python3 -c "import mxnet; print('Num GPUs Available:', mxnet.context.num_gpus())"</pre>

== Additional resources ==
* If you want to load a module in a non-interactive job, see [[Modules#Using_modules_in_jobs]]
* An [[User:Ibada/Tuto Deep Learning|in-depth tutorial]] contributed by a Grid'5000 user, Ismael Bada
* Many Docker images exist with ready-to-use Deep Learning software stack. They can be executed using [[Docker]] or [[Singularity]] tools (using appropriate options to enable GPU usage). See wiki pages to learn how to use these tools in Grid'5000.
* If you want to use ''virtualenv'' to manage your Python packages, it is available in Grid'5000 standard environments. Create your environment with ''python3 -m venv <env_directory>'' and activate it using ''source <env_directory>/bin/activate'' before using ''pip'' and installed packages.
* If you want to use ''virtualenv'' to manage your Python packages, it is available in Grid'5000 standard environments. Create your environment with <code>python3 -m venv path/to/env_directory</code> and activate it using <code>source path/to/env_directory/bin/activate</code> before using <code>pip</code> and installing packages.
* If you prefer to use ''conda'' to manage your Python packages, it is available in Grid'5000 as a [[Software using modules|module]]. Just execute "module load miniconda3" from a node or a frontend to make it available.
* If you prefer to use ''conda'' to manage your Python packages, it is available in Grid'5000 as a [[Software using modules|module]]. Just execute <code>module load conda</code> from a node or a frontend to make it available (see the specific documentation of [[Conda|conda on Grid'5000]]).
= Deep learning with AMD GPUs =
''conda'' will be used to install the frameworks (''pip'' could be used much the same way). Installation is performed under your home directory.
== Reserve some AMD GPU nodes with OAR ==
* Reserve a node with some AMD GPUs (see the [[Hardware#Accelerators_.28GPU.2C_Xeon_Phi.29|Hardware]] page for the list of sites and clusters with GPUs).
{{Term|location=flyon| cmd=<code class="command">oarsub -I -l gpu=1 -t exotic -p "gpu_model like 'Radeon%'"</code>}}
Please avoid reserving a single GPU on nodes with many GPUs (e.g. ≥ 4) if your code only runs on one GPU. For instance, using the neowise cluster with only one GPU at a time is discouraged.
To reserve the full node (with all its GPUs):
{{Term|location=flyon| cmd=<code class="command">oarsub -I -l host=1 -t exotic -p "gpu_model like 'Radeon%'"</code>}}
*Once connected to the node, check GPU presence:
{{Term|location=neowise| cmd=<code class="command">rocm-smi</code>}}
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp  AvgPwr  SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU% 
0    26.0c  19.0W  930Mhz  350Mhz  255%  auto  225.0W    0%  0%   
============================= End of ROCm SMI Log ==============================
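If this output needs to be processed by a script, the concise table can be turned into one dictionary per GPU. The Python sketch below is an illustration only: the function name is hypothetical, and it assumes the whitespace-separated layout shown above, which may differ between ROCm versions.

```python
def parse_rocm_smi_concise(text):
    """Parse the concise rocm-smi table into a list of dicts (one per GPU)."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        # Skip empty lines and the "=====" banner lines.
        if not fields or line.startswith("="):
            continue
        if fields[0] == "GPU":
            header = fields
        elif header is not None and len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


if __name__ == "__main__":
    sample = (
        "GPU  Temp  AvgPwr  SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%\n"
        "0    26.0c  19.0W  930Mhz  350Mhz  255%  auto  225.0W    0%  0%\n"
    )
    for gpu in parse_rocm_smi_concise(sample):
        print(gpu["GPU"], gpu["Temp"], gpu["VRAM%"])
```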
== PyTorch ==
{{Note|text=Conda packages are not currently available for ROCm, please use pip instead}}
* Go on [ PyTorch website] to see the installation command that suits you.
For instance (as of February 2024), selecting “Stable”, “Linux”, “Pip”, “Python”, “ROCM 5.7” gives this command to execute:
{{Term|location=neowise| cmd=<code class="command">pip3 install torch torchvision torchaudio --index-url</code>}}
* Check if PyTorch is correctly installed to work with GPU:
{{Term|location=neowise| cmd=<code class="command">python3 -c "import torch; print('Num GPUs Available:', torch.cuda.device_count())"</code>}}
Num GPUs Available: 8
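As the command above shows, ROCm builds of PyTorch expose AMD GPUs through the same <code>torch.cuda</code> API. A slightly more detailed check could look like the following Python sketch. The function name is hypothetical; the script falls back to a plain message when PyTorch is missing or sees no GPU, so it can be run in any environment.

```python
import importlib.util


def list_torch_gpus():
    """Return one description string per GPU visible to PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return ["PyTorch is not installed in this environment"]
    import torch

    if not torch.cuda.is_available():
        return ["PyTorch is installed but sees no GPU"]
    # torch.version.hip is set on ROCm builds; getattr guards other builds.
    backend = "ROCm" if getattr(torch.version, "hip", None) else "CUDA"
    return [
        f"{backend} device {index}: {torch.cuda.get_device_name(index)}"
        for index in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for line in list_torch_gpus():
        print(line)
```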
== Tensorflow ==
{{Note|text=On AMD GPU, Tensorflow is only supported using Docker images.}}
* Enable docker on your node (the --tmp option makes docker use the /tmp directory for its storage)
{{Term|location=neowise| cmd=<code class="command">g5k-setup-docker --tmp</code>}}
* Start ROCm's Tensorflow as explained in
{{Term|location=neowise| cmd=<code class="command">alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /tmp/dockerx:/dockerx'</code>}}
{{Term|location=neowise| cmd=<code class="command">drun rocm/tensorflow:latest</code>}}

* From within the Docker container, check if Tensorflow is correctly installed to work with GPU:
{{Term|location=neowise| cmd=<code class="command">python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"</code>}}
  Num GPUs Available: 8

= Deep learning on ppc64 nodes =
Line 75: Line 376:
== About the ppc64 architecture ==

Grid'5000 has an [[Grenoble:Hardware#drac|IBM cluster]] with a total of 48 GPUs.
Grid'5000 has an [[Grenoble:Hardware#drac|IBM cluster (drac)]] with a total of 48 GPUs.

This cluster is using a '''ppc64''' architecture, which is much less common than the usual x86_64 (amd64) architecture.
Line 88: Line 389:
In general, there are several methods to install deep learning tools, each with advantages and disadvantages:

* '''modules:''' we provide [[Environment_modules|pre-built software stacks]] for several deep learning tools: this is the easiest way to use them. If you need specific versions or build options, [[Support|contact us]].
* '''modules:''' we provide [[Modules|pre-built software stacks]] for several deep learning tools: this is the easiest way to use them. If you need specific versions or build options, [[Support|contact us]].
* '''IBM PowerAI conda channel:''' IBM provides a Conda channel with deep learning tools built for ppc64. It is easy to install, but the provided tools versions are often quite out-of-date.
* '''pip packages:''' a few tools provide pip packages for ppc64, but this is rare: most pip packages are only available for x86_64
Line 101: Line 402:

To reserve a full node for one hour:
<pre>$ oarsub -I -p cluster=drac -l host=1,walltime=1:00</pre>
{{Term|location=fgrenoble| cmd=<code class="command">oarsub -I -p cluster=drac -t exotic -l host=1,walltime=1:00 </code>}}

*Once connected to the node, check GPU presence and the available CUDA version:
<pre>$ nvidia-smi
| NVIDIA-SMI 418.165.02  Driver Version: 418.165.02  CUDA Version: 10.1    |</pre>
{{Term|location=drac| cmd=<code class="command">nvidia-smi</code>}}
| NVIDIA-SMI 418.197.02  Driver Version: 418.197.02  CUDA Version: 11.2    |

{{Note|text=ppc64 nodes come with a known-working Nvidia driver version in their default environment, but it only supports CUDA versions up to 10.1. If you install a more recent driver or deploy your own images, you may experience frequent system crashes with recent nvidia drivers on Debian or Ubuntu. CentOS seems unaffected by the crashes. See [ nvidia developer forum]}}
{{Note|text=Nodes in the drac cluster come with a known-working Nvidia driver version in their default environment. If you install a more recent driver or deploy your own images, you may experience frequent system crashes with recent Nvidia drivers on Debian or Ubuntu. CentOS seems unaffected by the crashes. See [ nvidia developer forum] for details.}}

== IBM PowerAI conda channel ==

IBM provides a conda channel called PowerAI with several deep learning tools built for ppc64.
IBM PowerAI provides a Conda channel with dedicated packages compiled for ppc64le:

It makes it easy to install these tools, but the available versions are often not up-to-date. In addition, we are forced to use PowerAI 1.6.2 specifically because newer versions are not compatible with CUDA 10.1.
* Load and activate conda
{{Term|location=drac| cmd=<code class="command">module load conda</code>}}

For convenience, we provide a version of conda for ppc64 as a module:
* Install a package '<package>' from IBM PowerAI
{{Note|text=do not forget to create a dedicated environment before.}}
{{Term|location=drac|cmd=<code class="command">conda install -c <package></code>}}

== PyTorch on ppc64 ==
$ module load miniconda3
$ conda --help
To install a package, you will need to specify version 1.6.2 of PowerAI, so that conda will install the appropriate version of the package. For instance:
$ conda install example_package powerai-release=1.6.2
See for more instructions with PowerAI.

The list of packages can be found here: (don't forget to select the '''1.6.2''' release)
=== Load pytorch from modules ===

{{Note|text=Conda packages will be installed in your home directory at <code>~/.conda/</code>. Deep learning tools can easily take several GB of space: you may need to clean up from time to time or request an [[Storage|increased disk quota]].}}
* Some packages in PowerAI might require older dependencies. For instance, the version of PyTorch is too old for Python 3.8 or Python 3.9, we must use Python 3.7:
{{Term|location=drac| cmd=<code class="command">conda create --name pytorch-ppc64-py37 python=3.7</code>}}
== PyTorch on ppc64 ==
{{Term|location=drac| cmd=<code class="command">conda activate pytorch-ppc64-py37</code>}}
{{Term|location=drac|cmd=<code class="command">conda install -c pytorch python=3.7</code>}}
; Load pytorch from modules

We provide a pre-built version of pytorch, and we can provide more versions on request. It is the easiest way to use pytorch as there is nothing to install.

As of February 2021, we provide '''pytorch 1.7.1'''.
As of November 2021, we provide '''pytorch 1.7.1'''. To use it:
To use it:

{{Term|location=drac|cmd=<code class="command">module load python py-torch</code>}}
$ module load py-torch
{{Term|location=drac|cmd=<code class="command"> python3 --version</code>}}
$ python3 -c 'import torch; print(torch.cuda.is_available())'
Python 3.7.9
{{Term|location=drac|cmd=<code class="command">python3 -c 'import torch; print(torch.cuda.is_available())'</code>}}

; Install pytorch from IBM PowerAI
Note that you need to use the version of Python from our [[Modules]], because Pytorch is built against Python 3.7 and won't work with the version of Python available in Debian 11 (Python 3.9).

PowerAI 1.6.2 provides '''pytorch 1.2.0''' (later versions in PowerAI are not compatible with the version of CUDA installed on nodes)
That's it: your pytorch projects should now work while the module is loaded.

To install it, load conda and create a Python 3.7 environment:
If you want to load the module in a non-interactive job, see [[Modules#Using_modules_in_jobs]]

=== Install pytorch from IBM PowerAI ===
$ module load miniconda3
$ eval "$(conda shell.bash hook)"
$ conda create --name pytorch-ppc64-py37 python=3.7
$ conda activate pytorch-ppc64-py37

Add PowerAI repository:
PowerAI 1.7.0 provides '''pytorch 1.3.1'''

* To install it, load conda and create a Python 3.7 environment:
$ conda config --prepend channels
{{Term|location=drac| cmd=<code class="command">module load conda</code>}}
{{Term|location=drac| cmd=<code class="command">conda create --name pytorch-ppc64-py37 python=3.7</code>}}
{{Term|location=drac| cmd=<code class="command">conda activate pytorch-ppc64-py37</code>}}

Install a pytorch version that is built against CUDA 10.1:
* Add PowerAI repository:
{{Term|location=drac| cmd=<code class="command">conda config --prepend channels</code>}}

* Install pytorch:
$ conda install pytorch powerai-release=1.6.2
{{Term|location=drac| cmd=<code class="command">conda install pytorch</code>}}

Test that it works:
* It will take around 10 minutes to download and install. Test that it works:
{{Term|location=drac| cmd=<code class="command">python3 -c "import torch; print(torch.cuda.is_available())"</code>}}

<pre>$ python -c "import torch; print(torch.cuda.is_available())"</pre>
{{Note|text=If this doesn't work, make sure that you are using the correct Python interpreter provided by Conda, using e.g. <code>which python3</code>. In some cases, you might have to specify the interpreter as <code>python3.7</code>.}}

== Tensorflow on ppc64 ==

; Install tensorflow from IBM PowerAI
PowerAI 1.6.2 provides '''tensorflow 1.15.5''' (later versions in PowerAI are not compatible with the version of CUDA installed on nodes)

It is the same principle as PyTorch, prepare a conda environment:

$ module load miniconda3
$ eval "$(conda shell.bash hook)"
$ conda create --name tensorflow-ppc64-py37 python=3.7
$ conda activate tensorflow-ppc64-py37
$ conda config --prepend channels
Install the correct version of Tensorflow:
$ conda install tensorflow-gpu powerai-release=1.6.2

Test that it works:

$ python -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Num GPUs Available: 4

=== Install from conda via IBM PowerAI ===

* PowerAI 1.7.0 provides '''tensorflow 2.1.3'''. It is the same principle as PyTorch, prepare a conda environment with Python 3.7:

{{Term|location=drac|cmd=<code class="command">module load conda</code><br>
<code class="command">conda create --name tensorflow-ppc64-py37 python=3.7</code><br>
<code class="command">conda activate tensorflow-ppc64-py37</code><br>
<code class="command">conda config --prepend channels</code>}}

* Install Tensorflow with GPU support:
{{Term|location=drac|cmd=<code class="command">conda install tensorflow-gpu</code>}}

* It will take around 10 minutes to download and install. Test that it works:
{{Term|location=drac|cmd=<code class="command">python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"</code>}}
Num GPUs Available: 4

; Install from pip
=== Install from pip ===

Tensorflow is not available in pip for ppc64.  However, we can use a non-official pip package. It provides a quite recent version: '''tensorflow 2.2.0'''
Tensorflow is not available in pip for ppc64.  However, we can use a [ non-official pip package]. It provides a reasonably recent version: '''tensorflow 2.3.2'''.

Start by creating a virtualenv:
Unfortunately, as of November 2021, this unofficial package is not compatible with Python 3.9. It means that we have to use Python 3.7 through modules as a workaround.

* Start by loading Python 3.7:
$ python3 -m venv ~/venv-py3-tensorflow
{{Term|location=drac|cmd=<code class="command">module load python</code>}}
$ . ~/venv-py3-tensorflow/bin/activate
{{Term|location=drac|cmd=<code class="command">python3 --version</code>}}
Python 3.7.9

Then install Tensorflow from the non-official pip wheel:
* Then create a virtualenv:
{{Term|location=drac|cmd=<code class="command">python3 -m venv ~/venv-py3-tensorflow</code>}}
{{Term|location=drac|cmd=<code class="command">. ~/venv-py3-tensorflow/bin/activate</code>}}

* Then install Tensorflow from the non-official pip wheel:
$ wget
{{Term|location=drac|cmd=<code class="command">wget</code>}}
$ pip install --upgrade pip setuptools
{{Term|location=drac|cmd=<code class="command">pip install --upgrade pip setuptools</code>}}
$ pip install ./tensorflow-2.2.0-cp37-cp37m-linux_ppc64le.whl
{{Term|location=drac|cmd=<code class="command">pip install ./tensorflow-2.3.2-cp37-cp37m-linux_ppc64le.whl</code>}}

It might take around 20-30 minutes to install because some dependencies need to be compiled.
It takes around 5-10 minutes to install because some dependencies need to be compiled.
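The Python version constraint mentioned above is visible in the wheel file name itself: the <code>cp37</code> tag means the wheel targets CPython 3.7. The Python sketch below checks a wheel name against the running interpreter before installing it. The function names are hypothetical, and the check is simplified: it only compares the Python tag and ignores generic tags such as <code>py3</code>.

```python
import sys


def wheel_python_tag(filename):
    """Return the Python tag of a wheel file name (e.g. 'cp37')."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name")
    # Wheel names end with <python tag>-<abi tag>-<platform tag>.whl
    fields = filename[: -len(".whl")].split("-")
    if len(fields) < 5:
        raise ValueError("unexpected wheel file name")
    return fields[-3]


def wheel_matches(filename, version_info=sys.version_info):
    """True if a CPython wheel targets the given interpreter version."""
    expected = f"cp{version_info[0]}{version_info[1]}"
    return wheel_python_tag(filename) == expected


if __name__ == "__main__":
    name = "tensorflow-2.3.2-cp37-cp37m-linux_ppc64le.whl"
    print(wheel_python_tag(name), wheel_matches(name))
```

Running it with Python 3.9 reports a mismatch for this wheel, which is why Python 3.7 is loaded from the modules first.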

At runtime, you will need cudnn. You can install it yourself, or we provide it as a module for convenience:
* At runtime, you will need cudnn. You can install it yourself, or we provide it as a module for convenience:
{{Term|location=drac|cmd=<code class="command">module load cudnn</code>}}

* Test that it works:
$ module load cudnn
{{Term|location=drac|cmd=<code class="command">python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"</code>}}
  Num GPUs Available: 4

Test that it works:
As before, if you want to load cudnn in a non-interactive job, see [[Modules#Using_modules_in_jobs]]

$ python -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Num GPUs Available: 4

; Build tensorflow from sources
=== Build tensorflow from source ===

The last option is to build tensorflow from source yourself, which is useful if you need a specific version or specific features. This is for advanced users and we provide no support.
The last option is to build tensorflow from source yourself, which is useful if you need a specific version or specific features. This is for advanced users and we provide no support.
Line 253: Line 528:

See for build instructions on Grid'5000.
See for build instructions on Grid'5000.
== Mxnet on ppc64 ==
; Load mxnet from modules
We provide a pre-built version of mxnet, and we can provide more versions on request. It is an easy way to use mxnet as there is nothing to install.
As of February 2021, we provide '''mxnet 1.7.0'''.
To use it:
$ module load mxnet
$ python3 -c "import mxnet; print('Num GPUs Available:', mxnet.context.num_gpus())"
Num GPUs Available: 4

Latest revision as of 11:31, 23 February 2024

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

This page describes installation steps of common Deep Learning frameworks.

Deep learning with Nvidia GPUs on x86_64 nodes (common case)

conda will be used to install the frameworks (pip could be used much the same way). Installation is performed under your home directory.

Please refer to Conda's documentation on Grid'5000.

Reserve some GPU nodes with OAR

  • Reserve a node with some GPUs (see the Hardware page for the list of sites and clusters with GPUs).

For instance, to reserve one GPU using OAR:

Terminal.png frontal:
oarsub -I -l gpu=1

Remember to add the -q production option if you want to reserve a GPU from the Nancy or Rennes "production" resources.

Please avoid reserving a single GPU on nodes with many GPUs (e.g. ≥ 4) if you only need to run code on one GPU. For instance, using the gemini cluster for only one GPU at a time is discouraged.

To reserve the full node (with all its GPUs):

Terminal.png frontal:
oarsub -I -l host=1

To reserve a GPU or a full node on a specific cluster, add -p cluster=<clustername> to the oarsub command.
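The reservation variants above differ only in a few options, so they can be assembled programmatically. The following is a minimal sketch, not an official Grid'5000 tool: the helper name build_oarsub and its parameters are hypothetical, and it only uses the oarsub options that appear on this page (-I, -l, -q, -t, -p).

```python
import shlex


def build_oarsub(resources, interactive=True, queue=None, job_type=None, cluster=None):
    """Assemble an oarsub command line (hypothetical helper, for illustration only)."""
    args = ["oarsub"]
    if interactive:
        args.append("-I")              # interactive job
    args += ["-l", resources]          # e.g. "gpu=1" or "host=1"
    if queue:
        args += ["-q", queue]          # e.g. "production"
    if job_type:
        args += ["-t", job_type]       # e.g. "exotic"
    if cluster:
        args += ["-p", f"cluster={cluster}"]
    return args


# One GPU, then a full node on a specific cluster.
print(shlex.join(build_oarsub("gpu=1")))
print(shlex.join(build_oarsub("host=1", cluster="gemini")))
```

The list returned by the helper can be passed to subprocess.run on a frontend, or joined into a string and pasted into a terminal.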

  • Once connected to the node, check GPU presence and the available CUDA version:
Terminal.png node:
nvidia-smi
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
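If you want to read the driver and CUDA versions from a script, the header line shown above can be parsed with a regular expression. This is a sketch that assumes the header keeps the layout shown on this page; the function name parse_nvidia_smi_header is hypothetical, and the sample string is the line copied from above.

```python
import re

# Pattern for the nvidia-smi banner line (assumes the layout shown on this page).
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_nvidia_smi_header(text):
    """Return the tool, driver and CUDA versions as a dict, or None if not found."""
    match = HEADER_RE.search(text)
    return match.groupdict() if match else None


sample = "| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |"
info = parse_nvidia_smi_header(sample)
print(info)
```

On a node, the same function can be applied to the output of subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout.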

Which machine should be used to create a Conda environment?

Installing Conda packages can be time- and resource-consuming. Preferably use a node (instead of a frontend) for this operation, because frontends might not have enough RAM for conda.


NVIDIA libraries are available via Conda, which lets you manage project-specific versions of the NVIDIA CUDA Toolkit, NCCL, and cuDNN. NVIDIA maintains its own Conda channel. The versions of the CUDA Toolkit available from the default channels are the same as those on the NVIDIA channel.

  • Create and activate a dedicated conda environment
Terminal.png node:
module load conda

conda create --name NvidiaTools

conda activate NvidiaTools
  • To compare the build numbers available from the default and nvidia channels:
Terminal.png node:
conda search --channel nvidia cudatoolkit



  • Install cudatoolkit from nvidia channel.
Terminal.png node:
conda install cudatoolkit -c nvidia

Note: do not forget to create a dedicated environment beforehand.


cuda is available in both the conda-forge and nvidia channels.

  • Install cuda from nvidia channel:
Terminal.png node:
conda install cuda -c nvidia

Note: do not forget to create a dedicated environment beforehand.

  • Installing Previous CUDA Releases

All Conda packages released under a specific CUDA version are labeled with that release version. To install a previous version, include that label in the install command to ensure that all cuda dependencies come from the desired CUDA version. For instance, if you want to install cuda 11.3.0:

Terminal.png node:
conda install cuda -c nvidia/label/cuda-11.3.0
  • To display the version of the installed NVIDIA CUDA compiler:
Terminal.png node:
nvcc --version
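The labeled-channel convention described above (nvidia/label/cuda-<version>) is easy to get wrong by hand. Below is a small sketch that builds the install command for a given release; the helper names are hypothetical, and the only channel format assumed is the one shown in the cuda 11.3.0 example on this page.

```python
import re


def cuda_label_channel(version):
    """Return the labeled nvidia channel for a full CUDA release such as '11.3.0'."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"expected a version like 11.3.0, got {version!r}")
    return f"nvidia/label/cuda-{version}"


def conda_install_cuda(version):
    """Build the conda command that installs cuda from the matching labeled channel."""
    return ["conda", "install", "cuda", "-c", cuda_label_channel(version)]


print(" ".join(conda_install_cuda("11.3.0")))
```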


PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. It can automatically detect GPU availability at run-time.

  • Load conda, then create and activate your PyTorch environment
Terminal.png node:
module load conda

conda create --name PyTorch

conda activate PyTorch
  • Simple PyTorch installation from nvidia channel
Terminal.png node:
conda install pytorch -c nvidia
  • Custom PyTorch installation: go to the PyTorch website to find the installation command that suits you.

For instance (as of April 2023), for a full installation on Linux, you might combine the stable PyTorch build, the Python language and a specific CUDA version (e.g., 11.7). This gives the following command:

Terminal.png node:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
Warning.png Warning

You must adapt the version number of pytorch-cuda to the version of CUDA installed on your system. PyTorch will not detect the GPU if the two versions do not match.

Verify your installation
  • Check which Python binary is used:
Terminal.png node:
which python


  • Construct a randomly initialized tensor.
Terminal.png node:
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.3485, 0.6268, 0.8004],
        [0.3265, 0.9763, 0.5085],
        [0.6087, 0.6940, 0.8929],
        [0.2143, 0.6307, 0.5182],
        [0.0076, 0.6455, 0.5223]])
  • Print the Cuda version
Terminal.png node:
>>> import torch
>>> print("Pytorch CUDA Version is ", torch.version.cuda)
Pytorch CUDA Version is 11.7
Verify your installation on a GPU node
  • Reserve only one GPU (with the associated CPU cores and share of memory) in interactive mode:
Terminal.png frontal:
oarsub -l gpu=1 -I
  • Load conda and activate your PyTorch environment on the node
Terminal.png gpunode:
module load conda
conda activate PyTorch
  • Launch python and execute the following code:
Terminal.png gpunode:
>>> import torch
>>> print("Whether CUDA is supported by our system: ", torch.cuda.is_available())
Whether CUDA is supported by our system:  True
  • To know the CUDA device ID and name of the device, you can run:
Terminal.png gpunode:
>>> import torch
>>> Cuda_id = torch.cuda.current_device()
>>> print("CUDA Device ID: ", torch.cuda.current_device())
CUDA Device ID:  0
>>> print("Name of the current CUDA Device: ", torch.cuda.get_device_name(Cuda_id))
Name of the current CUDA Device:  GeForce GTX 1080 Ti
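The interactive checks above can be gathered into one script, which is convenient in non-interactive jobs. This is a sketch rather than an official verification tool: the function name pytorch_report is hypothetical, and the script degrades gracefully when PyTorch is not installed in the active environment.

```python
import importlib.util


def pytorch_report():
    """Collect the same information as the interactive checks above."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    report = {
        "installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,      # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(index)
            for index in range(torch.cuda.device_count())
        ]
    return report


for key, value in pytorch_report().items():
    print(f"{key}: {value}")
```

Run it with the PyTorch environment activated; if cuda_available is False on a GPU node, check the pytorch-cuda version as described in the warning above.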


TensorFlow offers multiple levels of abstraction so you can choose the right one for your needs. Build and train models by using the high-level Keras API, which makes getting started with TensorFlow and machine learning easy.

Warning.png Warning

By default, conda installs the current release of CPU-only TensorFlow. To install TensorFlow with GPU support, use the tensorflow-gpu package name.
To use TensorFlow with a GPU, refer to the TensorFlow documentation on the topic, specifically the section on device placement.

Warning.png Warning

Installing tensorflow-gpu consumes too much memory for the Grid'5000 frontends (frontal) and will systematically fail (killed with "out of memory"). Perform the installation only on a GPU node, using mamba instead of conda.

on a GPU node
  • Reserve only one GPU (with the associated CPU cores and share of memory) in interactive mode:
Terminal.png frontal:
oarsub -l gpu=1 -I
  • Load conda, then create and activate a dedicated environment that includes mamba:
Terminal.png gpunode:
module load conda

conda create --name TensorFlow mamba python==3.9 -c conda-forge

conda activate TensorFlow
  • Install TensorFlow from the conda-forge channel using mamba (this takes a long time):
Terminal.png gpunode:
mamba install -c conda-forge tensorflow-gpu
  • Test the installation: print the TensorFlow version
Terminal.png gpunode:
>>> import tensorflow as tf
>>> print('tensorflow version', tf.__version__)
tensorflow version 2.12.0
  • Test the installation : list GPU devices
>>> import tensorflow as tf
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
incarnation: 13861454427122602632
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 40231960576
locality {
  bus_id: 2
  numa_node: 1
  links {
incarnation: 5318792213783102490
physical_device_desc: "device: 0, name: A100-PCIE-40GB, pci bus id: 0000:81:00.0, compute capability: 8.0"
xla_global_id: 416903419
  • Test the installation : multiplication
>>> import tensorflow as tf
>>> x = [[2.]]
>>> print('hello, {}'.format(tf.matmul(x, x)))
hello, [[4.]]
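For scripted checks, the GPU list can also be obtained through tf.config.list_physical_devices, which returns a shorter result than device_lib. The sketch below is an illustration only: the function name tensorflow_gpus is hypothetical, and it returns None when TensorFlow is not installed in the active environment.

```python
import importlib.util


def tensorflow_gpus():
    """Return the names of the GPUs visible to TensorFlow, or None if it is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        return None

    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = tensorflow_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
else:
    print("Num GPUs Available:", len(gpus))
```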

To go further :

If you need TensorFlow v1, see

Note.png Note

As an alternative to the conda installation, and as indicated on the official TensorFlow website, you can install TensorFlow with GPU support inside a conda environment using pip.

Terminal.png frontend:
module load conda cuda cudnn

conda create --name TensorFlow python==3.9

conda activate TensorFlow
Terminal.png frontend:
pip install --upgrade pip
pip install tensorflow


Keras is a high-level neural network API, written in Python, which is used as a wrapper around TensorFlow. It was developed with a focus on enabling fast experimentation. It is the recommended tool for beginners, and even for advanced users who do not want to spend too much time dealing with the complexity of low-level libraries such as TensorFlow.

  • Since version 2.4, Keras has focused exclusively on its TensorFlow implementation. Therefore, to use Keras, you need to have the TensorFlow package installed:
Terminal.png node:
conda install -c conda-forge tensorflow-gpu

Note: do not forget to create a dedicated environment beforehand.

Verify the installation
  • Check which Python binary is used:
Terminal.png node:
which python


  • Print the Keras version
Terminal.png node:
>>> from tensorflow import keras
>>> print(keras.__version__)
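Beyond printing the version, a short training run confirms that Keras can actually build and fit a model. The following is a minimal sketch on random data, not a benchmark: the function name train_tiny_model and the toy regression task are made up for illustration, and the function returns None when TensorFlow is not installed.

```python
import importlib.util


def train_tiny_model(epochs=2):
    """Fit a one-layer model on random data and return the loss per epoch."""
    if importlib.util.find_spec("tensorflow") is None:
        return None

    import numpy as np
    from tensorflow import keras

    # Toy regression task: predict the sum of three random features.
    x = np.random.rand(32, 3).astype("float32")
    y = x.sum(axis=1, keepdims=True)

    model = keras.Sequential([keras.Input(shape=(3,)), keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
    history = model.fit(x, y, epochs=epochs, verbose=0)
    return history.history["loss"]


losses = train_tiny_model()
print("Keras not available" if losses is None else f"loss per epoch: {losses}")
```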

To go further:


Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

Terminal.png node:
conda install -c conda-forge scikit-learn

Note: do not forget to create a dedicated environment beforehand.

Verify your installation
Terminal.png node:
>>> import sklearn
>>> sklearn.show_versions()
    python: 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0]
executable: /home/xxxx/.conda/envs/test/bin/python
   machine: Linux-5.10.0-21-amd64-x86_64-with-glibc2.31

Python dependencies:
          pip: 22.3.1
   setuptools: 65.6.3
      sklearn: 1.0.2
        numpy: 1.23.5
        scipy: 1.8.1
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True
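In addition to show_versions, fitting a small model checks that the numerical dependencies work together. This is a sketch on synthetic data: the function name sklearn_smoke_test is hypothetical, and it returns None when scikit-learn is not installed in the active environment.

```python
import importlib.util


def sklearn_smoke_test():
    """Fit a classifier on synthetic data and return its training accuracy."""
    if importlib.util.find_spec("sklearn") is None:
        return None

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    features, labels = make_classification(n_samples=100, n_features=4, random_state=0)
    model = LogisticRegression().fit(features, labels)
    return model.score(features, labels)


score = sklearn_smoke_test()
print("scikit-learn not available" if score is None else f"training accuracy: {score:.2f}")
```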

To go further:

Additional resources

  • If you want to load a module in a non-interactive job, see Modules#Using_modules_in_jobs
  • An in-depth tutorial contributed by a Grid'5000 user, Ismael Bada
  • Many Docker images provide a ready-to-use Deep Learning software stack. They can be executed using Docker or Singularity (with the appropriate options to enable GPU usage). See the corresponding wiki pages to learn how to use these tools in Grid'5000.
  • If you want to use virtualenv to manage your Python packages, it is available in Grid'5000 standard environments. Create your environment with python3 -m venv path/to/env_directory and activate it using source path/to/env_directory/bin/activate before using pip and installing packages.
  • If you prefer to use conda to manage your Python packages, it is available in Grid'5000 as a module. Just execute module load conda from a node or a frontend to make it available (see the specific documentation of conda on Grid'5000).
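The virtualenv creation mentioned in the list above can also be done from Python with the standard venv module, which is what python3 -m venv uses internally. This is a sketch: the helper name create_env is hypothetical, the environment is created in a temporary directory purely for demonstration, and with_pip is left off so the example does not depend on ensurepip being installed.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir, with_pip=False):
    """Create a virtual environment and return the path of its pyvenv.cfg file."""
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    config = create_env(Path(tmp) / "env_directory")
    print("environment created:", config.exists())
```

For real use, pass a directory under your home and with_pip=True, then activate the environment with source path/to/env_directory/bin/activate as described above.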

Deep learning with AMD GPUs

conda will be used to install the frameworks (pip could be used much the same way). Installation is performed under your home directory.

Reserve some AMD GPU nodes with OAR

  • Reserve a node with some AMD GPUs (see the Hardware page for the list of sites and clusters with GPUs).
Terminal.png flyon:
oarsub -I -l gpu=1 -t exotic -p "gpu_model like 'Radeon%'"

Please avoid reserving a single GPU on nodes with many GPUs (e.g. ≥ 4) if you only need to run code on one GPU. For instance, using the neowise cluster for only one GPU at a time is discouraged.

To reserve the full node (with all its GPUs):

Terminal.png flyon:
oarsub -I -l host=1 -t exotic -p "gpu_model like 'Radeon%'"

  • Once connected to the node, check GPU presence:
Terminal.png neowise:
rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan   Perf  PwrCap  VRAM%  GPU%  
0    26.0c  19.0W   930Mhz  350Mhz  255%  auto  225.0W    0%   0%    
============================= End of ROCm SMI Log ==============================
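The concise table above can be turned into Python dictionaries when you want to log GPU state from a script. This is a sketch that assumes the column layout shown on this page (the layout may differ between ROCm versions); the function name parse_rocm_smi_concise is hypothetical, and the sample is copied from the output above.

```python
def parse_rocm_smi_concise(text):
    """Parse the concise rocm-smi table into one dict per GPU."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("="):
            continue                      # skip blank lines and banner lines
        fields = line.split()
        if header is None:
            header = fields               # first data line holds the column names
        else:
            rows.append(dict(zip(header, fields)))
    return rows


sample = """\
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
0    26.0c  19.0W   930Mhz  350Mhz  255%  auto  225.0W    0%   0%
============================= End of ROCm SMI Log ==============================
"""

gpus = parse_rocm_smi_concise(sample)
print(len(gpus), gpus[0]["Temp"])
```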


Note.png Note

Conda packages are not currently available for ROCm; please use pip instead.

For instance (as of February 2024), selecting “Stable”, “Linux”, “Pip”, “Python”, “ROCM 5.7” gives this command to execute:

Terminal.png neowise:
pip3 install torch torchvision torchaudio --index-url
  • Check that PyTorch is correctly installed and works with the GPUs:
Terminal.png neowise:
python3 -c "import torch; print('Num GPUs Available:', torch.cuda.device_count())"
Num GPUs Available: 8


Note.png Note

On AMD GPU, Tensorflow is only supported using Docker images.

  • Enable docker on your node (--tmp option is used to use /tmp directory for docker storage)
Terminal.png neowise:
g5k-setup-docker --tmp
Terminal.png neowise:
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /tmp/dockerx:/dockerx'
Terminal.png neowise:
drun rocm/tensorflow:latest
  • From within the Docker container, check that Tensorflow is correctly installed and works with the GPUs:
Terminal.png neowise:
python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
 Num GPUs Available: 8

Deep learning on ppc64 nodes

About the ppc64 architecture

Grid'5000 has an IBM cluster (drac) with a total of 48 GPUs.

This cluster is using a ppc64 architecture, which is much less common than the usual x86_64 (amd64) architecture. In particular, many deep learning frameworks are primarily targeted at x86_64 and may be hard to use on ppc64.

As a result, if you want to use this cluster for deep learning, you should be ready to invest more time to set up your experiments than on the usual x86_64 clusters.

Options to install deep learning tools

We provide installation guides for three popular deep learning frameworks: PyTorch, TensorFlow and MXnet.

In general, there are several methods to install deep learning tools, each with advantages and disadvantages:

  • modules: we provide pre-built software stacks for several deep learning tools: this is the easiest way to use them. If you need specific versions or build options, contact us.
  • IBM PowerAI conda channel: IBM provides a Conda channel with deep learning tools built for ppc64. It is easy to install, but the provided tools versions are often quite out-of-date.
  • pip packages: a few tools provide pip packages for ppc64, but this is rare: most pip packages are only available for x86_64
  • Docker images: we support installing Nvidia-docker (see Docker#Nvidia-docker), including GPU support. You will need to run ppc64 Docker images though.
  • build from source: this is for advanced users

See below for details on how to install each tool.

Reserve ppc64 GPU nodes with OAR

To reserve a full node for one hour:

Terminal.png fgrenoble:
oarsub -I -p cluster=drac -t exotic -l host=1,walltime=1:00
  • Once connected to the node, check GPU presence and the available CUDA version:
Terminal.png drac:
$ nvidia-smi
| NVIDIA-SMI 418.197.02   Driver Version: 418.197.02   CUDA Version: 11.2     |
Note.png Note

Nodes in the drac cluster come with a known-working Nvidia driver version in their default environment. If you install a more recent driver or deploy your own images, you may experience frequent system crashes with recent Nvidia drivers on Debian or Ubuntu. CentOS seems unaffected by the crashes. See nvidia developer forum for details.

IBM PowerAI conda channel

IBM PowerAI provides a Conda channel with dedicated packages compiled for ppc64le:

  • Load and activate conda
Terminal.png drac:
module load conda
  • Install a package '<package>' from IBM PowerAI
Note.png Note

Do not forget to create a dedicated environment beforehand.

PyTorch on ppc64

Load pytorch from modules

  • Some packages in PowerAI might require older dependencies. For instance, the version of PyTorch is too old for Python 3.8 or Python 3.9, so we must use Python 3.7:
Terminal.png drac:
conda create --name pytorch-ppc64-py37 python=3.7
Terminal.png drac:
conda activate pytorch-ppc64-py37

We provide a pre-built version of pytorch, and we can provide more versions on request. It is the easiest way to use pytorch as there is nothing to install.

As of November 2021, we provide pytorch 1.7.1. To use it:

Terminal.png drac:
module load python py-torch
Terminal.png drac:
python3 --version
Python 3.7.9
Terminal.png drac:
python3 -c 'import torch; print(torch.cuda.is_available())'

Note that you need to use the version of Python from our Modules, because Pytorch is built against Python 3.7 and won't work with the version of Python available in Debian 11 (Python 3.9).
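Because the module only works with Python 3.7, a job script can check the interpreter version before importing torch and fail early with a clear message. This is a minimal sketch: the constant REQUIRED and the function matches_required are hypothetical names, and the required version is taken from the paragraph above.

```python
import sys

# Python version the py-torch module is built against (from the text above).
REQUIRED = (3, 7)


def matches_required(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor version equals the required one."""
    return tuple(version_info[:2]) == required


status = "OK" if matches_required() else "wrong Python version for the py-torch module"
print(f"{sys.executable} (Python {sys.version_info.major}.{sys.version_info.minor}): {status}")
```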

That's it: your pytorch projects should now work while the module is loaded.

If you want to load the module in a non-interactive job, see Modules#Using_modules_in_jobs

Install pytorch from IBM PowerAI

PowerAI 1.7.0 provides pytorch 1.3.1

  • To install it, load conda and create a Python 3.7 environment:
Terminal.png drac:
module load conda
Terminal.png drac:
conda create --name pytorch-ppc64-py37 python=3.7
Terminal.png drac:
conda activate pytorch-ppc64-py37
  • Add PowerAI repository:
  • Install pytorch:
Terminal.png drac:
conda install pytorch
  • It will take around 10 minutes to download and install. Test that it works:
Terminal.png drac:
python3 -c "import torch; print(torch.cuda.is_available())"
Note.png Note

If this doesn't work, make sure that you are using the correct Python interpreter provided by Conda, using e.g. which python3. In some cases, you might have to specify the interpreter as python3.7.

Tensorflow on ppc64

Install from conda via IBM PowerAI

  • PowerAI 1.7.0 provides tensorflow 2.1.3. The principle is the same as for PyTorch: prepare a conda environment with Python 3.7:
Terminal.png drac:
module load conda

conda create --name tensorflow-ppc64-py37 python=3.7
conda activate tensorflow-ppc64-py37

conda config --prepend channels
  • Install Tensorflow with GPU support:
Terminal.png drac:
conda install tensorflow-gpu
  • It will take around 10 minutes to download and install. Test that it works:
Terminal.png drac:
python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Num GPUs Available: 4

Install from pip

Tensorflow is not available in pip for ppc64. However, we can use a non-official pip package. It provides a reasonably recent version: tensorflow 2.3.2.

Unfortunately, as of November 2021, this unofficial package is not compatible with Python 3.9. It means that we have to use Python 3.7 through modules as a workaround.

  • Start by loading Python 3.7:
Terminal.png drac:
module load python
Terminal.png drac:
python3 --version
Python 3.7.9
  • Then create a virtualenv:
Terminal.png drac:
python3 -m venv ~/venv-py3-tensorflow
Terminal.png drac:
. ~/venv-py3-tensorflow/bin/activate
  • Then install Tensorflow from the non-official pip wheel:
Terminal.png drac:
pip install --upgrade pip setuptools
Terminal.png drac:
pip install ./tensorflow-2.3.2-cp37-cp37m-linux_ppc64le.whl

It takes around 5-10 minutes to install because some dependencies need to be compiled.
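The wheel file name itself explains the Python 3.7 constraint: in the standard wheel naming scheme, cp37 is the Python tag and linux_ppc64le the platform tag. The sketch below splits a wheel name into its tags and compares them with the running interpreter; the function names are hypothetical, and the compatibility test is a rough heuristic for illustration, not the full logic that pip applies.

```python
import platform
import sys


def parse_wheel_filename(filename):
    """Split a wheel file name into its name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):          # an optional build tag adds a sixth part
        raise ValueError(f"unexpected wheel name: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def matches_interpreter(tags):
    """Rough check: same CPython major.minor and same machine architecture."""
    current = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return tags["python"] == current and tags["platform"].endswith(platform.machine())


tags = parse_wheel_filename("tensorflow-2.3.2-cp37-cp37m-linux_ppc64le.whl")
print(tags)
print("compatible with this interpreter:", matches_interpreter(tags))
```

On a drac node with the python module loaded, the last line is expected to report a match; with the system Python 3.9 it reports a mismatch, which is the situation described above.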

  • At runtime, you will need cudnn. You can install it yourself, or we provide it as a module for convenience:
Terminal.png drac:
module load cudnn
  • Test that it works:
Terminal.png drac:
python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
 Num GPUs Available: 4

As before, if you want to load cudnn in a non-interactive job, see Modules#Using_modules_in_jobs

Build tensorflow from source

The last option is to build tensorflow from source yourself, which is useful if you need a specific version or specific features. This is for advanced users and we provide no support.

It has been reported to work with a CentOS docker container using

See for build instructions on Grid'5000.

Nvidia-docker for ppc64


To easily install Nvidia-docker on a node, see Docker#Nvidia-docker.

Running ppc64le Docker images

You need to make sure you are running Docker images that are built for ppc64le.

Example sources of ppc64le images:

To test tensorflow with an image from IBM:

Terminal.png drac:
tensorflowtest="import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Terminal.png drac:
docker run -it --rm --gpus all ibmcom/tensorflow-ppc64le:latest-gpu-py3 python -c "$tensorflowtest"
2021-02-15 11:33:10.853846: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0, 1, 2, 3
Num GPUs Available: 4