Deep Learning Frameworks

From Grid5000
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). As this is a wiki page, you are free to make minor corrections yourself. To suggest a more fundamental change, please contact the Grid'5000 team.

This page describes how to install common Deep Learning frameworks on Grid'5000 nodes.

Deep learning with Nvidia GPUs on x86_64 nodes (common case)

pip will be used to install the frameworks (conda could be used much the same way). Installation is performed under your home directory.

Reserve some GPU nodes with OAR

  • Reserve a node with some GPUs (see the Hardware page for the list of sites and clusters with GPUs).

For instance, to reserve one GPU using OAR:

$ oarsub -I -l gpu=1

Remember to add the '-q production' option if you want to reserve a GPU from the Nancy "production" resources.

Please avoid reserving a single GPU on nodes with many GPUs (e.g. ≥ 4) if you only need to run code on one GPU. For instance, the gemini cluster is a poor choice for a user who only needs one GPU at a time.

To reserve the full node (with all its GPUs):

$ oarsub -I -l host=1

To reserve a GPU or a full node on a specific cluster, add the following option to the oarsub command:

-p cluster=<clustername>
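The reservation options above can be combined. As a minimal sketch (the helper name build_oarsub_command is hypothetical and is not part of OAR or Grid'5000), the following Python function only assembles a command line from the flags shown on this page; it does not submit anything:

```python
import shlex


def build_oarsub_command(gpus=None, full_node=False, cluster=None, production=False):
    """Assemble an interactive oarsub command line (hypothetical helper).

    Only the options shown on this page are used: -I, -l, -p and -q.
    The command is returned as a string and is never executed here.
    """
    # Exactly one of "gpus=N" or "full_node=True" must be requested.
    if full_node == (gpus is not None):
        raise ValueError("choose either gpus=N or full_node=True")
    resources = "host=1" if full_node else "gpu={}".format(gpus)
    args = ["oarsub", "-I", "-l", resources]
    if cluster is not None:
        # Restrict the reservation to a single cluster.
        args += ["-p", "cluster={}".format(cluster)]
    if production:
        # Required for the Nancy "production" resources.
        args += ["-q", "production"]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(build_oarsub_command(gpus=1))
    print(build_oarsub_command(full_node=True, cluster="gemini"))
```

The printed commands can then be copied to a frontend shell. Building the string separately makes it easy to review what will be reserved before submitting.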
  • Once connected to the node, check GPU presence and the available CUDA version:
$ nvidia-smi 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
(...)
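The driver and CUDA versions in the nvidia-smi header are useful when choosing which CUDA variant of a framework to install. The sketch below is illustrative only: the function name is hypothetical, and it assumes the header layout shown above. It extracts both versions from the text that nvidia-smi prints:

```python
import re


def parse_nvidia_smi_header(output):
    """Return (driver_version, cuda_version) found in nvidia-smi output, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", output
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


# Header line copied from the example output above.
SAMPLE = "| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |"

if __name__ == "__main__":
    print(parse_nvidia_smi_header(SAMPLE))
```

On a node, the same function could be fed the output of subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout.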

PyTorch

The installation selector on the PyTorch website gives the command to run for a given configuration. For instance (as of November 2021), selecting “Stable”, “Linux”, “Pip”, “Python”, “CUDA 11.3” gives this command:

$ pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
  • Check that PyTorch is correctly installed and can use the GPU:
$ python3 -c "import torch; print('Num GPUs Available:', torch.cuda.device_count())"
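A device count alone does not prove that computations actually run on the GPU. As a slightly more thorough sketch (the function name torch_gpu_report is hypothetical), the following script lists the visible devices and runs a tiny computation on the first one. It falls back to a message if PyTorch is not installed:

```python
def torch_gpu_report():
    """Return a list of lines describing what PyTorch can see (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed in this Python environment"]

    lines = [
        "PyTorch {}, CUDA available: {}".format(
            torch.__version__, torch.cuda.is_available()
        )
    ]
    for index in range(torch.cuda.device_count()):
        lines.append("GPU {}: {}".format(index, torch.cuda.get_device_name(index)))
    if torch.cuda.is_available():
        # Run a tiny computation on the first GPU to confirm that kernels launch.
        x = torch.ones(2, 2, device="cuda")
        lines.append("Sum computed on GPU: {}".format(x.sum().item()))
    return lines


if __name__ == "__main__":
    for line in torch_gpu_report():
        print(line)
```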

TensorFlow (with Keras)

  • Go to the TensorFlow website to see the installation commands. As of November 2021 (TensorFlow v2.7.0), they are:
$ pip3 install --upgrade pip
$ pip3 install tensorflow
  • To use GPUs, TensorFlow requires the cuDNN library. We provide it as a module to load:
$ module load cudnn
  • Now check that TensorFlow is correctly installed and can use the GPU:
$ python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
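The same check can be written as a short script that also prints each device. This is a sketch rather than an official procedure: the function name is hypothetical, and it uses tf.config.list_physical_devices, the non-experimental name of the call used above (available since TensorFlow 2.1). If the import succeeds but no GPU is listed, one thing to verify is that the cudnn module was loaded in the same shell.

```python
def tensorflow_gpu_report():
    """Return a list of lines describing the GPUs TensorFlow can see (hypothetical helper)."""
    try:
        import tensorflow as tf
    except ImportError:
        return ["TensorFlow is not installed in this Python environment"]

    gpus = tf.config.list_physical_devices("GPU")
    lines = ["TensorFlow {}, GPUs visible: {}".format(tf.__version__, len(gpus))]
    for gpu in gpus:
        # Each entry is a PhysicalDevice with a name and a device type.
        lines.append("{} ({})".format(gpu.name, gpu.device_type))
    return lines


if __name__ == "__main__":
    for line in tensorflow_gpu_report():
        print(line)
```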

Note: This installs TensorFlow v2. If you need TensorFlow v1, see https://www.tensorflow.org/guide/migrate

MXNet

Unfortunately, we cannot currently support MXNet on Debian 11. If you need MXNet, please contact the technical team.

Additional resources

  • If you want to load a module in a non-interactive job, see Environment_modules#Using_modules_in_jobs
  • An in-depth tutorial contributed by a Grid'5000 user, Ismael Bada
  • Many Docker images provide a ready-to-use Deep Learning software stack. They can be run with Docker or Singularity (using the appropriate options to enable GPU usage). See the corresponding wiki pages to learn how to use these tools in Grid'5000.
  • If you want to use virtualenv to manage your Python packages, it is available in the Grid'5000 standard environments. Create your environment with python3 -m venv path/to/env_directory and activate it with source path/to/env_directory/bin/activate before using pip to install packages.
  • If you prefer to use conda to manage your Python packages, it is available in Grid'5000 as a module. Run "module load miniconda3" from a node or a frontend to make it available (see the Grid'5000 documentation on conda for details).
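When the system Python, a virtualenv, and conda coexist, it helps to confirm which environment is active before running pip. The sketch below is illustrative: the function name is hypothetical, and it assumes that conda exports the CONDA_PREFIX variable when an environment is activated.

```python
import os
import sys


def describe_python_environment():
    """Report which Python interpreter and environment are active (hypothetical helper)."""
    return {
        # Full path of the interpreter that is running this script.
        "executable": sys.executable,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        # Assumption: set by "conda activate"; None when no conda environment is active.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }


if __name__ == "__main__":
    for key, value in describe_python_environment().items():
        print("{}: {}".format(key, value))
```

If "executable" does not point inside the expected environment directory, packages installed with pip will probably end up somewhere else than intended.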

Deep learning with AMD GPUs

pip will be used to install the frameworks (conda could be used much the same way). Installation is performed under your home directory.

Reserve some AMD GPU nodes with OAR

Note: As of July 2021, only the "neowise" cluster (testing queue, Lyon) has AMD GPUs.

  • Reserve a node with some AMD GPUs (see the Hardware page for the list of sites and clusters with GPUs).
$ oarsub -I -l gpu=1 -t exotic -p "gpu_model like 'Radeon%'"

Please avoid reserving a single GPU on nodes with many GPUs (e.g. ≥ 4) if you only need to run code on one GPU. For instance, the neowise cluster is a poor choice for a user who only needs one GPU at a time.

To reserve the full node (with all its GPUs):

$ oarsub -I -l host=1 -t exotic -p "gpu_model like 'Radeon%'"
  • Once connected to the node, check GPU presence:
$ rocm-smi 

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan   Perf  PwrCap  VRAM%  GPU%  
0    26.0c  19.0W   930Mhz  350Mhz  255%  auto  225.0W    0%   0%    
================================================================================
============================= End of ROCm SMI Log ==============================

PyTorch

For instance (as of December 2021), selecting “Stable”, “Linux”, “Pip”, “Python”, “ROCm 4.2” gives these commands:

$ pip3 install torch -f https://download.pytorch.org/whl/rocm4.2/torch_stable.html
$ pip3 install torchvision==0.11.1 -f https://download.pytorch.org/whl/rocm4.2/torch_stable.html
  • Check that PyTorch is correctly installed and can use the GPU:
$ python3 -c "import torch; print('Num GPUs Available:', torch.cuda.device_count())"
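The check above calls torch.cuda even on AMD hardware because ROCm builds of PyTorch reuse the torch.cuda interface. To confirm that the installed wheel really is a ROCm build, one option is to inspect torch.version, as in this sketch (the function name torch_backend is hypothetical):

```python
def torch_backend():
    """Return 'rocm', 'cuda', 'cpu' or 'not installed' for the local PyTorch build."""
    try:
        import torch
    except ImportError:
        return "not installed"
    # torch.version.hip is set on ROCm builds; getattr guards older releases
    # where the attribute may be missing.
    if getattr(torch.version, "hip", None):
        return "rocm"
    # torch.version.cuda is set on CUDA builds and is None otherwise.
    if torch.version.cuda:
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print("PyTorch build:", torch_backend())
```

A result of "cuda" or "cpu" on an AMD node suggests that pip picked a wheel from the default index instead of the ROCm one.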

TensorFlow

On AMD GPUs, TensorFlow is only supported through Docker images.

  • Enable Docker on your node (the --tmp option makes Docker use the /tmp directory for its storage), then start the rocm/tensorflow container:
$ g5k-setup-docker --tmp
$ alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /tmp/dockerx:/dockerx'
$ drun rocm/tensorflow:latest
  • From within the Docker container, check that TensorFlow is correctly installed and can use the GPU:
python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"

Deep learning on ppc64 nodes

About the ppc64 architecture

Grid'5000 has an IBM cluster with a total of 48 GPUs.

This cluster uses the ppc64 architecture, which is much less common than the usual x86_64 (amd64) architecture. In particular, many deep learning frameworks primarily target x86_64 and may be hard to use on ppc64.

As a result, if you want to use this cluster for deep learning, expect to spend more time setting up your experiments than on the usual x86_64 clusters.

Options to install deep learning tools

We provide installation guides for three popular deep learning frameworks: PyTorch, TensorFlow and MXNet.

In general, there are several methods to install deep learning tools, each with advantages and disadvantages:

  • modules: we provide pre-built software stacks for several deep learning tools. This is the easiest way to use them. If you need specific versions or build options, contact us.
  • IBM PowerAI conda channel: IBM provides a conda channel with deep learning tools built for ppc64. It is easy to install from, but the tool versions it provides are often quite out of date.
  • pip packages: a few tools provide pip packages for ppc64, but this is rare. Most pip packages are only available for x86_64.
  • Docker images: we support installing Nvidia-docker with GPU support (see Docker#Nvidia-docker). You will need to run ppc64 Docker images, though.
  • build from source: this is for advanced users.

See below for details on how to install each tool.

Reserve ppc64 GPU nodes with OAR

To reserve a full node for one hour:

$ oarsub -I -p cluster=drac -l host=1,walltime=1:00
  • Once connected to the node, check GPU presence and the available CUDA version:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.197.02   Driver Version: 418.197.02   CUDA Version: 11.2     |
+-----------------------------------------------------------------------------+
(...)
Note: Nodes in the drac cluster come with a known-working Nvidia driver version in their default environment. If you install a more recent driver or deploy your own images, you may experience frequent system crashes with recent Nvidia drivers on Debian or Ubuntu. CentOS seems unaffected by these crashes. See the Nvidia developer forum for details.

IBM PowerAI conda channel

IBM provides a conda channel called PowerAI with several deep learning tools built for ppc64. It makes it easy to install these tools, but the available versions are often not up-to-date.

For convenience, we provide a version of conda for ppc64 as a module:

$ module load miniconda3
$ conda --help

See https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.htm for more instructions with PowerAI.

The list of packages can be found here: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/ (the latest 1.7.0 release is known to work)

Note: Conda packages will be installed in your home directory at ~/.conda/. Deep learning tools can easily take several GB of space: you may need to clean up from time to time or request more Storage.
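To see how much space the conda directory currently takes, a short script is enough. The sketch below is only one way to do it (the function name is hypothetical). It sums apparent file sizes, so the result may differ slightly from the space actually allocated on disk:

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of regular files under path (symbolic links are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file disappeared or is unreadable: ignore it.
                pass
    return total


if __name__ == "__main__":
    conda_dir = os.path.expanduser("~/.conda")
    size_gb = directory_size_bytes(conda_dir) / 1e9
    print("{}: {:.2f} GB".format(conda_dir, size_gb))
```

A missing directory simply yields 0, so the script can be run before anything has been installed.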

PyTorch on ppc64

Load pytorch from modules

We provide a pre-built version of pytorch, and we can provide more versions on request. It is the easiest way to use pytorch as there is nothing to install.

As of November 2021, we provide pytorch 1.7.1. To use it:

$ module load python py-torch
$ python3 --version
Python 3.7.9
$ python3 -c 'import torch; print(torch.cuda.is_available())'
True

Note that you need to use the version of Python from our Environment_modules, because PyTorch is built against Python 3.7 and won't work with the version of Python available in Debian 11 (Python 3.9).

That's it: your pytorch projects should now work while the module is loaded.

If you want to load the module in a non-interactive job, see Environment_modules#Using_modules_in_jobs

Install pytorch from IBM PowerAI

PowerAI 1.7.0 provides pytorch 1.3.1

To install it, load conda and create a Python 3.7 environment:

$ module load miniconda3
$ eval "$(conda shell.bash hook)"
$ conda create --name pytorch-ppc64-py37 python=3.7
$ conda activate pytorch-ppc64-py37

Add PowerAI repository:

$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

Install pytorch:

$ conda install pytorch

It will take around 10 minutes to download and install. Test that it works:

$ python3 -c "import torch; print(torch.cuda.is_available())"
True
Note: If this doesn't work, make sure that you are using the correct Python interpreter provided by Conda, using e.g. which python3. In some cases, you might have to specify the interpreter as python3.7.

Tensorflow on ppc64

Install from pip

TensorFlow is not available from pip for ppc64. However, an unofficial pip package exists and provides a reasonably recent version: TensorFlow 2.3.2.

Unfortunately, as of November 2021, this unofficial package is not compatible with Python 3.9. It means that we have to use Python 3.7 through environment modules as a workaround.

Start by loading Python 3.7:

$ module load python
$ python3 --version
Python 3.7.9

Then create a virtualenv:

$ python3 -m venv ~/venv-py3-tensorflow
$ . ~/venv-py3-tensorflow/bin/activate

Then install TensorFlow from the unofficial pip wheel:

$ wget https://powerci.osuosl.org/job/TensorFlow2_PPC64LE_GPU_Release_Build/20/artifact/tensorflow_pkg/tensorflow-2.3.2-cp37-cp37m-linux_ppc64le.whl
$ pip install --upgrade pip setuptools
$ pip install ./tensorflow-2.3.2-cp37-cp37m-linux_ppc64le.whl

It takes around 5-10 minutes to install because some dependencies need to be compiled.
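The wheel file name itself explains the Python 3.7 requirement: wheel names follow the pattern name-version-pythontag-abitag-platformtag.whl, and here the tags are cp37, cp37m and linux_ppc64le. As an illustrative sketch (both function names are hypothetical, and the interpreter tag assumes CPython), the following script splits the name and compares its Python tag with the running interpreter:

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel file name of the form name-version[-build]-python-abi-platform.whl."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected number of fields in: " + filename)
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def running_cpython_tag():
    """Python tag of the running interpreter, assuming CPython (for example 'cp37')."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


if __name__ == "__main__":
    tags = parse_wheel_filename("tensorflow-2.3.2-cp37-cp37m-linux_ppc64le.whl")
    print(tags)
    print("matches this interpreter:", tags["python"] == running_cpython_tag())
```

If the last line prints False, pip will most likely refuse the wheel as unsupported on this platform, which usually means the python module was not loaded before creating the virtualenv.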

At runtime, you will need cuDNN. You can install it yourself, or use the module we provide for convenience:

$ module load cudnn

Test that it works:

$ python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Num GPUs Available: 4

As before, if you want to load cudnn in a non-interactive job, see Environment_modules#Using_modules_in_jobs

Install from IBM PowerAI

PowerAI 1.7.0 provides TensorFlow 2.1.3. The principle is the same as for PyTorch: prepare a conda environment with Python 3.7 and add the PowerAI channel:

$ module load miniconda3
$ eval "$(conda shell.bash hook)"
$ conda create --name tensorflow-ppc64-py37 python=3.7
$ conda activate tensorflow-ppc64-py37
$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

Install Tensorflow with GPU support:

$ conda install tensorflow-gpu

It will take around 10 minutes to download and install. Test that it works:

$ python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Num GPUs Available: 4

Build tensorflow from source

The last option is to build tensorflow from source yourself, which is useful if you need a specific version or specific features. This is for advanced users and we provide no support.

It has been reported to work with a CentOS docker container using https://github.com/tensorflow/build/tree/master/ppc64le_builds

See https://github.com/anji993/build/tree/anji993-patch-1/ppc64le_builds for build instructions on Grid'5000.

Mxnet on ppc64

Load mxnet from modules

We provide a pre-built version of mxnet in our Environment_modules. It is an easy way to use mxnet as there is nothing to install.

As of November 2021, we provide mxnet 1.7.0. To use it:

$ module load python mxnet
$ python3 --version
Python 3.7.9
$ python3 -c "import mxnet; print('Num GPUs Available:', mxnet.context.num_gpus())"
Num GPUs Available: 4

Note that you also need to use the version of Python from our Environment_modules, because MXNet is built against Python 3.7 and won't work with the version of Python available in Debian 11 (Python 3.9).

Nvidia-docker for ppc64

Installation

To easily install Nvidia-docker on a node, see Docker#Nvidia-docker.

Running ppc64le Docker images

You need to make sure you are running Docker images that are built for ppc64le.

Example sources of ppc64le images:

To test tensorflow with an image from IBM:

$ tensorflowtest="import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
$ docker run -it --rm --gpus all ibmcom/tensorflow-ppc64le:latest-gpu-py3 python -c "$tensorflowtest"
2021-02-15 11:33:10.853846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1686] Adding visible gpu devices: 0, 1, 2, 3
Num GPUs Available: 4