Deep Learning Frameworks
Revision as of 15:51, 9 February 2021
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
This page describes how to install common deep learning frameworks.
Deep learning on x86_64 nodes (common case)
pip will be used to install the frameworks (conda could be used much the same way). Installation is performed under your home directory.
Reserve some GPU nodes with OAR
- Reserve a node with some GPUs (see the Hardware page for the list of sites and clusters with GPUs).
For instance, to reserve one GPU using OAR:
$ oarsub -I -l gpu=1
(remember to add the '-q production' option if you want to reserve a GPU from the Nancy "production" resources)
To reserve the full node:
$ oarsub -I -l host=1
To reserve a GPU or a full node on a specific cluster, add to the oarsub command:
-p cluster=<clustername>
- Once connected to the node, check GPU presence and the available CUDA version:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
(...)
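The reservation options above can be combined in a single command. The following is a minimal sketch of how they fit together, not an official tool: the function name build_oarsub_command and its parameters are hypothetical, and only the oarsub options shown on this page are used.

```python
import shlex


def build_oarsub_command(gpus=None, cluster=None, queue=None, walltime=None):
    """Return an interactive oarsub command as a list of arguments.

    Hypothetical helper: gpus=None requests a full node (host=1),
    otherwise the given number of GPUs (gpu=<gpus>).
    """
    resource = "host=1" if gpus is None else f"gpu={gpus}"
    if walltime is not None:
        # The walltime is appended to the resource request, e.g. host=1,walltime=1:00
        resource += f",walltime={walltime}"
    command = ["oarsub", "-I", "-l", resource]
    if queue is not None:
        command += ["-q", queue]
    if cluster is not None:
        command += ["-p", f"cluster={cluster}"]
    return command


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(build_oarsub_command(gpus=1, queue="production")))
```

The command is only printed here. Copy it to a frontend shell to actually submit the reservation.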
PyTorch
- Go to the PyTorch website to find the installation command that suits your setup.
For instance (as of May 2020), selecting “Stable”, “Linux”, “Pip”, “Python”, “Cuda 10.1” gives this command to execute:
$ pip3 install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
- Check that PyTorch is correctly installed and can use the GPU:
$ python3 -c "import torch; print(torch.cuda.is_available())"
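The "+cu101" suffix in the pip command above corresponds to the CUDA version reported in the nvidia-smi header (10.1). As an illustration only, the sketch below extracts that version from the header text and derives the matching suffix. The function names are hypothetical, and the sample header is the one shown on this page.

```python
import re

# Header line as printed by nvidia-smi on the nodes described in this page.
SAMPLE_HEADER = "| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |"


def parse_cuda_version(nvidia_smi_output):
    """Return (major, minor) from nvidia-smi output, or None if not found."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_wheel_tag(version):
    """Turn (10, 1) into the 'cu101' suffix used in package names."""
    major, minor = version
    return f"cu{major}{minor}"


version = parse_cuda_version(SAMPLE_HEADER)
print(version, cuda_wheel_tag(version))
```

On a node, the real header can be obtained by capturing the output of nvidia-smi, for example with subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout.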
Tensorflow (with Keras)
- Go to the TensorFlow website to see the installation commands. As of May 2020 (TensorFlow v2.2.0), they are:
$ pip3 install --upgrade pip
$ pip3 install tensorflow
- To use GPUs, TensorFlow requires the cuDNN library. We provide it as a module to load:
$ module load cudnn
- Now check that TensorFlow is correctly installed and can use the GPU:
$ python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Note: This installs TensorFlow v2. If you need TensorFlow v1, see https://www.tensorflow.org/guide/migrate
MXNet
- Go to the MXNet website to find the installation command that suits your setup.
For instance (as of May 2020), selecting “Linux”, “Python”, “GPU” and “Pip”, the command to execute (in order to use Cuda 10.1) is:
$ pip3 install mxnet-cu101
- Check that MXNet is correctly installed and can use the GPU:
$ python3 -c "import mxnet; print('Num GPUs Available:', mxnet.context.num_gpus())"
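The three one-line checks above can be gathered into one script that reports on whichever frameworks are installed in the current environment. This is a sketch under assumptions: the function name count_gpus is hypothetical, and a framework that is missing or whose check fails is simply reported as None.

```python
import importlib

# One probe per framework; the TensorFlow and MXNet calls are the ones
# used in the one-liners on this page.
PROBES = {
    "torch": lambda m: m.cuda.device_count(),
    "tensorflow": lambda m: len(m.config.experimental.list_physical_devices("GPU")),
    "mxnet": lambda m: m.context.num_gpus(),
}


def count_gpus():
    """Return {framework: number of visible GPUs, or None if unavailable}."""
    report = {}
    for name, probe in PROBES.items():
        try:
            module = importlib.import_module(name)
            report[name] = probe(module)
        except Exception:
            # Framework not installed, or its GPU probe failed.
            report[name] = None
    return report


for framework, gpus in count_gpus().items():
    status = "not available" if gpus is None else f"{gpus} GPU(s) visible"
    print(f"{framework}: {status}")
```

A count of 0 means the framework imports correctly but sees no GPU, which usually points to a CPU-only package or a missing library such as cuDNN.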
Additional resources
- An in-depth tutorial contributed by a Grid'5000 user, Ismael Bada
- Many Docker images provide a ready-to-use deep learning software stack. They can be executed using Docker or Singularity (with the appropriate options to enable GPU usage). See the corresponding wiki pages to learn how to use these tools in Grid'5000.
- If you want to use virtualenv to manage your Python packages, it is available in Grid'5000 standard environments. Create your environment with python3 -m venv <env_directory> and activate it using source <env_directory>/bin/activate before using pip and installed packages.
- If you prefer to use conda to manage your Python packages, it is available in Grid'5000 as a module. Just execute "module load miniconda3" from a node or a frontend to make it available.
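The virtualenv option mentioned above can also be scripted with Python's standard venv module, which is what "python3 -m venv" uses. The following is a minimal sketch: the function name create_env is hypothetical, and the path in the example comment is only a placeholder.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment, like `python3 -m venv <env_directory>`."""
    env_path = Path(env_dir).expanduser()
    venv.create(env_path, with_pip=with_pip)
    return env_path


# Example (placeholder path, not executed here):
# create_env("~/venv-py3-demo")
```

Activation is still done from the shell, with source <env_directory>/bin/activate, before using pip and the installed packages.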
Deep learning on ppc64 nodes
About the ppc64 architecture
Grid'5000 has an IBM cluster with a total of 48 GPUs.
This cluster is using a ppc64 architecture, which is much less common than the usual x86_64 (amd64) architecture. In particular, many deep learning frameworks are primarily targeted at x86_64 and may be hard to use on ppc64.
As a result, if you want to use this cluster for deep learning, you should be ready to invest more time in setting up your experiments than on the usual x86_64 clusters.
Options to install deep learning tools
We provide installation guides for three popular deep learning frameworks: PyTorch, TensorFlow and MXNet.
In general, there are several methods to install deep learning tools, each with advantages and disadvantages:
- modules: we provide pre-built software stacks for several deep learning tools: this is the easiest way to use them. If you need specific versions or build options, contact us.
- IBM PowerAI conda channel: IBM provides a conda channel with deep learning tools built for ppc64. It is easy to install from, but the tool versions it provides are often quite out of date.
- pip packages: a few tools provide pip packages for ppc64, but this is rare: most pip packages are only available for x86_64
- build from source: this is for advanced users
See below for details on how to install each tool.
Reserve ppc64 GPU nodes with OAR
- Reserve a ppc64 node with GPUs (see the Hardware page for details on the drac cluster).
To reserve a full node for one hour:
$ oarsub -I -p cluster=drac -l host=1,walltime=1:00
- Once connected to the node, check GPU presence and the available CUDA version:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.165.02   Driver Version: 418.165.02   CUDA Version: 10.1     |
(...)
Note: ppc64 nodes come with a known-working Nvidia driver version in their default environment, but it only supports CUDA versions up to 10.1. If you install a more recent driver or deploy your own images, you may experience frequent system crashes with recent Nvidia drivers on Debian or Ubuntu. CentOS seems unaffected by the crashes. See the Nvidia developer forum.
IBM PowerAI conda channel
IBM provides a conda channel called PowerAI with several deep learning tools built for ppc64.
It makes it easy to install these tools, but the available versions are often not up-to-date. In addition, we are forced to use PowerAI 1.6.2 specifically because newer versions are not compatible with CUDA 10.1.
For convenience, we provide a version of conda for ppc64 as a module:
$ module load miniconda3
$ conda --help
To install a package, you will need to specify version 1.6.2 of PowerAI, so that conda will install the appropriate version of the package. For instance:
$ conda install example_package powerai-release=1.6.2
See https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.htm for more instructions with PowerAI.
The list of packages can be found here: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/ (don't forget to select the 1.6.2 release)
Note: Conda packages will be installed in your home directory at
PyTorch on ppc64
- Load pytorch from modules
We provide a pre-built version of pytorch, and we can provide more versions on request. It is the easiest way to use pytorch as there is nothing to install.
As of February 2021, we provide pytorch 1.7.1.
To use it:
$ module load py-torch
$ python3 -c 'import torch; print(torch.cuda.is_available())'
True
- Install pytorch from IBM PowerAI
PowerAI provides pytorch 1.2.0.
To install it, load conda and create a Python 3.7 environment:
$ module load miniconda3
$ eval "$(conda shell.bash hook)"
$ conda create --name pytorch-ppc64-py37 python=3.7
$ conda activate pytorch-ppc64-py37
Add PowerAI repository:
$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
Install a pytorch version that is built against CUDA 10.1:
$ conda install pytorch powerai-release=1.6.2
Test that it works:
$ python -c "import torch; print(torch.cuda.is_available())"
True
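The PowerAI steps above depend only on the environment name and the package name, so they can be generated for any package in the channel (for example tensorflow-gpu). The sketch below is only an illustration: the function name powerai_commands is hypothetical. Unlike the steps above, it passes --name to conda install instead of activating the environment, and adds --yes to skip confirmation prompts.

```python
import shlex

# Channel URL and release number as given on this page.
POWERAI_CHANNEL = "https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/"
POWERAI_RELEASE = "1.6.2"


def powerai_commands(env_name, package, python_version="3.7"):
    """Return the conda commands that install `package` from PowerAI."""
    return [
        # Create the environment with the requested Python version.
        ["conda", "create", "--yes", "--name", env_name, f"python={python_version}"],
        # Add the PowerAI channel in front of the default channels.
        ["conda", "config", "--prepend", "channels", POWERAI_CHANNEL],
        # Pin powerai-release so that conda picks builds made for CUDA 10.1.
        ["conda", "install", "--yes", "--name", env_name, package,
         f"powerai-release={POWERAI_RELEASE}"],
    ]


# Print the commands for review rather than executing them.
for command in powerai_commands("pytorch-ppc64-py37", "pytorch"):
    print(shlex.join(command))
```

The commands are printed rather than executed. Run them after "module load miniconda3" so that conda is available.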
Tensorflow on ppc64
- Install with IBM PowerAI
The principle is the same as for PyTorch. First, prepare a conda environment:
$ module load miniconda3
$ eval "$(conda shell.bash hook)"
$ conda create --name tensorflow-ppc64-py37 python=3.7
$ conda activate tensorflow-ppc64-py37
$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
Install the correct version of Tensorflow:
$ conda install tensorflow-gpu powerai-release=1.6.2
Test that it works:
$ python -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Num GPUs Available: 4
- Install from pip
TensorFlow is not available in pip for ppc64. However, we can use an unofficial pip package. It provides a more recent version of TensorFlow than IBM PowerAI (currently: 2.2.0 vs 1.15.4).
Start by creating a virtualenv:
$ python3 -m venv ~/venv-py3-tensorflow
$ . ~/venv-py3-tensorflow/bin/activate
Then install TensorFlow from the unofficial pip wheel:
$ wget https://powerci.osuosl.org/job/TensorFlow2_PPC64LE_GPU_Release_Build/lastSuccessfulBuild/artifact/tensorflow_pkg/tensorflow-2.2.0-cp37-cp37m-linux_ppc64le.whl
$ pip install --upgrade pip setuptools
$ pip install ./tensorflow-2.2.0-cp37-cp37m-linux_ppc64le.whl
It might take around 20-30 minutes to install because some dependencies need to be compiled.
At runtime, you will need cuDNN. You can install it yourself, or load the module we provide for convenience:
$ module load cudnn
Test that it works:
$ python -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
Num GPUs Available: 4
- Build tensorflow from sources
The last option is to build tensorflow from source yourself, which is useful if you need a specific version or specific features. This is for advanced users and we provide no support.
It has been reported to work with a CentOS docker container using https://github.com/tensorflow/build/tree/master/ppc64le_builds
See https://github.com/anji993/build/tree/anji993-patch-1/ppc64le_builds for build instructions on Grid'5000.
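The file name of the unofficial wheel used in the pip method encodes the Python version and architecture it was built for (cp37 and linux_ppc64le), and pip refuses to install a wheel whose tags do not match the running interpreter. As a rough illustration, the sketch below splits the file name into its fields and compares them with the current interpreter. It is a simplified check, not pip's full compatibility logic, and both function names are hypothetical.

```python
import platform
import sys


def parse_wheel_filename(filename):
    """Split a wheel file name into (name, version, python, abi, platform)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    name, version = parts[0], parts[1]
    # The last three fields are always the python, ABI and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return name, version, python_tag, abi_tag, platform_tag


def wheel_matches(filename, python_tag=None, machine=None):
    """Simplified check: same CPython version and same CPU architecture."""
    if python_tag is None:
        python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    if machine is None:
        machine = platform.machine()
    _, _, wheel_python, _, wheel_platform = parse_wheel_filename(filename)
    return wheel_python == python_tag and wheel_platform.endswith("_" + machine)


WHEEL = "tensorflow-2.2.0-cp37-cp37m-linux_ppc64le.whl"
print(parse_wheel_filename(WHEEL))
print("Matches this interpreter:", wheel_matches(WHEEL))
```

If the second line prints False, check that the virtualenv was created with Python 3.7 and that you are on a ppc64 node before running pip install.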
Mxnet on ppc64
- Load mxnet from modules
We provide a pre-built version of mxnet, and we can provide more versions on request. It is an easy way to use mxnet as there is nothing to install.
$ module load mxnet
$ python3 -c "import mxnet; print('Num GPUs Available:', mxnet.context.num_gpus())"
Num GPUs Available: 4
Nvidia-docker for ppc64
It is possible to find deep learning Docker images for ppc64, for instance: https://hub.docker.com/r/ibmcom/tensorflow-ppc64le
Our Docker installation script does not support ppc64 yet, see https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=12704
User-contributed installation instructions can be found here: https://github.com/anji993/build/tree/anji993-patch-1/ppc64le_builds