Deep Learning Frameworks
Note | |
---|---|
This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team. |
This page describes installation steps of common Deep Learning frameworks.
Deep learning with Nvidia GPUs on x86_64 nodes (common case)
conda will be used to install the frameworks (pip could be used much the same way). Installation is performed under your home directory.
Please refer to Conda's documentation on Grid'5000.
Reserve some GPU nodes with OAR
- Reserve a node with some GPUs (see the Hardware page for the list of sites and clusters with GPUs).
For instance, to reserve one GPU using OAR:
Remember to add the -q production option if you want to reserve a GPU from the Nancy or Rennes "production" resources.
Please avoid reserving a single GPU on nodes with many GPUs (e.g. ≥ 4) if you only need to run code on one GPU. For instance, using the gemini cluster for only one GPU at a time is discouraged.
To reserve the full node (with all its GPUs):
To reserve a GPU or a full node on a specific cluster, add to the oarsub command: -p cluster=<clustername>
- Once connected to the node, check GPU presence and the available CUDA version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
(...)
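The banner printed by nvidia-smi carries the driver version and the highest CUDA version that this driver supports. As a rough sketch, both values can be extracted from the text with a regular expression. The helper name `parse_nvidia_smi_header` is hypothetical, and the sample string is the banner line shown above.

```python
import re

# Hypothetical helper: pull the two versions out of the nvidia-smi banner text.
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version), or None if the banner is absent."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


sample = "| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |"
print(parse_nvidia_smi_header(sample))
```

On a node, the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the sample string.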
Which machine should be used to create Conda environment?
Installing Conda packages can be time- and resource-consuming. Preferably perform this operation on a node rather than on a frontend, since frontends might not have enough RAM for conda.
NVIDIA Tools
NVIDIA libraries are available via Conda, which lets you manage project-specific versions of the NVIDIA CUDA Toolkit, NCCL, and cuDNN. NVIDIA maintains its own Conda channel. The versions of the CUDA Toolkit available from the default channels are the same as those on the NVIDIA channel.
- Create and activate a dedicated conda environment
- To compare the build numbers of the versions available from the default and nvidia channels
See:
Cudatoolkit
- Install cudatoolkit from the nvidia channel.
Note: do not forget to create a dedicated environment beforehand.
Cuda
cuda is available in both the conda-forge and nvidia channels.
- Install cuda from the nvidia channel:
Note: do not forget to create a dedicated environment beforehand.
- Installing Previous CUDA Releases
All Conda packages released under a specific CUDA version are labeled with that release version. To install a previous version, include that label in the install command to ensure that all cuda dependencies come from the wanted CUDA version. For instance, if you want to install cuda 11.3.0:
- To display the version of Nvidia cuda compiler installed:
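For the two steps above (installing from a release label, then checking the compiler version, typically with `nvcc --version`), a minimal sketch follows. It assumes the label scheme `nvidia/label/cuda-<version>` used in NVIDIA's conda instructions. Both function names are hypothetical, and the sample nvcc line is illustrative rather than captured on Grid'5000.

```python
import re


def cuda_label_install_command(version):
    """Build a conda command that takes every CUDA package from one release label."""
    # Assumed label scheme: nvidia/label/cuda-<major.minor.patch>
    return ["conda", "install", "cuda", "-c", f"nvidia/label/cuda-{version}"]


def parse_nvcc_release(nvcc_output):
    """Extract the release (e.g. '11.3') from the text printed by `nvcc --version`."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


print(" ".join(cuda_label_install_command("11.3.0")))

# Illustrative line in the style of nvcc's version banner.
sample = "Cuda compilation tools, release 11.3, V11.3.58"
print(parse_nvcc_release(sample))
```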
PyTorch
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. It can automatically detect GPU availability at run-time.
- Installation
- Load conda and activate your PyTorch environment
- Simple PyTorch installation from nvidia channel
- Custom PyTorch installation: go to the PyTorch website to find the installation command that suits you.
For instance (as of April 2023), for a full installation on Linux, you might combine the Stable PyTorch build with the Python language and a specific CUDA version (e.g., 11.7). This can be done with this command:
- Verify your installation
- Check which Python binary is used:
/home/login/.conda/envs/env_name/bin/python
- Construct a randomly initialized tensor.
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.3485, 0.6268, 0.8004],
        [0.3265, 0.9763, 0.5085],
        [0.6087, 0.6940, 0.8929],
        [0.2143, 0.6307, 0.5182],
        [0.0076, 0.6455, 0.5223]])
- Print the Cuda version
>>> import torch
>>> print("Pytorch CUDA Version is ", torch.version.cuda)
Pytorch CUDA Version is 11.7
- Verify your installation on a GPU node
- Reserve only one GPU (with the associated CPU cores and share of memory) in interactive mode:
- Load conda and activate your Pytorch environment on the node
- Launch python and execute the following code:
>>> import torch
>>> print("Whether CUDA is supported by our system: ", torch.cuda.is_available())
Whether CUDA is supported by our system: True
- To know the CUDA device ID and name of the device, you can run:
>>> import torch
>>> Cuda_id = torch.cuda.current_device()
>>> print("CUDA Device ID: ", torch.cuda.current_device())
CUDA Device ID: 0
>>> print("Name of the current CUDA Device: ", torch.cuda.get_device_name(Cuda_id))
Name of the current CUDA Device: GeForce GTX 1080 Ti
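The interactive checks above can be gathered into one function that is safe to run even in an environment where PyTorch is not installed. This is a sketch rather than an official diagnostic: `torch_report` is a hypothetical name, and apart from `torch.__version__` it only relies on the calls already used in this section.

```python
import importlib.util


def torch_report():
    """Summarise the PyTorch installation as a dictionary."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    report = {
        "installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        device_id = torch.cuda.current_device()
        report["device_id"] = device_id
        report["device_name"] = torch.cuda.get_device_name(device_id)
    return report


for key, value in torch_report().items():
    print(f"{key}: {value}")
```

Run on a frontend, `cuda_available` is expected to be False. The device entries only appear on a node where a GPU has been reserved.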
Tensorflow
TensorFlow offers multiple levels of abstraction so you can choose the right one for your needs. Build and train models by using the high-level Keras API, which makes getting started with TensorFlow and machine learning easy.
- Installation
Warning | |
---|---|
By default, conda installs the current release of CPU-only TensorFlow. To install GPU TensorFlow, use |
- on a GPU node
- Reserve only one GPU (with the associated CPU cores and share of memory) in interactive mode:
- Load conda and activate a specific TensorFlow environment (see Conda Documentation)
gpunode :
|
module load conda
conda activate TensorFlow |
- Install TensorFlow from conda-forge channel (takes a long time!) using mamba
- Test the installation : print tf version
>>> import tensorflow as tf
>>> print('tensorflow version', tf.__version__)
tensorflow version 2.12.0
- Test the installation : list GPU devices
>>> import tensorflow as tf
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13861454427122602632
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 40231960576
locality {
  bus_id: 2
  numa_node: 1
  links {
  }
}
incarnation: 5318792213783102490
physical_device_desc: "device: 0, name: A100-PCIE-40GB, pci bus id: 0000:81:00.0, compute capability: 8.0"
xla_global_id: 416903419
]
- Test the installation : multiplication
>>> import tensorflow as tf
>>> x = [[2.]]
>>> print('hello, {}'.format(tf.matmul(x, x)))
hello, [[4.]]
To go further : https://docs.anaconda.com/anaconda/user-guide/tasks/tensorflow/
If you need TensorFlow v1, see https://www.tensorflow.org/guide/migrate
Note | |
---|---|
As an alternative to the conda installation, and as indicated on the official TensorFlow website, you can install tensorflow-gpu inside a conda environment using pip. |
Keras
Keras is a high-level neural networks API, written in Python, that is used as a wrapper around TensorFlow. It was developed with a focus on enabling fast experimentation. It is the recommended tool for beginners, and even for advanced users who do not want to spend too much time dealing with the complexity of low-level libraries such as TensorFlow.
- Installation
- Since version 2.4, Keras has refocused exclusively on the TensorFlow implementation of Keras. Therefore, to use Keras, you need to have the TensorFlow package installed:
Note: do not forget to create a dedicated environment beforehand.
- Verify the installation
- Check which Python binary is used:
/home/login/.conda/envs/env_name/bin/python
- Print the Keras version
>>> from tensorflow import keras
>>> print(keras.__version__)
2.10.0
To go further:
Scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
- Installation
Note: do not forget to create a dedicated environment beforehand.
- Verify your installation
>>> import sklearn
>>> sklearn.show_versions()

System:
    python: 3.10.9 (main, Mar 1 2023, 18:23:06) [GCC 11.2.0]
executable: /home/xxxx/.conda/envs/test/bin/python
   machine: Linux-5.10.0-21-amd64-x86_64-with-glibc2.31

Python dependencies:
          pip: 22.3.1
   setuptools: 65.6.3
      sklearn: 1.0.2
        numpy: 1.23.5
        scipy: 1.8.1
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True
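To see the preprocessing, model fitting, and model evaluation tools mentioned above working together, a small end-to-end run can serve as a smoke test. The sketch below uses the bundled iris dataset and a logistic regression model, both chosen only for illustration. `run_demo` is a hypothetical name and returns None when scikit-learn is absent.

```python
import importlib.util


def run_demo(seed=0):
    """Fit a small classifier and return its held-out accuracy (None if sklearn is missing)."""
    if importlib.util.find_spec("sklearn") is None:
        return None
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # Preprocessing and model fitting chained into a single estimator.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)  # mean accuracy on the test split


print("held-out accuracy:", run_demo())
```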
To go further:
- scikit-learn Installation
- scikit-learn.org Tutorials
- Dataquest Scikit-learn Tutorial
- Another Python SciKit Learn Tutorial
Additional resources
- If you want to load a module in a non-interactive job, see Modules#Using_modules_in_jobs
- An in-depth tutorial contributed by a Grid'5000 user, Ismael Bada
- Many Docker images exist with a ready-to-use Deep Learning software stack. They can be executed using Docker or Singularity (with the appropriate options to enable GPU usage). See the corresponding wiki pages to learn how to use these tools in Grid'5000.
- If you want to use virtualenv to manage your Python packages, it is available in Grid'5000 standard environments. Create your environment with python3 -m venv path/to/env_directory and activate it using source path/to/env_directory/bin/activate before using pip and installing packages.
- If you prefer to use conda to manage your Python packages, it is available in Grid'5000 as a module. Just execute module load conda from a node or a frontend to make it available (consult the specific documentation of conda on Grid'5000).
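The virtualenv step above can also be driven from Python through the standard-library `venv` module, which is what `python3 -m venv` runs. A minimal sketch, where `create_env` and the temporary directory are illustrative choices:

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment and return the path to its activate script."""
    # symlinks=True mirrors the default behaviour of `python3 -m venv` on Linux.
    venv.EnvBuilder(with_pip=with_pip, symlinks=True).create(env_dir)
    return Path(env_dir) / "bin" / "activate"  # POSIX layout


with tempfile.TemporaryDirectory() as tmp:
    # with_pip=False keeps this demonstration fast; keep the default to get pip.
    activate = create_env(Path(tmp) / "env_directory", with_pip=False)
    print(activate.name, activate.exists())
```

The returned script still has to be sourced from a shell. A Python process cannot activate an environment for its parent shell.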
Deep learning with AMD GPUs
conda will be used to install the frameworks (pip could be used much the same way). Installation is performed under your home directory.
Reserve some AMD GPU nodes with OAR
- Reserve a node with some AMD GPUs (see the Hardware page for the list of sites and clusters with GPUs).
Please avoid reserving a single GPU on nodes with many GPUs (e.g. ≥ 4) if you only need to run code on one GPU. For instance, using the neowise cluster for only one GPU at a time is discouraged.
To reserve the full node (with all its GPUs):
- Once connected to the node, check GPU presence:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
0    26.0c  19.0W   930Mhz  350Mhz  255%  auto  225.0W    0%   0%
================================================================================
============================= End of ROCm SMI Log ==============================
PyTorch
- Go to the PyTorch website to find the installation command that suits you.
For instance (as of February 2024), selecting “Stable”, “Linux”, “Pip”, “Python”, “ROCM 5.7” gives this command to execute:
neowise :
|
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7 |
- Check that PyTorch is correctly installed and works with the GPUs:
Num GPUs Available: 8
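One way to run such a check from Python is sketched here. It relies on ROCm builds of PyTorch exposing AMD GPUs through the `torch.cuda` namespace and setting `torch.version.hip`. The function name `rocm_gpu_count` is hypothetical, and the sketch is not necessarily the command that produced the output above.

```python
import importlib.util


def rocm_gpu_count():
    """Return (gpu_count, hip_version); (0, None) if PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return 0, None
    import torch

    # ROCm builds reuse the torch.cuda API; torch.version.hip is None on CUDA builds.
    hip_version = getattr(torch.version, "hip", None)
    count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return count, hip_version


count, hip = rocm_gpu_count()
print("Num GPUs Available:", count)
print("HIP version:", hip)
```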
Tensorflow
- Enable docker on your node (--tmp option is used to use /tmp directory for docker storage)
- Start ROCm's Tensorflow as explained in https://hub.docker.com/r/rocm/tensorflow/
- From within the Docker container, check that Tensorflow is correctly installed and works with the GPUs:
neowise :
|
python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))" |
Num GPUs Available: 8
Deep learning on ppc64 nodes
About the ppc64 architecture
Grid'5000 has an IBM cluster (drac) with a total of 48 GPUs.
This cluster is using a ppc64 architecture, which is much less common than the usual x86_64 (amd64) architecture. In particular, many deep learning frameworks are primarily targeted at x86_64 and may be hard to use on ppc64.
As a result, if you want to use this cluster for deep learning, you should be ready to invest more time to set up your experiments compared to the usual x86_64 clusters.
Options to install deep learning tools
We provide installation guides for three popular deep learning frameworks: PyTorch, TensorFlow and MXnet.
In general, there are several methods to install deep learning tools, each with advantages and disadvantages:
- modules: we provide pre-built software stacks for several deep learning tools: this is the easiest way to use them. If you need specific versions or build options, contact us.
- IBM PowerAI conda channel: IBM provides a Conda channel with deep learning tools built for ppc64. It is easy to install, but the provided tool versions are often quite out of date.
- pip packages: a few tools provide pip packages for ppc64, but this is rare: most pip packages are only available for x86_64
- Docker images: we support installing Nvidia-docker, including GPU support (see Docker#Nvidia-docker). You will need to run ppc64 Docker images, though.
- build from source: this is for advanced users
See below for details on how to install each tool.
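Because the right option depends on the CPU architecture, a script shared between clusters may want to detect it first. The sketch below uses `platform.machine()`. The mapping from architecture to suggestion is an illustrative summary of this page, not an official policy, and `install_hint` is a hypothetical name.

```python
import platform

# Illustrative summary of the options listed on this page.
HINTS = {
    "x86_64": "conda or pip packages from the usual channels",
    "ppc64le": "modules, IBM PowerAI conda channel, ppc64le Docker images, or a source build",
}


def install_hint(machine=None):
    """Suggest an installation route for the given (or current) architecture."""
    machine = machine or platform.machine()
    return HINTS.get(machine, f"no guidance on this page for {machine!r}")


print(platform.machine(), "->", install_hint())
```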
Reserve ppc64 GPU nodes with OAR
- Reserve a ppc64 node with GPUs (see the Hardware page of drac cluster for details).
To reserve a full node for one hour:
- Once connected to the node, check GPU presence and the available CUDA version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.197.02   Driver Version: 418.197.02   CUDA Version: 11.2     |
+-----------------------------------------------------------------------------+
(...)
Note | |
---|---|
Nodes in the drac cluster come with a known-working Nvidia driver version in their default environment. If you install a more recent driver or deploy your own images, you may experience frequent system crashes with recent Nvidia drivers on Debian or Ubuntu. CentOS seems unaffected by the crashes. See nvidia developer forum for details. |
IBM PowerAI conda channel
IBM PowerAI provides a Conda channel with dedicated packages compiled for ppc64le: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/
- Load and activate conda
- Install a package '<package>' from IBM PowerAI
drac :
|
conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ <package> |
- Some packages in PowerAI might require older dependencies. For instance, the PowerAI version of PyTorch is too old for Python 3.8 or Python 3.9, so Python 3.7 must be used:
drac :
|
conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ pytorch python=3.7 |
PyTorch on ppc64
Load pytorch from modules
We provide a pre-built version of pytorch, and we can provide more versions on request. It is the easiest way to use pytorch as there is nothing to install.
As of November 2021, we provide pytorch 1.7.1. To use it:
Python 3.7.9
True
Note that you need to use the version of Python from our Modules, because Pytorch is built against Python 3.7 and won't work with the version of Python available in Debian 11 (Python 3.9).
That's it: your pytorch projects should now work while the module is loaded.
If you want to load the module in a non-interactive job, see Modules#Using_modules_in_jobs
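Since the module only works with the Python version it was built against, a script can fail early with a clear message instead of a confusing import error. A minimal sketch, assuming the Python 3.7 requirement stated above. `check_python` is a hypothetical helper.

```python
import sys

REQUIRED = (3, 7)  # version the pytorch module is built against, per the text above


def check_python(version_info=None, required=REQUIRED):
    """Return True when the interpreter's major.minor matches the required version."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) == required


if not check_python():
    found = ".".join(map(str, sys.version_info[:2]))
    print(f"Warning: Python {found} found, but the module expects 3.7")
```

The same guard would apply to the unofficial TensorFlow wheel, which this page also ties to Python 3.7.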
Install pytorch from IBM PowerAI
PowerAI 1.7.0 provides pytorch 1.3.1
- To install it, load conda and create a Python 3.7 environment:
- Add PowerAI repository:
drac :
|
conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ |
- Install pytorch:
- It will take around 10 minutes to download and install. Test that it works:
True
Tensorflow on ppc64
Install from conda via IBM PowerAI
- PowerAI 1.7.0 provides tensorflow 2.1.3. The principle is the same as for PyTorch: prepare a conda environment with Python 3.7:
drac :
|
module load conda
conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ |
- Install Tensorflow with GPU support:
- It will take around 10 minutes to download and install. Test that it works:
drac :
|
python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))" |
Num GPUs Available: 4
Install from pip
Tensorflow is not available in pip for ppc64. However, we can use a non-official pip package. It provides a reasonably recent version: tensorflow 2.3.2.
Unfortunately, as of November 2021, this unofficial package is not compatible with Python 3.9. It means that we have to use Python 3.7 through modules as a workaround.
- Start by loading Python 3.7:
Python 3.7.9
- Then create a virtualenv:
- Then install Tensorflow from the non-official pip wheel:
It takes around 5-10 minutes to install because some dependencies need to be compiled.
- At runtime, you will need cudnn. You can install it yourself, or we provide it as a module for convenience:
- Test that it works:
drac :
|
python3 -c "import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))" |
Num GPUs Available: 4
As before, if you want to load cudnn in a non-interactive job, see Modules#Using_modules_in_jobs
Build tensorflow from source
The last option is to build tensorflow from source yourself, which is useful if you need a specific version or specific features. This is for advanced users and we provide no support.
It has been reported to work with a CentOS docker container using https://github.com/tensorflow/build/tree/master/ppc64le_builds
See https://github.com/anji993/build/tree/anji993-patch-1/ppc64le_builds for build instructions on Grid'5000.
Nvidia-docker for ppc64
- Installation
To easily install Nvidia-docker on a node, see Docker#Nvidia-docker.
- Running ppc64le Docker images
You need to make sure you are running Docker images that are built for ppc64le.
Example sources of ppc64le images:
- https://hub.docker.com/r/ibmcom/tensorflow-ppc64le
- https://hub.docker.com/u/nvidia/
- https://hub.docker.com/u/ppc64le/
- https://hub.docker.com/search/?type=image&architecture=ppc64le
To test tensorflow with an image from IBM:
drac :
|
tensorflowtest="import tensorflow as tf; print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))" |
drac :
|
docker run -it --rm --gpus all ibmcom/tensorflow-ppc64le:latest-gpu-py3 python -c "$tensorflowtest" |
2021-02-15 11:33:10.853846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1686] Adding visible gpu devices: 0, 1, 2, 3
Num GPUs Available: 4