Distributed Deep Learning on KSL Platforms
Overview
With the increasing complexity and size of both Deep Learning (DL) models and datasets, the computational cost of training these models can be non-trivial, ranging from tens of hours to several days. By exploiting the parallelism inherent in the training process of DL models, we can distribute training across multiple GPUs on single or multiple nodes of Ibex. We will survey the available distributed training frameworks (e.g. PyTorch DDP, PyTorch Lightning, JAX), with demonstrations and hands-on exercises on how to run them on Ibex resources.
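As a preview of the kind of refactoring discussed in the session, the sketch below shows a minimal PyTorch DistributedDataParallel training step. It is illustrative only: the model, data, and hyperparameters are placeholders, and Ibex-specific details (modules, environments, GPU types) are covered during the hands-on part.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model; replace with your own network
        model = torch.nn.Linear(128, 10).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = torch.nn.CrossEntropyLoss()

        # Dummy batch; in practice use a DataLoader with a DistributedSampler
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # gradients are averaged across ranks by DDP
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()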
Learning Outcomes
After attending the training, you will be able to:
- Understand the considerations when refactoring training scripts to scale from 1 to N GPUs
- Understand the data management related to distributed training jobs
- Become familiar with how to launch distributed training jobs on Ibex resources (see the example job script after this list)
- Understand the scaling characteristics of your distributed training workload
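For the outcome on launching jobs, the following is a minimal sketch of a SLURM batch script that runs the DDP example above on a single node; the job name, resource flags, environment name, and script name are placeholders, not the exact settings used on Ibex, which are covered in the session.

    #!/bin/bash
    #SBATCH --job-name=ddp-demo
    #SBATCH --nodes=1
    #SBATCH --gpus=4              # placeholder request: 4 GPUs on one node
    #SBATCH --cpus-per-gpu=4
    #SBATCH --time=01:00:00

    # Activate your Python environment (placeholder name; use conda/mamba as appropriate)
    source activate torch-env

    # Launch one process per GPU; train.py is the DDP sketch above
    torchrun --standalone --nproc_per_node=4 train.py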
A quiz will be conducted after the training; submitting it is mandatory to ensure continued use of KSL resources.
Date
- February 12th, 2023
- 9:00 am - 12:00 pm
Venue
- Room 5220, Level 5, Building 3
Organizers
Didier Barradas Bautista
Visualization Core Laboratory
didier.barradasbautista@kaust.edu.sa
Mohsin A. Shaikh
Supercomputing Core Laboratory
mohsin.shaikh@kaust.edu.sa
Prerequisites
- Have KAUST IT credentials (i.e. the ones you use to access your KAUST email)
- Bring your laptop and have your terminal ready
- Essential knowledge of the Linux shell
- Some experience working with the Conda package manager
- Completion of the basic training “Data Science on-boarding on KSL platforms”, or equivalent knowledge