With the increasing complexity and size of both Deep Learning (DL) models and datasets, the computational cost of training these models can be non-trivial, ranging from a few tens of hours to several days. By exploiting the parallelism inherent in the training process of DL models, we can distribute training across multiple GPUs on single or multiple nodes of Ibex. We will survey the available distributed training frameworks (e.g. PyTorch DDP, PyTorch Lightning, JAX) along with demonstrations and hands-on exercises on how to run them on Ibex resources, as sketched below.
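As a flavour of what the hands-on sessions cover, the sketch below shows a minimal PyTorch DDP training loop launched with torchrun. The toy model, dataset, and script layout are illustrative assumptions, not the workshop material or an Ibex-specific recipe.

```python
# Minimal PyTorch DDP sketch (illustrative only).
# Launch with, e.g.:  torchrun --nproc_per_node=<num_gpus> ddp_example.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model; a real job would load its own data and network.
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    sampler = DistributedSampler(data)            # shards the data across processes
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    model = nn.Linear(32, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])   # wraps the model for gradient sync
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                       # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same pattern scales from a single multi-GPU node to multiple nodes by changing only the launcher arguments; the training loop itself stays unchanged.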
After attending the training, you will be able to:
A quiz will be conducted after the training; submitting it is mandatory to ensure continued use of KSL resources.
Date
Venue
Organizers
Didier Barradas Bautista
Visualization Core Laboratory
didier.barradasbautista@kaust.edu.sa
Mohsin A. Shaikh
Supercomputing Core Laboratory
mohsin.shaikh@kaust.edu.sa
Prerequisites