Running on GPUs
Overview
Teaching: 10 min
Exercises: 10 min
Questions
How do I run software that makes use of a GPU?
Objectives
Be able to submit jobs that can run on a GPU
Although first designed to make graphics in video games look more impressive, Graphics Processing Unit (GPU) accelerators have since been found to give very high performance in numerically intensive computation (at the expense of the flexibility offered by a CPU). In particular, they perform very well for machine learning workflows, where they can be orders of magnitude faster than the CPU that drives them.
What’s available
Bert provides access to an NVIDIA A100 GPU. Since the node hosting it is separate from the nodes used so far for CPU computation, we need to specify the gpu partition so that Slurm allocates the correct node.
sbatch options for GPUs
In order to submit to the GPU partition, we need to add two lines to our job scripts:
#SBATCH --partition=partition_name_goes_here
#SBATCH --gres=gpu:number_of_gpus_goes_here
Here, replace partition_name_goes_here with the name of the GPU partition (gpu on Bert), and replace number_of_gpus_goes_here with the number of GPUs you want to use (most frequently 1). Slurm will then find a free GPU and ensure it is reserved for your job. Most GPU-enabled software (including common machine learning libraries like Tensorflow and PyTorch) will detect which GPU Slurm has assigned and automatically use it.
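As a rough illustration of how software can tell which GPU it has been given: Slurm usually exports the environment variable CUDA_VISIBLE_DEVICES for jobs that request a GPU through --gres, and CUDA-based libraries read it. The sketch below assumes that Bert's Slurm configuration does this. The helper name visible_gpus is made up for this example and is not part of Slurm or any library.

```python
import os


def visible_gpus(environ=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES.

    Assumption: Slurm sets this variable for jobs that request GPUs.
    Returns an empty list if the variable is unset or empty.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # Entries are comma-separated; keep them as strings, since they
    # may be indices ("0") or device UUIDs.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("GPUs assigned to this job:", ", ".join(gpus))
    else:
        print("No GPUs visible; was --gres=gpu:N requested?")
```

Running this inside a job script that requests --gres=gpu:1 would be expected to report a single identifier. On a login node, or in a job without a GPU request, it will most likely report that no GPUs are visible.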
To test this, let’s run an example using Tensorflow.
Firstly, we need to create a new file called tf_simple.py, for example using nano.
The following program will use Tensorflow to perform some basic
arithmetic, after checking that the GPU is available:
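import tensorflow as tf
# List the GPUs that Tensorflow can see
print(tf.config.list_physical_devices('GPU'))
# Check that this Tensorflow build includes CUDA support
print(tf.test.is_built_with_cuda())
# Print the name of the default GPU device (empty if none is available)
print(tf.test.gpu_device_name())
# List all devices (CPU and GPU) visible to Tensorflow
print(tf.config.get_visible_devices())
# Log which device each operation is placed on
tf.debugging.set_log_device_placement(True)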
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(a, "times", b, "equals", c)
Now, to submit this to run on the cluster, we can create a job script called submit_tf.sh:
#!/bin/bash --login
###
# job name
#SBATCH --job-name=tensorflow_test
# job stdout file
#SBATCH --output=tensorflow_test.out.%J
# job stderr file
#SBATCH --error=tensorflow_test.err.%J
# maximum job time in D-HH:MM
#SBATCH --time=0-00:05
# Specify the GPU partition
#SBATCH --partition=gpu
# Specify how many GPUs we would like to use
#SBATCH --gres=gpu:1
###
# Load Anaconda and activate our environment with Tensorflow installed
module load anaconda3/2020.02
source activate workshop
python tf_simple.py
This can now be submitted to the queue using sbatch:
$ sbatch submit_tf.sh
Once the job has run, the output will look something like the following:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
True
/device:GPU:0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
tf.Tensor(
[[1. 2. 3.]
 [4. 5. 6.]], shape=(2, 3), dtype=float32) times tf.Tensor(
[[1. 2.]
 [3. 4.]
 [5. 6.]], shape=(3, 2), dtype=float32) equals tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
This output shows that the GPU /device:GPU:0 is available to this job.
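The matrix product in the output can be cross-checked without a GPU or Tensorflow. The following is a minimal pure-Python sketch of the same multiplication. The function matmul is written for this example only and is not how Tensorflow computes the result internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    # Every row of a must have as many entries as b has rows.
    if any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions do not match")
    # zip(*b) iterates over the columns of b.
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# The same values as in tf_simple.py
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
c = matmul(a, b)
print(c)  # [[22.0, 28.0], [49.0, 64.0]]
```

For instance, the top-left entry is 1×1 + 2×3 + 3×5 = 22, which matches the value Tensorflow reported.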
Train a neural network
Copy the file /ibers/repository/public/courses/tensorflow/test_train.py to your home directory. This program is borrowed from the Tensorflow tutorial, and will train a small neural network to recognise handwritten digits, a common example problem in machine learning.
Adjust the job script we wrote above to run this code on the GPU, and test whether it works.
Key Points
Use --partition=gpu to submit to a partition with GPUs.
Use --gres=gpu:1 (or similar) to specify the number of GPUs you need.