Submitting your job using Slurm


Slurm is the queue management software used to distribute jobs to the nodes on Bert and Ernie.

There are many excellent Slurm guides elsewhere, so this wiki does not attempt to cover every topic. However, some initial commands and scripts are introduced here.

To run a Slurm job, you need a job script. Simply create a file using your favourite editor (e.g. vi or nano) and give it a name (e.g. myScript.slurm). Then you can begin to add commands to the script; this is what will be executed on the node(s) allocated to your job. A very simple script is given below, which runs a UNIX command to tell you how long the machine has been on-line.

NOTE: This is a very simple job script. It may work if there is little running on the HPC at the time, but it may fail if your job has memory and CPU requirements. It is good practice to set resource limits so that your job is placed into the queue correctly; that way it will not affect other people's work and it will be scheduled properly. A sketch of adding such limits is given just after the example script, and the Complex submissions page has more information.

   
#!/bin/bash --login
# specify the shell type

# Specify the queue (also known as a partition)
#SBATCH --partition=amd

# run a single task, using a single CPU core
#SBATCH --ntasks=1

# specify the file to save job output to, %J will be replaced with a unique job number
#SBATCH --output=myScript.o%J

# specify the file to save job errors to, %J will be replaced with a unique job number
#SBATCH --error=myScript.e%J


# run commands to print the hostname and uptime
/bin/hostname && /bin/uptime
    
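
If your job has particular memory, CPU or run-time requirements, you can add further #SBATCH directives to the same script. A minimal sketch is shown below; the values are illustrative placeholders only, so adjust them to what your job actually needs.

   
# request 4 CPU cores for the single task
#SBATCH --cpus-per-task=4

# request 8 gigabytes of memory for the job
#SBATCH --mem=8G

# set a wall-clock time limit of 2 hours
#SBATCH --time=02:00:00
    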

To submit the script to the Slurm queue, simply type:

   
sbatch myScript.slurm
    

This will submit the job to an available node and create two files, myScript.o1234 and myScript.e1234, where 1234 is the job number; the number changes as more jobs are submitted.
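
sbatch prints the job number when it accepts the job, and you can check on a queued or running job with squeue. The job number below is just an example:

   
[username@bert ~]$ sbatch myScript.slurm
Submitted batch job 1234

# list only your own jobs in the queue
[username@bert ~]$ squeue -u $USER
    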

If you view the contents of myScript.o1234 (which contains the intended output), you will see the screen output of the program. If you view myScript.e1234, it will contain any errors that were printed to the screen.

   
cat myScript.o1234
node008.hpc.private
 16:07:03 up 17 days,  2:05,  0 users,  load average: 0.08, 0.02, 0.01
    

As you can see, this was run on node008.

This time we will submit the same job multiple times using a shell loop.

   
for i in {1..64}; do sbatch myScript.slurm; done
    
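
A shell loop like this works, but Slurm can also do this natively with a job array, which submits all of the copies in one go. A sketch is given below; in the output and error file names, %A is the array job number and %a is the index of each task:

   
sbatch --array=1-64 --output=myScript.o%A_%a --error=myScript.e%A_%a myScript.slurm
    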

The loop produces 128 files, identical apart from their job numbers: 64 ending in .o and 64 ending in .e. To view the contents of every output file (.o) you can cat them all with the following command:

WARNING - If you do this and the current working directory contains other files whose names match myScript.o*, those will be printed too.

   
[username@bert ~]$ cat myScript.o*
node008.hpc.private
 16:10:58 up 17 days,  2:09,  0 users,  load average: 0.00, 0.00, 0.00
node008.hpc.private
 16:10:58 up 17 days,  2:09,  0 users,  load average: 0.00, 0.00, 0.00
node008.hpc.private
 16:10:58 up 17 days,  2:09,  0 users,  load average: 0.00, 0.00, 0.00
node009.hpc.private
 16:11:16 up 17 days,  2:06,  0 users,  load average: 1.92, 0.40, 0.13
node009.hpc.private
 16:11:16 up 17 days,  2:06,  0 users,  load average: 1.92, 0.40, 0.13
node009.hpc.private
 16:11:16 up 17 days,  2:06,  0 users,  load average: 1.92, 0.40, 0.13
.
.
.
.
    

As you can see, the node names are different, meaning that the jobs were not all run on the same node.
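
If job accounting is enabled on the cluster, you can also ask Slurm directly which node a finished job ran on, rather than reading it from the output file. A sketch using sacct, with 1234 standing in for a real job number:

   
# show the node(s) used and the final state of job 1234
sacct -j 1234 --format=JobID,NodeList,State
    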


Queues

You can choose which queue (Slurm calls these partitions) your job is submitted to, and this determines which nodes your job runs on. The available queues are shown below. If in doubt, use the amd queue.


Name      Nodes     Purpose
highmem   012       Machines with 512GB of RAM and lots of CPUs
fat       002       Machines with 1TB of RAM
intel     003,004   8-core Intel CPUs
amd       005-011   32/64-core AMD CPUs
gpu       13        Machines with an NVIDIA Graphics Processing Unit (GPU)
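
To use a different queue, change the --partition directive in your job script to one of the names above. You can also check which partitions exist and the state of their nodes with sinfo. For example, to target the high-memory machines:

   
# list the partitions and the state of their nodes
sinfo

# in the job script, request the highmem partition instead of amd
#SBATCH --partition=highmem
    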