Submitting your job using SGE

From IBERS Bioinformatics and HPC Wiki

Sun Grid Engine (SGE) is the queue management software used to distribute jobs to the nodes on Bert and Ernie.

There are many excellent guides elsewhere, so this wiki does not cover every topic. However, some initial commands and scripts are introduced here.

To run an SGE job, you need a script. Create a file using your favourite editor (e.g. vi or nano) and give it a name (e.g. myScript). Then you can begin to add commands to the script. These commands are what will be executed on the node that runs the job. A very simple script is given below. It runs UNIX commands that print the name of the machine and how long it has been on-line.

NOTE: This is a very simple job script. It may work if there is little running on the HPC at the time, but it may fail if your job has significant memory or CPU requirements. It is good practice to set limits so that your job is placed into the queue correctly. That way it will not affect other people's work and your job will be scheduled correctly. Look at Complex submissions for more information on this.
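As a rough sketch of what such limits can look like, the snippet below writes a variant of the job script that adds a memory request and a run-time request, then checks its syntax. The file name myLimitedScript and the values 2G and 00:10:00 are hypothetical, and h_vmem and h_rt are assumed to be the resource names in use on this cluster; Complex submissions describes what is actually configured here. The simple script used in the rest of this page follows afterwards.

```shell
# Write a hypothetical job script with resource requests (values are illustrative only)
cat > myLimitedScript <<'EOF'
#specify the shell type
#$ -S /bin/sh

#run in the current working directory
#$ -cwd

#specify which queue you wish to use
#$ -q amd.q

#request a memory limit (assumes the cluster uses the h_vmem resource)
#$ -l h_vmem=2G

#request a run-time limit of 10 minutes (assumes the cluster uses the h_rt resource)
#$ -l h_rt=00:10:00

#run a program command to print hostname and uptime
hostname && uptime
EOF

# Check the script for shell syntax errors without running it
sh -n myLimitedScript && echo "myLimitedScript: syntax OK"
```

Because the lines beginning with #$ are ordinary comments to the shell, only SGE interprets them, so the same file can still be tested by hand before it is submitted.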

   
#specify the shell type
#$ -S /bin/sh

#run in the current working directory
#$ -cwd

#specify which queue you wish to use
#$ -q amd.q

#run a program command to print hostname and uptime
hostname && uptime
    

To submit this to the Sun Grid Engine queue, type:

   
qsub myScript
    

This submits the job to an available node and creates two files, myScript.o1234 and myScript.e1234, where 1234 is the job number. The job number changes with every job that is submitted.
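If you want to work out the file names for a particular job, the job number can be taken from the confirmation line that qsub prints. The sketch below assumes that line has the usual form shown in the msg variable; the sample text is a stand-in written for illustration, not output captured from Bert or Ernie.

```shell
# Stand-in for the confirmation line printed by qsub (assumed format)
msg='Your job 1234 ("myScript") has been submitted'
script="myScript"

# The job number is the third whitespace-separated field of the message
job_id=$(printf '%s\n' "$msg" | awk '{print $3}')

# Build the names of the output and error files for that job
out_file="$script.o$job_id"
err_file="$script.e$job_id"

echo "output file: $out_file"
echo "error file:  $err_file"
```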

If you view the contents of myScript.o1234, you will see the output that the program would normally print to the screen (its standard output). If you view myScript.e1234, you will see any error messages that the program would have printed to the screen (its standard error).

   
[username@bert ~]$ cat myScript.o1234
node001
 12:16:15 up 9 days,  3:41,  0 users,  load average: 0.00, 0.00, 0.00
    

As you can see, this was run on node001.

This time we will run the command multiple times.

   
[username@bert ~]$ for i in {1..10}; do qsub myScript; done
    

This produces 20 files: 10 output files (.o) and 10 error files (.e), each pair named with a different job number. To view the contents of every output file, you can cat all of them using the following command:

WARNING - The pattern *.o* matches any file in the current working directory whose name contains .o, so the contents of unrelated files that match will be printed too.

   
[username@bert ~]$ cat *.o*
node001
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.00, 0.00, 0.00
node007
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.00, 0.00, 0.00
node006
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.09, 0.08, 0.07
node005
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.12, 0.11, 0.04
node007
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.00, 0.00, 0.00
node001
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.00, 0.00, 0.00
node006
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.09, 0.08, 0.07
node005
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.12, 0.11, 0.04
node001
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.00, 0.00, 0.00
node007
 12:25:00 up 9 days,  3:49,  0 users,  load average: 0.00, 0.00, 0.00
    

As you can see, the node names are different, meaning that the jobs were not all run on the same node.
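One way to avoid the problem described in the warning above is to include the script name in the pattern, and the same output can be summarised to show how many jobs ran on each node. The sketch below demonstrates both in a scratch directory; the files and their contents are invented stand-ins (including notes.old, a hypothetical unrelated file), not real job output.

```shell
# Work in a scratch directory with stand-in files (contents invented for illustration)
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'node001\n' > myScript.o1234
printf 'node007\n' > myScript.o1235
printf 'node001\n' > myScript.o1236
printf 'unrelated\n' > notes.old   # hypothetical unrelated file that also matches *.o*

# The broad pattern also picks up notes.old
broad_count=$(cat *.o* | wc -l | tr -d ' ')

# The narrower pattern matches only the output files of myScript
narrow_count=$(cat myScript.o* | wc -l | tr -d ' ')

# Count how many jobs ran on each node
per_node=$(cat myScript.o* | grep '^node' | sort | uniq -c)

echo "lines matched by *.o*:         $broad_count"
echo "lines matched by myScript.o*:  $narrow_count"
echo "$per_node"
```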


Queues

You can choose which queue your job is submitted to, and this determines which nodes your job can run on. The available queues are shown below. If in doubt, use the AMD or Intel queues (amd.q or intel.q).


Name          Nodes     Purpose
all.q         003       DON'T USE THIS!!!!
large.q       001,012   Machines with 512GB of RAM and lots of CPUs
fat.q         002       Machines with 1TB of RAM, limited to certain users
intel.q       003,004   8 core Intel CPUs
amd.q         005-011   32/64 core AMD CPUs
metabolomics  13        For metabolomics group only, so they can access node 13
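The queue can be set with a #$ -q line inside the script, as in the example earlier on this page, or passed to qsub with the same -q option. The sketch below builds the command for the Intel queue and only runs it if qsub and the script are both present; otherwise it just prints what it would have run. It assumes, as is usual for SGE, that an option given on the command line takes precedence over the same option embedded in the script.

```shell
# Queue and script names taken from the examples on this page
queue="intel.q"
script="myScript"

# Build the submission command
submit_cmd="qsub -q $queue $script"

# Only submit when qsub is installed and the script exists; otherwise show the command
if command -v qsub >/dev/null 2>&1 && [ -f "$script" ]; then
    $submit_cmd
else
    echo "Would run: $submit_cmd"
fi
```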