Difference between revisions of "Monitoring your jobs"

From IBERS Bioinformatics and HPC Wiki

Latest revision as of 16:28, 27 October 2022

There are various ways for you to monitor and check up on your running and completed jobs.


Check on what you've submitted

Once you have submitted your job scripts, you may want to check on their progress. This is done with the squeue command, which will show you your jobs. The output might look something like:

  
[user@login01(aber) ~]$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            200133       amd myScript      cos  R       0:03      1 node008
            200134       amd myScript      cos  R       0:01      1 node008
            200135       amd myScript      cos  R       0:01      1 node008
            200136       amd myScript      cos  R       0:01      1 node008
            200137       amd myScript      cos  R       0:01      1 node008
            200138       amd myScript      cos  R       0:02      1 node008
            200139       amd myScript      cos  R       0:02      1 node008
            200140       amd myScript      cos  R       0:02      1 node008
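
When many jobs are listed, a per-state tally can be easier to read than the full table. The following is a minimal sketch, not part of the cluster's tooling: count_job_states is a hypothetical helper name, and it assumes the default squeue column layout shown above, where ST is the fifth column and job names contain no spaces.

```shell
# Count jobs per state from squeue-style output read on stdin.
# count_job_states is a hypothetical helper, not a Slurm command.
# Assumes the default column layout (ST is field 5) and job names
# without spaces; a name containing spaces would shift the columns.
count_job_states() {
    # Skip the header line, then tally the ST column.
    awk 'NR > 1 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# On the cluster you might run (assumes squeue is on your PATH):
#   squeue -u "$USER" | count_job_states
```

Each output line is a state code followed by the number of jobs in that state.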
     
   


Check the status of a job

You can use the squeue -j JOB_ID command to get information about a running or queued job. Below is what you might see for a running job.

  
[user@bert ~]$ squeue -j 200133
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            200133       amd myScript      cos  R       0:03      1 node008
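
If you only need the state of one job, for example inside a script, the table can be reduced to a single code. This is a sketch under stated assumptions: job_state and wait_for_job are hypothetical helper names, and the 30 second default polling interval is an arbitrary choice, not a site recommendation. It relies on the standard squeue options -h (no header), -j (select a job) and -o "%t" (compact state code).

```shell
# job_state (hypothetical helper): print the compact state code of one job,
# e.g. R or PD, or nothing if squeue no longer lists the job.
job_state() {
    # -h drops the header, -j selects the job, -o "%t" prints only the state.
    squeue -h -j "$1" -o "%t" 2>/dev/null
}

# wait_for_job (hypothetical helper): block until the job leaves the queue.
# The second argument is the polling interval in seconds (default 30).
wait_for_job() {
    while [ -n "$(job_state "$1")" ]; do
        sleep "${2:-30}"
    done
}

# Example use on the cluster:
#   job_state 200133
#   wait_for_job 200133 60 && echo "job 200133 has left the queue"
```

Keep the polling interval generous, since every call to squeue places a request on the scheduler.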
   


Job states

R = Job is running
PD = Job is waiting to run
CG = Job is completing
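
The state codes listed here can be translated into their descriptions in a script. The sketch below covers only the three codes given on this page; describe_state is a hypothetical helper name, and any other code is reported as unknown rather than guessed.

```shell
# describe_state (hypothetical helper): map a compact squeue state code
# to the description used on this page.
describe_state() {
    case "$1" in
        R)  echo "Job is running" ;;
        PD) echo "Job is waiting to run" ;;
        CG) echo "Job is completing" ;;
        *)  echo "Unknown state: $1" ;;
    esac
}

# Example use on the cluster (%i is the job ID, %t the compact state):
#   squeue -h -u "$USER" -o "%i %t" | while read -r id st; do
#       echo "$id: $(describe_state "$st")"
#   done
```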