Monitoring your jobs
Revision as of 12:00, 21 July 2014
There are various ways for you to monitor and check up on your running and completed jobs.
See the status of the nodes
The easiest way to see what is happening on the cluster is to check Ganglia first. This is a web-based monitoring application that displays statistics about the cluster and its nodes. To view it, visit:
http://bert.ibers.aber.ac.uk/ganglia
There are a variety of statistics to view. Probably the most useful is load_one, which shows the one-minute CPU load average on each node. You can also monitor the overall averages along with memory and network usage.
Check on what you've submitted
Once you have submitted your job scripts, you may want to check on their progress. To do this, use the qstat command, which shows you your jobs. The output might look something like this:
[user@bert ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 758061 0.50042 k2bRC-a1.i user         r     07/20/2014 14:13:33 amd.q@node010.cm.cluster           1
 758062 0.50042 k2bRC-a2.i user         r     07/20/2014 14:13:33 amd.q@node009.cm.cluster           1
 758063 0.50042 k2bRC-a3.i user         r     07/20/2014 14:13:48 amd.q@node009.cm.cluster           1
 758064 0.50042 k2bRC-a4.i user         r     07/20/2014 14:13:48 amd.q@node008.cm.cluster           1
 758065 0.50042 k2bRC-a5.i user         qw    07/20/2014 14:14:03                                    1
 758066 0.60208 k2bRC-a6.i user         qw    07/20/2014 14:14:18                                    1
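When the listing is long, it can help to summarise it. The following is a minimal sketch, not part of the cluster's tooling, that counts jobs per state code (r for running, qw for queued and waiting) in text captured from qstat. The function name count_states and the SAMPLE text are illustrative, and the parsing assumes job names contain no spaces.

```python
from collections import Counter

# Sample text copied from the qstat listing above (column widths trimmed).
SAMPLE = """\
job-ID  prior   name       user  state submit/start at     queue                    slots
------------------------------------------------------------------------------------------
 758061 0.50042 k2bRC-a1.i user  r     07/20/2014 14:13:33 amd.q@node010.cm.cluster     1
 758062 0.50042 k2bRC-a2.i user  r     07/20/2014 14:13:33 amd.q@node009.cm.cluster     1
 758063 0.50042 k2bRC-a3.i user  r     07/20/2014 14:13:48 amd.q@node009.cm.cluster     1
 758064 0.50042 k2bRC-a4.i user  r     07/20/2014 14:13:48 amd.q@node008.cm.cluster     1
 758065 0.50042 k2bRC-a5.i user  qw    07/20/2014 14:14:03                              1
 758066 0.60208 k2bRC-a6.i user  qw    07/20/2014 14:14:18                              1
"""


def count_states(qstat_text):
    """Count jobs per state code in plain `qstat` output (hypothetical helper)."""
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; the header and rule lines do not.
        # The state code is the fifth whitespace-separated column.
        if len(fields) >= 5 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts


if __name__ == "__main__":
    for state, number in count_states(SAMPLE).items():
        print(state, number)
```

For the sample above this reports four jobs in state r and two in state qw.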
Check the status of a job
You can use the qstat -j JOB_ID command to get information about a running or queued job. Below is what you might see for a running job.
[user@bert ~]$ qstat -j 758061
==============================================================
job_number:                 758061
exec_file:                  job_scripts/758061
submission_time:            Sun Jul 20 14:13:32 2014
owner:                      user
uid:                        100000
group:                      users
gid:                        100000
sge_o_home:                 /ibers/ernie/home/user/
sge_o_log_name:             user
sge_o_path:                 /ibers/ernie/home/user/perl5/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /ibers/ernie/scratch/user/CGR/dots
sge_o_host:                 bert
account:                    sge
cwd:                        /ibers/ernie/scratch/user/CGR/dots
stderr_path_list:           NONE:NONE:k2bRC-a1.e
hard resource_list:         h_stack=512m,h_vmem=20.0G
mail_list:                  user@bert.cm.cluster
notify:                     FALSE
job_name:                   k2bRC-a1.i
stdout_path_list:           NONE:NONE:k2bRC-a1.o
jobshare:                   0
hard_queue_list:            amd.q
script_file:                k2bRC-a1.i
usage    1:                 cpu=22:21:14, mem=967478.88534 GBs, io=4.91085, vmem=13.401G, maxvmem=13.401G
scheduling info:            queue instance "intel.q@node003.cm.cluster" dropped because it is full
                            queue instance "amd.q@node008.cm.cluster" dropped because it is full
                            queue instance "intel.q@node004.cm.cluster" dropped because it is full
                            queue instance "amd.q@node009.cm.cluster" dropped because it is full
                            queue instance "amd.q@node007.cm.cluster" dropped because it is full
                            queue instance "amd.q@node010.cm.cluster" dropped because it is full
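One practical use of this output is to compare the peak memory used so far (maxvmem on the usage line) with the limit that was requested (h_vmem in the hard resource list). The sketch below does that arithmetic on the two relevant lines copied from the output above. The helper names are illustrative, and the suffix multipliers are an assumption based on the conventions described in the sge_types manual page (lowercase decimal, uppercase binary), so check them against your installation.

```python
import re

# The two relevant lines, copied from the qstat -j output above.
SAMPLE = """\
hard resource_list:         h_stack=512m,h_vmem=20.0G
usage    1:                 cpu=22:21:14, mem=967478.88534 GBs, io=4.91085, vmem=13.401G, maxvmem=13.401G
"""

# Assumed multipliers: lowercase suffixes are decimal, uppercase are binary.
UNITS = {"k": 1000, "K": 1024, "m": 1000**2, "M": 1024**2, "g": 1000**3, "G": 1024**3}


def to_bytes(value):
    """Convert a size such as '20.0G' or '512m' to a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)([kKmMgG]?)", value)
    if match is None:
        raise ValueError("unrecognised size: " + value)
    number, suffix = match.groups()
    return float(number) * UNITS.get(suffix, 1)


def find_value(text, key):
    """Return the text after 'key=' up to the next comma or whitespace."""
    # \b stops a short key from matching inside a longer one (vmem in maxvmem).
    match = re.search(r"\b" + re.escape(key) + r"=([^,\s]+)", text)
    if match is None:
        raise KeyError(key)
    return match.group(1)


def vmem_fraction(qstat_j_text):
    """Fraction of the requested h_vmem that maxvmem has reached so far."""
    used = to_bytes(find_value(qstat_j_text, "maxvmem"))
    limit = to_bytes(find_value(qstat_j_text, "h_vmem"))
    return used / limit


if __name__ == "__main__":
    print("{:.1%} of the h_vmem request used".format(vmem_fraction(SAMPLE)))
```

For the job above, 13.401G out of a 20.0G request is roughly two thirds of the limit, so the job still has headroom.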
Figuring out why your job is still in the queue
You may find that your job does not start and is still queued after some time. To work out why, first take a look at the job in the queue using the qstat -j JOB_ID command. For example:
[mjv08@bert ~]$ qstat -j 756329
==============================================================
job_number:                 756329
exec_file:                  job_scripts/756329
submission_time:            Fri Jul 18 11:33:07 2014
owner:                      user
uid:                        10000
group:                      users
gid:                        10000
sge_o_home:                 /ibers/ernie/home/user
sge_o_log_name:             user
sge_o_shell:                /bin/bash
sge_o_workdir:              /ibers/ernie/scratch/user/data
sge_o_host:                 bert
account:                    sge
cwd:                        /ibers/ernie/scratch/user/data
stderr_path_list:           NONE:NONE:y
merge:                      y
hard resource_list:         h_stack=512m,h_vmem=40G
mail_list:                  user@aber.ac.uk
notify:                     FALSE
job_name:                   align-L7R
jobshare:                   0
hard_queue_list:            amd.q
shell_list:                 NONE:/bin/sh
job_args:                   /ibers/ernie/home/user/stuff
script_file:                align-all.sh
parallel environment:  multithread range: 12
scheduling info:            queue instance "intel.q@node004.cm.cluster" dropped because it is full
                            queue instance "intel.q@node003.cm.cluster" dropped because it is full
                            queue instance "amd.q@node010.cm.cluster" dropped because it is full
                            queue instance "amd.q@node008.cm.cluster" dropped because it is full
                            queue instance "amd.q@node007.cm.cluster" dropped because it is full
                            queue instance "amd.q@node009.cm.cluster" dropped because it is full
                            cannot run in queue "large.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "metabolomics.q" because it is not contained in its hard queue list (-q)
                            cannot run in PE "multithread" because it only offers 0 slots
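At the bottom you can see that the job has been submitted to the amd queue (hard_queue_list: amd.q) and that 12 slots (CPU cores) have been requested in the multithread parallel environment. The scheduling info then explains why the job is waiting: the queue instances it lists are full, and the last line reports that the multithread PE currently offers 0 slots, so the 12 slots requested are not available.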
You can see the availability of slots using the qstat -f command.
[user@bert ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
amd.q@node005.cm.cluster       BIP   0/25/32        24.44    lx26-amd64
---------------------------------------------------------------------------------
amd.q@node006.cm.cluster       BIP   0/31/32        31.00    lx26-amd64
---------------------------------------------------------------------------------
amd.q@node007.cm.cluster       BIP   0/32/32        31.93    lx26-amd64
---------------------------------------------------------------------------------
amd.q@node008.cm.cluster       BIP   0/64/64        63.96    lx26-amd64
---------------------------------------------------------------------------------
amd.q@node009.cm.cluster       BIP   0/64/64        63.88    lx26-amd64
---------------------------------------------------------------------------------
amd.q@node010.cm.cluster       BIP   0/64/64        63.95    lx26-amd64
---------------------------------------------------------------------------------
amd.q@node011.cm.cluster       BIP   0/56/64        51.01    lx26-amd64
---------------------------------------------------------------------------------
intel.q@node003.cm.cluster     BP    0/8/8          8.12     lx26-amd64
---------------------------------------------------------------------------------
intel.q@node004.cm.cluster     BP    0/7/8          7.96     lx26-amd64
---------------------------------------------------------------------------------
large.q@node001.cm.cluster     BIP   0/26/32        21.86    lx26-amd64
---------------------------------------------------------------------------------
metabolomics.q@node002.cm.clus BP    0/0/8          0.01     lx26-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
 756329 0.60208 align-L7R  user         qw    07/19/2014 16:51:22    12
[user@bert ~]$
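The resv/used/tot. column makes it possible to work out the free slots on each queue instance by subtraction. The sketch below does this for the rows shown above (separator lines omitted) and checks whether any amd.q instance could hold the 12 slots requested by job 756329. The function name free_slots is illustrative, and the check assumes the multithread parallel environment needs all of its slots on a single node, which is typical for threaded jobs but depends on how the PE is configured.

```python
# Queue instance rows copied from the qstat -f output above.
SAMPLE = """\
queuename                      qtype resv/used/tot. load_avg arch          states
amd.q@node005.cm.cluster       BIP   0/25/32        24.44    lx26-amd64
amd.q@node006.cm.cluster       BIP   0/31/32        31.00    lx26-amd64
amd.q@node007.cm.cluster       BIP   0/32/32        31.93    lx26-amd64
amd.q@node008.cm.cluster       BIP   0/64/64        63.96    lx26-amd64
amd.q@node009.cm.cluster       BIP   0/64/64        63.88    lx26-amd64
amd.q@node010.cm.cluster       BIP   0/64/64        63.95    lx26-amd64
amd.q@node011.cm.cluster       BIP   0/56/64        51.01    lx26-amd64
intel.q@node003.cm.cluster     BP    0/8/8          8.12     lx26-amd64
intel.q@node004.cm.cluster     BP    0/7/8          7.96     lx26-amd64
large.q@node001.cm.cluster     BIP   0/26/32        21.86    lx26-amd64
metabolomics.q@node002.cm.clus BP    0/0/8          0.01     lx26-amd64
 756329 0.60208 align-L7R  user         qw    07/19/2014 16:51:22    12
"""


def free_slots(qstat_f_text):
    """Map each queue instance to its number of unreserved, unused slots."""
    free = {}
    for line in qstat_f_text.splitlines():
        fields = line.split()
        # Queue instance rows have "queue@host" first and "resv/used/tot" third;
        # header, rule, and pending-job lines fail one of these checks.
        if len(fields) >= 3 and "@" in fields[0] and fields[2].count("/") == 2:
            resv, used, total = (int(n) for n in fields[2].split("/"))
            free[fields[0]] = total - resv - used
    return free


if __name__ == "__main__":
    wanted = 12  # slots requested by job 756329
    free = free_slots(SAMPLE)
    amd = {name: n for name, n in free.items() if name.startswith("amd.q@")}
    for name, n in sorted(amd.items()):
        print(name, n)
    fits = [name for name, n in amd.items() if n >= wanted]
    print("instances with room for", wanted, "slots:", fits)
```

In this snapshot the largest gap on any amd.q instance is 8 free slots (node011), so under the single-node assumption the job has to wait until other jobs finish.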