There are various ways for you to monitor and check up on your running and completed jobs.
=== Check on jobs you've submitted ===
Once you have submitted your job scripts, you may want to check on the progress of what is running. This is achieved using the <nowiki>squeue</nowiki> command, which will show you your jobs. It might look something like;
<nowiki>
[user@login01(aber) ~]$ squeue
  JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
 200133       amd myScript  cos  R  0:03     1 node008
 200134       amd myScript  cos  R  0:01     1 node008
 200135       amd myScript  cos  R  0:01     1 node008
 200136       amd myScript  cos  R  0:01     1 node008
 200137       amd myScript  cos  R  0:01     1 node008
 200138       amd myScript  cos  R  0:02     1 node008
 200139       amd myScript  cos  R  0:02     1 node008
 200140       amd myScript  cos  R  0:02     1 node008
</nowiki>

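With many jobs in the queue it can be handy to summarise the <nowiki>squeue</nowiki> listing rather than read it line by line. Below is a minimal sketch using sample lines copied from the listing above; on the cluster you would pipe live <nowiki>squeue</nowiki> output into the same <nowiki>awk</nowiki> command instead.

```shell
# Count jobs per state (the ST column, field 5) in squeue-style output.
# The sample text below is copied from the listing above; it is NOT live
# output. On a login node you would run:  squeue | awk 'NR > 1 ...'
listing='  JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
 200133       amd myScript  cos  R  0:03     1 node008
 200134       amd myScript  cos  R  0:01     1 node008
 200135       amd myScript  cos  R  0:01     1 node008'

# Skip the header row, then tally the state column.
echo "$listing" | awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }'
# prints: R 3
```

Note that <nowiki>squeue -u username</nowiki> restricts the listing to a single user's jobs, which is often all you need.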
=== Check the status of a job ===
You can use the <nowiki>squeue -j JOB_ID</nowiki> command to get information about a running or queued job. Below is what you might find on a running job.
<nowiki>
[user@login01(aber) ~]$ squeue -j 200133
  JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
 200133       amd myScript  cos  R  0:03     1 node008
</nowiki>

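In a script you may want just the state code of a single job rather than the whole row. A small sketch, again using a sample line copied from the output above rather than live data; Slurm's own formatting options (<nowiki>squeue -j JOB_ID -h -o %t</nowiki>) can also print the compact state directly.

```shell
# Extract the state code (ST, field 5) for one job from squeue-style
# output. The sample line is copied from above, not live output; on the
# cluster, `squeue -j 200133 -h -o %t` would print the state directly.
line=' 200133       amd myScript  cos  R  0:03     1 node008'
state=$(echo "$line" | awk '{ print $5 }')
echo "$state"
# prints: R
```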
=== Job states ===
* R = Job is running
* PD = Job is waiting to run
* CG = Job is completing
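In job-management scripts the short codes above can be expanded into readable text with a simple lookup. A minimal sketch covering just the three states described here:

```shell
# Translate a Slurm short state code into a description.
# Only the three codes listed above are handled.
describe_state() {
    case "$1" in
        R)  echo "running" ;;
        PD) echo "waiting to run" ;;
        CG) echo "completing" ;;
        *)  echo "unknown state: $1" ;;
    esac
}

describe_state R    # prints: running
describe_state PD   # prints: waiting to run
```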