Directing jobs to particular machines

Posted by chris Thu, 14 Feb 2008 22:54:16 GMT

Andreas posted a short usage tip today for people who need to direct their jobs to particular named execution hosts. It covers the syntax for referencing an execution host name after a queue name or wildcard character:

You can do this with the -q option:

   -q "*@comp28,*@comp29"

or even shorter

  -q "*@comp28|comp29"

with the -l option you can specify the host(s) like this

  -l h="comp28|comp29"

if you have a large cluster with high throughput I recommend
the -q being used since -q matchmaking is generally faster.

Note: one more syntax tip. Remember the difference between "@" and "@@":

  • -q all.q@comp28 -- refer to a specific host
  • -q all.q@@LinuxNodes -- refer to a specific hostgroup
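
As a rough sketch of how these selectors fit into a full command line, the helper below builds the -q argument from a list of host names and prints (rather than runs) the resulting qsub commands. The function name queue_selector, the job script name myjob.sh and the host names are illustrative assumptions. The point to notice is the quoting, since an unquoted "*" or "|" would be interpreted by the shell before qsub ever sees it.

```shell
#!/bin/sh
# Sketch only: queue_selector and myjob.sh are hypothetical names.

# Join host names into a "*@host1,*@host2" selector for qsub -q.
queue_selector() {
    selector=""
    for host in "$@"; do
        if [ -n "$selector" ]; then
            selector="$selector,"
        fi
        selector="${selector}*@${host}"
    done
    printf '%s\n' "$selector"
}

# Print the command instead of running it, so the quoting can be inspected.
sel=$(queue_selector comp28 comp29)
printf 'qsub -q "%s" myjob.sh\n' "$sel"

# The -l form needs quotes too, because "|" is the shell's pipe character.
printf 'qsub -l "h=%s" myjob.sh\n' 'comp28|comp29'
```

Printing the command first is a cheap way to confirm that the selector reaches qsub as a single argument.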

Understanding queue error state 'E'

Posted by chris Sun, 20 Jan 2008 14:20:00 GMT

At my day job I usually handle SGE-related questions from our customers and clients. This morning, after responding to a support request concerning an SGE queue in state "E", I got curious and tried to find out how often we had been asked this. It turns out that I've probably sent ~25-30 unique responses on this specific subject, and each time my written response was different. This post is an attempt to create a single article that I can point people at as needed ...

Seeing "E" in the state column of qstat?

E state errors usually mean that an attempt to start a job failed in a spectacular manner and the Grid Engine qmaster decided to close off the queue instance to new jobs.

This is an important Grid Engine protective measure designed to keep your remaining pending jobs from a "black hole" draining effect in which they all successively get dispatched to the "bad" node and die instantly with errors.

There are different causes of state E. In most cases the root cause is some large, systemic hardware or OS level error or misconfiguration. Typical examples include:

  • The username of the job submitter does not exist on the execution host (extremely common)
  • Shared filesystem failure
  • Parallel jobs: syntax errors or bad commands in "start_proc_args" or "stop_proc_args" as defined within the parallel environment (PE)
  • Serial jobs: syntax errors in a "prolog" or "epilog" script, or a script that does not exit with status code 0
  • Serious path or path_alias problems (paths that exist on the submit host are different on the remote execution host or have been improperly aliased)
  • Network, routing or DNS errors that are interfering with LDAP, NIS or DNS

I have seen a few cases of actual jobs crashing and causing queue instance state "E". Usually this seems to occur when the job itself has crashed and taken out its parent process (the 'sge_shepherd' daemon). If your job is bombing badly enough to wipe out the parent sge_shepherd process then SGE will usually toggle the queue instance into "E" state. This is still a fairly rare occurrence, so if you are trying to debug this situation I'd recommend first looking at hardware and OS level issues before looking too closely at the job as a root cause.

State "E" does not go away automatically

One big message to impart is that E states are persistent and never go away on their own (unlike many SGE queue and job states, which clear automatically). State "E" will persist through hardware reboots and Grid Engine restarts. The state has to be cleared manually by a Grid Engine administrator. Again, the reason for this is that SGE wants a human to investigate the root cause first in case there is potential for the "black hole" effect mentioned above.

If you think this was a transient problem, you can clear the queues and see what happens with your pending jobs. The command is "qmod -c (queue instance)".

To globally clear all E states in your SGE cluster:

qmod -c '*'
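
Before clearing everything globally, it may be worth listing which queue instances are actually affected. The sketch below filters "qstat -f" style output for entries whose state column contains "E". It assumes the layout "queuename qtype used/tot load_avg arch states" with the state in the sixth column. The helper name list_error_queues and the sample text are made up for illustration, so check the column layout against your own qstat output first.

```shell
#!/bin/sh
# Sketch only: list_error_queues is a hypothetical helper, and it assumes the
# "qstat -f" layout "queuename qtype used/tot load_avg arch states".

# Read "qstat -f" output on stdin; print queue instances whose state has "E".
list_error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Made-up sample in the assumed layout, used when no cluster is at hand.
sample='queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@comp28                   BIP   0/2       0.01     lx24-amd64    E
----------------------------------------------------------------------------
all.q@comp29                   BIP   1/2       0.35     lx24-amd64'

printf '%s\n' "$sample" | list_error_queues

# On a real cluster the same filter could be fed from qstat, for example:
#   qstat -f | list_error_queues
# and each instance cleared individually with: qmod -c <queue instance>
```

Clearing instances one at a time, after looking at each node, keeps the protective value of the E state intact.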

Troubleshooting and Diagnosing

  • qstat -explain E
  • Examine the node itself and OS logs with an eye towards entries relating to permissions, failures or access errors
  • Try to log in to the node in question using a username associated with a failed job. This will help diagnose any username, authentication or access issues
  • Look in the job output directory if it is available. Output from failed jobs can be extremely useful, especially if there is a path, ENV or permission problem
  • Examine the SGE logs with particular focus on the messages file created by the sge_execd on the execution host in question
  • If all else fails, SGE daemons will write log files to /tmp when they can't write to their normal spool location. Seeing recent SGE event data in /tmp instead of your normal spool location is a good indication of filesystem or permission errors
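
A couple of the checks above can be scripted. The following is a minimal sketch, meant to be run on the execution host in question, that tests whether an account is known to the host and whether a directory is writable. The helper names (user_exists, dir_writable, report) and the default arguments are hypothetical. It only covers the two most common causes and is no substitute for reading the logs.

```shell
#!/bin/sh
# Sketch only: helper names and the example arguments are hypothetical.

# Succeeds if the account is known on this host (local files, NIS, LDAP...).
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Succeeds if the directory exists and the current user can write to it.
dir_writable() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Run a check and report the outcome on one line.
report() {
    if "$@"; then
        printf 'OK    %s\n' "$*"
    else
        printf 'FAIL  %s\n' "$*"
    fi
}

# First argument: user name to check (defaults to the current user).
# Second argument: directory to check (defaults to /tmp).
report user_exists "${1:-$(id -un)}"
report dir_writable "${2:-/tmp}"
```

Running it while logged in as the user whose job failed, and pointing the second argument at the job's output directory, covers the account and shared-filesystem cases in one go.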

I'll try to keep this page updated in the future with new information and troubleshooting hints.

One day grid engine training seminar in Boston

Posted by chris Fri, 12 May 2006 00:56:53 GMT

Time for a brief commercial announcement. This is part of an ongoing personal experiment to see if there actually is demand for user and usage-centric Grid Engine training.

A 1-day seminar on "Grid Engine 6 Intro & Usage" will be offered on June 2, 2006 in the Boston area.

Intended Audience:

Anyone interested in a user-centric view of distributed computing with Grid Engine. Note: this is not an advanced operator/admin course.

Our goal is to help users, application integrators and developers understand Grid Engine features and capabilities in a way that allows them to become more productive at home. New SGE cluster operators or administrators also may benefit from the user and usage-centric perspective.

The seminar will be taught with a life science informatics focus, using bioinformatics workflows and applications as examples. As Grid Engine usage and configuration patterns can differ significantly between disciplines and industries, interested non-life-science attendees should contact us in advance to determine if this seminar will be a good fit.

The full announcement is at http://bioteam.net/dag/gridengine-training/

DRMAA picks up Python language bindings

Posted by chris Thu, 27 Oct 2005 17:05:03 GMT

Complementing the DRMAA Perl binding from Tim Harsch is a new DRMAA Python module by Enrico Sirola. This has been discussed on the lists (1, 2) and in a post on Dan's blog.

DanT writes about DRMAA

Posted by chris Tue, 25 Oct 2005 23:33:40 GMT

Fresh from his move back to the USA, Dan has posted a couple of Sun blog entries on DRMAA and Grid Engine. DRMAA is a GGF API specification for "the submission and control of jobs to one or more Distributed Resource Management (DRM) systems". It is currently well supported with Grid Engine 6, and it seems that folks are busy getting other systems to support DRMAA 1.0.

The first of two recent DRMAA posts is titled "Porting the DRMAA Java Language Binding":

Dan says: "There's been quite a bit of talk on various aliases (and over private email) recently about what's required to port the Grid Engine DRMAA Java™ language binding to another DRM. Since that is an interesting topic, I figured I'd assemble all of the answers here for easy reference...(more)"

The second entry is titled "Running Job Scripts With DRMAA". This topic has been popping up quite a bit on the Grid Engine lists recently (1, 2):

Dan says: DRMAA is intended as a general purpose API, which means it has to assume as little as possible about the jobs it runs. Grid Engine recognizes two broad classes of jobs: scripts and binaries. A script is a text file that is to be run by a shell and which may have embedded SGE options in it. (Lines starting with #$ are parsed by Grid Engine at job submission time for embedded options. See the man page for more info.) A binary is anything else. The user controls whether a job is treated as a binary or a script with the -b (for binary) qsub option. The Grid Engine default is to assume that all jobs are scripts.

DRMAA, however, makes a different assumption. The minimum assumption that DRMAA can make is that jobs are opaque and cannot be parsed, i.e. that all jobs are binary. This assumption is exactly the opposite of the one Grid Engine makes. Because jobs aren't assumed to be scripts, there are a few extra steps required to running scripts through DRMAA...(more)
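
To make the script versus binary distinction concrete, here is a small sketch that writes a tiny job script containing embedded "#$" lines and then lists them. The file, the job name demo_job and the helper embedded_options are hypothetical, and the grep merely illustrates the "#$" convention. It is not Grid Engine's actual parser.

```shell
#!/bin/sh
# Sketch only: the file name, job name and helper are hypothetical, and the
# grep below merely illustrates the "#$" convention; it is not SGE's parser.

# Print the embedded option lines (those starting with "#$") of a job script.
embedded_options() {
    grep '^#\$' "$1" | sed 's/^#\$ *//'
}

# A minimal job script with two embedded options.
job=$(mktemp)
cat > "$job" <<'EOF'
#!/bin/sh
#$ -N demo_job
#$ -cwd
hostname
EOF

embedded_options "$job"
rm -f "$job"

# Submitted as a script (the qsub default), the "#$" lines are honoured:
#   qsub demo.sh
# Submitted as a binary, the command is treated as opaque:
#   qsub -b y /bin/hostname
```

The two commented qsub lines show the two cases Dan describes: the default script mode, and the binary mode selected with -b.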

Appreciating grid engine 'man' pages

Posted by chris Thu, 22 Sep 2005 21:09:00 GMT

As is the case with many open source efforts, a rapid pace of development can often outpace the formal documentation efforts. This makes the SGE “man” pages critical resources for savvy users and administrators.

Why SGE man pages are so important

  • The man page entries are written by the developers who wrote the code

  • The man pages are maintained in the same CVS repository as the active gridengine codebase, making updates, additions and corrections a simple matter.

  • Nobody really knows how to update the official documentation or when a tech writer will be hired to revise it. Check out the amazing list of open documentation issues to see this for yourself.

  • What this means is that although the formal documentation is of high quality it will certainly lag behind the man pages when it comes to documenting new features, fixes and behaviors.

Links to the most current Grid Engine manpages

The following list of man page links point directly to the current Grid Engine source repository.

Click on any of them to see the latest and greatest documentation for the manpage in question.

Section 1
gethostbyaddr   gethostbyname   gethostname   getservbyname   hostnameutils   qacct   qalter   qconf   qdel   qhold   qhost   qlogin   qmake   qmod   qmon   qping   qresub   qrls   qrsh   qselect   qsh   qstat   qsub   qtcsh   sge_ckpt   sge_intro   sge_types   sgepasswd   submit  

Section 3
drmaa_allocate_job_template   drmaa_attributes   drmaa_control   drmaa_delete_job_template   drmaa_exit   drmaa_get_DRM_system   drmaa_get_attribute   drmaa_get_attribute_names   drmaa_get_contact   drmaa_get_next_attr_name   drmaa_get_next_attr_value   drmaa_get_next_job_id   drmaa_get_vector_attribute   drmaa_get_vector_attribute_names   drmaa_init   drmaa_job_ps   drmaa_jobcontrol   drmaa_jobtemplate   drmaa_misc   drmaa_release_attr_names   drmaa_release_attr_values   drmaa_release_job_ids   drmaa_run_bulk_jobs   drmaa_run_job   drmaa_session   drmaa_set_attribute   drmaa_set_vector_attribute   drmaa_strerror   drmaa_submit   drmaa_synchronize   drmaa_version   drmaa_wait   drmaa_wcoredump   drmaa_wexitstatus   drmaa_wifaborted   drmaa_wifexited   drmaa_wifsignaled   drmaa_wtermsig  

Section 5
access_list   accounting   bootstrap   calendar_conf   checkpoint   complex   host_aliases   host_conf   hostgroup   project   qtask   queue_conf   reporting   sched_conf   sge_aliases   sge_conf   sge_pe   sge_priority   sge_qstat   sge_request   sgepasswd   share_tree   user   usermapping  

Section 8
sge_execd   sge_qmaster   sge_schedd   sge_shadowd   sge_shepherd  

Using HostGroups and Cluster Queues

Posted by chris Mon, 19 Sep 2005 18:00:00 GMT


Charu writes:

A new document on Cluster Queues and Host Groups is available. Find it under "Cluster Management" at the HOWTO page: http://gridengine.sunsource.net/howto/howto.html or the direct link: http://www.sun.com/blueprints/0805/819-3165.html

The blueprint links to a 17 page PDF download which gives nice coverage of hostgroups, cluster queues and the difference between “cluster queues” and “queue instances”. A good read, especially for new SGE administrators or users.