Directing jobs to particular machines
Andreas posted a short usage tip today for people who need to direct their jobs to particular named execution hosts. It covers the syntax for referencing an execution host name after a queue name or wildcard character:
You can do this with the -q option: -q "*@comp28,*@comp29" or, even shorter, -q "*@comp28|comp29". With the -l option you can specify the host(s) like this: -l h=comp28|comp29. If you have a large cluster with high throughput, I recommend using -q, since -q matchmaking is generally faster.
Note: one more syntax tip. Remember the difference between "@" and "@@":
-q all.q@comp28 -- refers to a specific host
-q all.q@@LinuxNodes -- refers to a specific hostgroup
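When the host list comes from somewhere else (a file, another script), it can help to build the -q argument programmatically. The following is only a minimal sketch: queue_at_hosts is a hypothetical helper name, not part of Grid Engine, and myjob.sh stands in for whatever you actually submit.

```shell
#!/bin/sh
# Build a comma-separated -q destination list from one queue name
# (or the "*" wildcard) and any number of hosts or hostgroups.
# queue_at_hosts is a hypothetical helper, not a Grid Engine command.
queue_at_hosts() {
    queue=$1
    shift
    list=""
    for host in "$@"; do
        # Join entries with commas, as in -q "*@comp28,*@comp29"
        list="${list:+$list,}${queue}@${host}"
    done
    printf '%s\n' "$list"
}

# Print the destination list for the two hosts used in the tip above.
queue_at_hosts '*' comp28 comp29
```

Typical use would be something like qsub -q "$(queue_at_hosts '*' comp28 comp29)" myjob.sh. Keep the double quotes: without them the shell may expand the "*" itself. The same applies to the -l form, where an unquoted "|" would be read as a pipe, so write -l 'h=comp28|comp29'. Passing @LinuxNodes as a "host" yields the "@@" hostgroup form.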
Understanding queue error state 'E'
At my day job I usually handle SGE related questions from our customers and clients. This morning, after responding to a support request concerning an SGE queue in state "E", I got curious and tried to work out how often we had been asked about this. It turns out that I've probably sent ~25-30 unique responses on this specific subject, and each time my written response was different. This post is an attempt to create a single article that I can point people at as needed ...
Seeing "E" in the state column of qstat?
E state errors usually mean that an attempt to start a job failed in a spectacular manner and the Grid Engine qmaster decided to close off the queue instance to new jobs.
This is an important Grid Engine protective measure designed to keep your remaining pending jobs from a "black hole" draining effect in which they all successively get dispatched to the "bad" node and die instantly with errors.
There are different causes of state E -- in most cases the root cause is some large, systemic hardware or OS level error or misconfiguration. Typical examples include:
- The username of the job submitter does not exist on the execution host (extremely common)
- Shared filesystem failure
- Parallel jobs: syntax errors or bad commands in "start_proc_args" or "stop_proc_args" as defined within the parallel environment (PE)
- Serial jobs: syntax errors in a "prolog" or "epilog" script, or a "prolog" or "epilog" script that does not exit with status code 0
- Serious path or path_alias problems (paths that exist on the submit host are different on the remote execution host or have been improperly aliased)
- Network, routing or DNS errors that are interfering with LDAP, NIS or DNS
I have seen a few cases of actual jobs crashing and causing queue instance state "E". Usually this seems to occur when the job itself has crashed and taken out its parent process (the 'sge_shepherd' daemon). If your job is bombing badly enough to wipe out the parent sge_shepherd process then SGE will usually toggle the queue instance into "E" state. This is still a fairly rare occurrence, so if you are trying to debug this situation I'd recommend first looking at hardware and OS level issues before looking too closely at the job as a root cause.
State "E" does not go away automatically
One big message to impart is that E states are persistent and never go away on their own (unlike many SGE queue and job states, which clear automatically). State "E" will persist through hardware reboots and Grid Engine restarts. The state has to be cleared manually by a Grid Engine administrator. Again, the reason for this is that SGE wants a human to investigate the root cause first in case there is potential for the "black hole" effect mentioned above.
If you think this was a transient problem you can clear the queues and see what happens with your pending jobs. The command is "qmod -c (queue instance)".
To globally clear all E states in your SGE cluster:
qmod -c '*'
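Before clearing everything at once, it can be worth listing exactly which queue instances are in state E so each one can be looked at. The filter below is a rough sketch, not an official tool: error_queues is a hypothetical helper, and it assumes the "qstat -f" layout in which a queue instance line starts with queue@host and carries its states in the sixth column. Check that against your own qstat output before relying on it.

```shell
#!/bin/sh
# Read "qstat -f" style output on stdin and print the queue instances
# whose state column contains "E".
# Assumption: queue instance lines start with queue@host and show
# their states in the sixth whitespace-separated column.
# error_queues is a hypothetical helper, not a Grid Engine command.
error_queues() {
    awk '$1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Intended use on a machine with Grid Engine installed:
#   qstat -f | error_queues
# then, after investigating a node, clear just that instance:
#   qmod -c all.q@comp28
```

Clearing one queue instance at a time keeps a genuinely broken node closed while the healthy ones go back to work.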
Troubleshooting and Diagnosing
- qstat -explain E
- Examine the node itself and OS logs with an eye towards entries relating to permissions, failures or access errors
- Try to login to the node in question using a username associated with a failed job. This will help diagnose any username, authentication or access issues
- Look in the job output directory if it is available. Output from failed jobs can be extremely useful, especially if there is a path, ENV or permission problem
- Examine the SGE logs with particular focus on the messages file created by the sge_execd on the execution host in question
- If all else fails, SGE daemons will write log files to /tmp when they can't write to their normal spool location. Seeing recent SGE event data in /tmp instead of your normal spool location is a good indication of filesystem or permission errors
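Two of the checks above (does the submitting user exist on the node, and is a given directory writable) are easy to script. This is a minimal sketch meant to be run on the execution host itself. Both function names are hypothetical, and which directory to test (the job output directory, the spool location) depends on your own installation.

```shell
#!/bin/sh
# Hypothetical helpers for a first look at a node with queues in state E.
# Run them on the execution host in question.

# Does the job submitter's username resolve on this host?
check_submit_user() {
    user=$1
    if id "$user" >/dev/null 2>&1; then
        echo "$user: known on $(uname -n)"
    else
        echo "$user: NOT known on $(uname -n)"
        return 1
    fi
}

# Is the directory present and writable for the account running this check?
check_dir_writable() {
    dir=$1
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "$dir: writable"
    else
        echo "$dir: missing or not writable"
        return 1
    fi
}
```

For example, check_submit_user followed by the username from a failed job, then check_dir_writable followed by that job's output directory. Note that the writability test reflects the account running the check, so run it as the user whose access you actually care about.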
I'll try to keep this page updated in the future with new information and troubleshooting hints.
One day grid engine training seminar in Boston
Time for a brief commercial announcement. This is part of an ongoing personal experiment to see if there actually is demand for user and usage-centric Grid Engine training.
A 1-day seminar on "Grid Engine 6 Intro & Usage" will be offered on June 2, 2006 in the Boston area.
Intended Audience:
Anyone interested in a user-centric view of distributed computing with Grid Engine. Note: this is not an advanced operator/admin course.
Our goal is to help users, application integrators and developers understand Grid Engine features and capabilities in a way that allows them to become more productive at home. New SGE cluster operators or administrators also may benefit from the user and usage-centric perspective.
The seminar will be taught with a life science informatics focus, using bioinformatics workflows and applications as examples. As Grid Engine usage and configuration patterns can differ significantly between disciplines and industries, interested non-life-science attendees should contact us in advance to determine if this seminar will be a good fit.
The full announcement is at http://bioteam.net/dag/gridengine-training/
DRMAA picks up Python language bindings
DanT writes about DRMAA
Fresh from his move back to the USA, Dan has posted a couple of Sun blog entries on DRMAA and Grid Engine. DRMAA is a GGF API specification for "the submission and control of jobs to one or more Distributed Resource Management (DRM) systems". It is currently well supported with Grid Engine 6 and it seems that folks are busy getting other systems to support DRMAA 1.0.
The first of two recent DRMAA posts is titled "Porting the DRMAA Java Language Binding":
Dan says: "There's been quite a bit of talk on various aliases (and over private email) recently about what's required to port the Grid Engine DRMAA Java™ language binding to another DRM. Since that is an interesting topic, I figured I'd assemble all of the answers here for easy reference...(more)"
The second entry is titled "Running Job Scripts With DRMAA". This topic has been popping up quite a bit on the Grid Engine lists recently (1, 2):
Dan says: DRMAA is intended as a general purpose API, which means it has to assume as little as possible about the jobs it runs. Grid Engine recognizes two broad classes of jobs: scripts and binaries. A script is a text file that is to be run by a shell and which may have embedded SGE options in it. (Lines starting with #$ are parsed by Grid Engine at job submission time for embedded options. See the man page for more info.) A binary is anything else. The user controls whether a job is treated as a binary or a script with the -b (for binary) qsub option. The Grid Engine default is to assume that all jobs are scripts.
DRMAA, however, makes a different assumption. The minimum assumption that DRMAA can make is that jobs are opaque and cannot be parsed, i.e. that all jobs are binary. This assumption is exactly the opposite of the one Grid Engine makes. Because jobs aren't assumed to be scripts, there are a few extra steps required to running scripts through DRMAA...(more)
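For anyone who has not seen embedded options before, here is a small illustrative job script. It is a sketch of my own rather than anything from Dan's post: the job name demo_job and the describe_job function are made up for the example, while -N, -cwd and -j are ordinary qsub options.

```shell
#!/bin/sh
# Lines that begin with the two characters "#" and "$" are read by
# Grid Engine when the script is submitted; to the shell they are
# ordinary comments, so the script also runs by hand.
#$ -N demo_job
#$ -cwd
#$ -j y

# JOB_ID is set by Grid Engine inside a running job. The "unset"
# fallback only matters when the script is run outside Grid Engine.
describe_job() {
    echo "job ${JOB_ID:-unset} on $(uname -n)"
}

describe_job
```

Submitted as a script (the qsub default), the three embedded options are honored. Submitted as a binary with -b y, the file is not parsed for them, which is the situation DRMAA assumes by default.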
Appreciating grid engine 'man' pages
As is the case with many open source efforts, a rapid pace of development can often outpace the formal documentation efforts. This makes the SGE “man” pages critical resources for savvy users and administrators.
Why SGE man pages are so important
- The man page entries are written by the developers who wrote the code
- The man pages are maintained in the same CVS repository as the active gridengine codebase making updates, additions and corrections a simple matter.
- Nobody really knows how to update the official documentation or when a tech writer will be hired to revise it. Check out the amazing list of open documentation issues to see this for yourself.
- What this means is that although the formal documentation is of high quality it will certainly lag behind the man pages when it comes to documenting new features, fixes and behaviors.
Links to the most current Grid Engine manpages
The following list of man page links point directly to the current Grid Engine source repository.
Click on any of them to see the latest and greatest documentation for the manpage in question.
Section 1
gethostbyaddr gethostbyname gethostname getservbyname hostnameutils qacct qalter qconf qdel qhold qhost qlogin qmake qmod qmon qping qresub qrls qrsh qselect qsh qstat qsub qtcsh sge_ckpt sge_intro sge_types sgepasswd submit
Section 3
drmaa_allocate_job_template drmaa_attributes drmaa_control drmaa_delete_job_template drmaa_exit drmaa_get_DRM_system drmaa_get_attribute drmaa_get_attribute_names drmaa_get_contact drmaa_get_next_attr_name drmaa_get_next_attr_value drmaa_get_next_job_id drmaa_get_vector_attribute drmaa_get_vector_attribute_names drmaa_init drmaa_job_ps drmaa_jobcontrol drmaa_jobtemplate drmaa_misc drmaa_release_attr_names drmaa_release_attr_values drmaa_release_job_ids drmaa_run_bulk_jobs drmaa_run_job drmaa_session drmaa_set_attribute drmaa_set_vector_attribute drmaa_strerror drmaa_submit drmaa_synchronize drmaa_version drmaa_wait drmaa_wcoredump drmaa_wexitstatus drmaa_wifaborted drmaa_wifexited drmaa_wifsignaled drmaa_wtermsig
Section 5
access_list accounting bootstrap calendar_conf checkpoint complex host_aliases host_conf hostgroup project qtask queue_conf reporting sched_conf sge_aliases sge_conf sge_pe sge_priority sge_qstat sge_request sgepasswd share_tree user usermapping
Section 8
sge_execd sge_qmaster sge_schedd sge_shadowd sge_shepherd
Using HostGroups and Cluster Queues
Charu writes:
A new document on Cluster Queues and Host Groups is available. Find it under "Cluster Management" at the HOWTO page:
http://gridengine.sunsource.net/howto/howto.html
or the direct link:
http://www.sun.com/blueprints/0805/819-3165.html
The blueprint links to a 17 page PDF download which gives nice coverage of hostgroups, cluster queues and the difference between “cluster queues” and “queue instances”. A good read, especially for new SGE administrators or users.
