Screencast showing online upgrade to SGE 6.2
Lubomir Petrik has posted a screencast recording showing the SGE 6.x to SGE 6.2 upgrade process. Thanks to Andy for finding and reporting this.
Why upgrade? DanT explains SGE from 5.x through 6.2 and beyond
Dan has posted a great overview of how Grid Engine has changed since the version 5.x days, couched in the context of answering the "Why should I upgrade SGE?" questions that often come up.
I won't even excerpt it; the full article is well worth a read:
http://blogs.sun.com/templedf/entry/why_upgrade
SGE and MPICH2 On Windows/Linux Heterogeneous Systems
Thanks to Jacek Strzelczyk for the new Wiki page entitled "Install and configure Grid Engine in heterogenic environment on Linux and Windows with MPICH2" that was posted earlier this week.
Creating Hadoop PE under Grid Engine
Dan has found a great Sun blog post by Ravi Chandra Nallan on integrating Hadoop into SGE via the use of a parallel environment.

Image source: http://hadoop.apache.org/core/
SGE testbeds: Simulate mass numbers of exec hosts
An interesting message showed up on the developers list recently as a comment attached to Issue 2364. In it, Andreas explains the SIMULATE_EXECDS=true parameter, which allows unrestricted execution host creation (by suppressing unknown host errors).
I can see this as being very useful for testing SGE scheduler and policy configuration settings before implementing them on production systems.
From the comment:
This is a short HOWTO for the use of the cluster simulator:
(1) Start with installing a new SGE cluster as used, but install not more than the qmaster itself.
(2) After successful installation use qconf -mconf to set SIMULATE_EXECDS=true in qmaster_params section of sge_conf(5). This causes the suppression of the 'unknown' queue states.
(3) Make sure the "all.q" and any other queue that you configure does not use any 'load_thresholds'. Cluster simulator has no means to anyhow emulate load values. As a result there will be no load values. For that reason load_thresholds may not be used as it would cause load alarm queue states that prevent scheduler from dispatching jobs into your queues.
(4) Use qconf -ae|-Ae to create arbitrary number of simulated execution hosts. The hosts needs not exist as qmaster anyways won't try to send anything to it, but the hostname must be resolvable.
Optionally: (5) If you care for scheduler runtimes set PROFILE=true in the params section of sched_conf(5) using qconf -msconf.
Now your simulated cluster is ready. You can send in arbitrary numbers of jobs. Due to (2) and (3) scheduler will dispatch them and send corresponding orders to qmaster. Qmaster will behave as if it would start the jobs, but it raise timers to ensure job state transitions are passed as used.
What won't work is interactive jobs (i.e. qrsh, qsh etc.) and parallel jobs with control_slaves set to true in sge_pe(5).
Jobs' runtime can be controlled via the first job argument. That means when
# qsub -b y /bin/sleep 5
is submitted, the job will finish after five seconds.
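For step (4), creating hosts one at a time gets old fast, so here is a rough sketch of how I would script it. The function names are my own invention, and the attribute list in the template is an assumption modeled on what `qconf -se` prints on a 6.x qmaster; compare it against your own installation before trusting it.

```shell
# Print an execution host definition for one simulated host.
# ASSUMPTION: the attribute list mirrors `qconf -se <host>` output on SGE 6.x.
sim_host_template() {
    printf '%-21s %s\n' \
        hostname         "$1" \
        load_scaling     NONE \
        complex_values   NONE \
        user_lists       NONE \
        xuser_lists      NONE \
        projects         NONE \
        xprojects        NONE \
        usage_scaling    NONE \
        report_variables NONE
}

# Register each hostname given as an argument via `qconf -Ae <file>`.
# Every name must be resolvable (for example through /etc/hosts), per step (4).
add_sim_hosts() {
    for host in "$@"; do
        tmp=$(mktemp) || return 1
        sim_host_template "$host" > "$tmp"
        qconf -Ae "$tmp"
        rm -f "$tmp"
    done
}
```

Something like `add_sim_hosts sim001 sim002 sim003` would then register three simulated hosts, provided those names resolve on the qmaster machine.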
6.1 leak found; schedd_job_info is not your friend
Anyone interested in the memory leak that has been bothering some 6.1 users should check out the comments associated with Issue #2464:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464
Among the interesting things you'll see are:
- A great example of motivated SGE users and developers working together to track down a hard-to-find problem
- Interesting comments on the potential "unfixable" (my words) nature of the schedd_job_info messages
- A really cool workaround for getting job scheduler messages with schedd_job_info=FALSE
In a nutshell, there is a problem in the schedd_job_info framework that can cause massive resource utilization on the qmaster machine. This happens in particular on larger systems or places with large numbers of queue instances. This can also pop up on systems with jobs that are pending due to un-fulfillable resource requests. This explains why I saw the memory leak on my small testbed cluster -- I have a number of "pend forever" jobs in the queue for demonstration purposes.
The fix is to disable schedd_job_info. This is potentially problematic, though, as that feature is pretty much my go-to first action for troubleshooting job dispatch problems.
However, in a recent update comment to this issue, Andreas added a possible tip for getting scheduling messages about a job in a way that puts far less load on the system AND does not require schedd_job_info=TRUE:
qalter -w v
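In practice the command takes a job ID after the flags. A tiny wrapper, sketched under my own assumptions (the name `why_pending` is made up, and the numeric-only check is a convenience of this sketch, not something qalter demands), might look like this:

```shell
# Ask for a verification report on one job via `qalter -w v <job_id>`.
# Only numeric job IDs are accepted here to catch typos early; that
# restriction is a choice of this sketch, not a limitation of qalter.
why_pending() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: why_pending <job_id>" >&2
            return 2
            ;;
    esac
    qalter -w v "$1"
}
```

Running `why_pending 1234` would then print whatever validation report qalter produces for job 1234.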
Remember though that comments found in a bug report are not "gospel" so don't read this as news that schedd_job_info is forever broken or going away. Expect to see this and other issues discussed as part of the SGE Roadmap. You are attending the May 2008 SGE Workshop, right?
Grid Engine and Apple OS X Launchd
This is a follow-up post relating to the new Apple framework for starting, stopping and managing persistent daemons and services called "launchd". The issue of Grid Engine interoperability with the launchd framework has already been covered in a gridengine.info Wiki article.
The new news to report is that my coworker Bill Van Etten stumbled upon the SGE environment variable "SGE_ND" and realized that it could be useful for Apple launchd integration because launchd really hates daemons that fork off ASAP upon startup. By setting the "SGE_ND" variable to true, the daemons don't fork and can be better managed by launchd.
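To illustrate the idea only (this is not taken from Bill's scripts), a wrapper along these lines would start a command with the variable set. The function name is hypothetical, and the daemon path in the usage note is an assumption about a typical install layout.

```shell
# Run a command with SGE_ND=true in its environment.
# The subshell keeps the variable from leaking into the calling shell.
run_no_daemonize() {
    ( SGE_ND=true; export SGE_ND; exec "$@" )
}
```

For example, `run_no_daemonize "$SGE_ROOT/bin/$("$SGE_ROOT/util/arch")/sge_execd"` should keep the execd in the foreground, assuming your binaries live under that path.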
The new launchd scripts are discussed and available for download here:
http://blog.bioteam.net/2008/03/04/apple-os-x-105-launchd-scripts-for-grid-engine/
Feel free to use these scripts or simply refer to them when customizing your own. As always, feedback and comments would be appreciated. BioTeam remains committed to making sure SGE remains an excellent choice for use on OS X based systems.
qstat kung fu
A user posted to the list looking for an efficient way of probing the designated output directory of active (pending or running) jobs.
Once again, Reuti comes up with a nice suggestion, this time employing a shell one-liner that pipes the output of a wildcard "qstat -j '*'" query through awk:
$ qstat -j "*" | awk ' /^job_number:/ { job_number=$2 } /^sge_o_workdir:/ \
{ print job_number, $2 } '
And like the original poster mentioned, I also had no idea that wildcards could be used with the "-j" option to qstat. Thanks Reuti!
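If the goal is to actually probe those directories, the one-liner can be split into two small filters. This is just a sketch: the function names are mine, and like the original it only keeps the first whitespace-separated word of the path, so directories with spaces in their names will be cut short.

```shell
# Read `qstat -j "*"` output on stdin and print "<job_number> <workdir>".
job_workdirs() {
    awk '/^job_number:/    { job = $2 }
         /^sge_o_workdir:/ { print job, $2 }'
}

# Read "<job_number> <workdir>" pairs and report directories that are missing.
missing_workdirs() {
    while read -r job dir; do
        [ -d "$dir" ] || printf 'job %s: %s not found\n' "$job" "$dir"
    done
}
```

Chained together that would be `qstat -j "*" | job_workdirs | missing_workdirs`.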
tight MPICH2 integration broken with mpich2-1.0.6p1
If you are interested in tightly integrated MPICH2 environments, keep an eye on this mailing list thread. It seems that a current release (mpich2-1.0.6p1) has some sort of changed behavior that breaks the existing methods for tight integration as documented in the HOWTO.
Older versions of the mpich2 code (version 1.0.4p1) still seem to integrate without error.
Why I love classic spooling
I had a fascinating SGE troubleshooting situation this morning. At first it started off as a normal "why does SGE refuse to start?" issue after a system OS update. The initial errors are very similar to the standard sorts of errors one sees when firewalls, DNS or hostname issues are breaking things:
mbgxsrv1:~ root# /common/sge/default/common/sgemaster start
starting sge_qmaster
starting sge_schedd
error: commlib error: got read error (closing "mbgxsrv1.xxx.xxx.xxx/qmaster/1")
error: commlib error: can't connect to service (Connection refused)
error: getting configuration: unable to contact qmaster using port 701 on host "mbgxsrv1.xxx.xxx.xxx"
error: can't get configuration from qmaster -- backgrounding
mbgxsrv1:~ root#
It turns out the root cause was much more interesting. A couple of critical SGE spool files had been turned into binary gibberish, possibly caused by a SAN reboot but we are not quite sure. On startup, the qmaster was unable to read in critical configuration data and would bomb out with errors.
This is where the use of classic spooling by the organization saved the day. Working deep within the SGE spool directory, I was able to manually fix, replace and repair a couple of files including the "qmaster/cqueues/all.q" file and the "qmaster/hostgroups/@allhosts" files. The fix took a few minutes to effect and SGE started up instantly and without error.
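With classic spooling, hunting for damaged files can be scripted. Here is a minimal sketch, assuming that healthy configuration files contain nothing but printable ASCII plus tabs and line endings; the function names are hypothetical. I would point it only at configuration subdirectories such as cqueues and hostgroups, since other parts of the spool area may legitimately hold non-text data.

```shell
# Succeed when the file contains only tab, newline, carriage return and
# printable ASCII bytes; fail when anything else (or an unreadable file) shows up.
is_plain_text() {
    [ -r "$1" ] || return 1
    odd=$(LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | wc -c | tr -d ' ')
    [ "$odd" -eq 0 ]
}

# Walk a directory tree and list the files that fail the check.
find_damaged_spool_files() {
    find "$1" -type f | while IFS= read -r f; do
        is_plain_text "$f" || printf '%s\n' "$f"
    done
}
```

Run from the spool directory, `find_damaged_spool_files qmaster/cqueues` would print the path of any queue file that looks like binary gibberish.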
What does classic spooling have to do with this? Glad you asked! Had this site been running SGE in the default "berkeley spooling" mode, the files that I was able to quickly find and fix in place would have been locked inside some binary BDB-formatted database -- inaccessible and unfixable without deep knowledge of Berkeley DB command line and troubleshooting tools. Had this been a berkeley-based spooling system, it would have been faster to simply wipe the SGE install and perform a new one from scratch.
This is why I'm a strong proponent of classic mode spooling. When berkeley-db spooling is used, you are giving up the beauty, utility and accessibility of ASCII text formatted state and spool files in exchange for "performance" that most users will never notice or realize (those of you that run tens of thousands of jobs per day will disagree but I'm talking about averages here ...).
My general rule now is to use classic mode spooling by default on clusters smaller than 32 nodes in size and on any cluster where I know the daily job throughput is not going to be extremely high. In general I think most users should start with classic mode spooling and only move to Berkeley-DB based spooling when they are comfortable enough with the system to (a) handle a reinstall and (b) actually gain from the performance that berkeley-db spooling offers.
Read on for more details on this particular incident ...
Gory Details
This is the qmaster messages file entry that clued us into a possible configuration state problem:
01/24/2008 09:50:20|qmaster|mbgxsrv1|W|conf_version not found on reading spool file
01/24/2008 09:50:20|qmaster|mbgxsrv1|W|only a single value is allowed for configuration attribute "Your"
01/24/2008 09:50:21|qmaster|mbgxsrv1|W|conf_version not found on reading spool file
01/24/2008 09:50:21|qmaster|mbgxsrv1|W|only a single value is allowed for configuration attribute "qtype"
01/24/2008 09:50:22|qmaster|mbgxsrv1|E|missing configuration attribute "group_name"
01/24/2008 09:50:22|qmaster|mbgxsrv1|C|!!!!!!!!!! lGetHost(): got NULL element for HGRP_name !!!!!!!!!!
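When the messages file is long, a quick filter helps the serious entries stand out. This sketch assumes one entry per line with "|"-separated fields laid out as in the excerpt above, and that E and C in the fourth field mark errors and critical messages; the function name is my own.

```shell
# Read a qmaster messages file on stdin and keep only the entries whose
# fourth "|"-separated field is E or C.
qmaster_errors() {
    awk -F'|' '$4 == "E" || $4 == "C"'
}
```

Usage would be something like `qmaster_errors < messages`, run from wherever your qmaster keeps its messages file.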
This is what the spool file "qmaster/cqueues/all.q" looked like when I tried to view it in my terminal window:
This is what the queue configuration should have looked like:
name                  all.q
hostlist              @computeNodes
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich
rerun                 FALSE
slots                 2
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
... snip ...
The "fix" consisted of these steps:
- Replace the corrupt @allhosts file by copying the known-good @testNodes file in its place
- Manually edit the "new" @allhosts file to properly set the name and group members
- Replace the corrupt all.q cqueues file by overwriting it with a copy of the test.q file which was not corrupted
- Manually edit the new "all.q" file to properly set the qname and other parameters
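The copy steps above can be sketched as a small helper. This is an illustration under assumptions, not a record of the exact commands I typed: the function name is hypothetical, and the damaged file is moved to a backup directory outside the spool tree because I would not want qmaster to trip over a stray file sitting next to the real ones. I would also make sure the qmaster is stopped before touching anything.

```shell
# Move the damaged file into a backup directory (outside the spool tree),
# then copy a known-good file of the same kind into its place for hand editing.
replace_from_good() {
    bad=$1
    good=$2
    backup_dir=$3
    [ -f "$good" ] || { echo "no such file: $good" >&2; return 1; }
    [ -d "$backup_dir" ] || { echo "no such directory: $backup_dir" >&2; return 1; }
    if [ -e "$bad" ]; then
        mv "$bad" "$backup_dir/$(basename "$bad").corrupt" || return 1
    fi
    cp -p "$good" "$bad"
}

# Example calls, run from the spool directory (the backup path is a placeholder):
#   replace_from_good qmaster/hostgroups/@allhosts qmaster/hostgroups/@testNodes /root/spool-backup
#   replace_from_good qmaster/cqueues/all.q qmaster/cqueues/test.q /root/spool-backup
```

The manual edits still have to follow: the copied files carry the name and members of the file they were cloned from until you fix them by hand.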