Screencast showing online upgrade to SGE 6.2

Posted by chris Thu, 21 Aug 2008 14:48:19 GMT

Lubomir Petrik has posted a screencast recording showing the SGE 6.x to SGE 6.2 upgrade process. Thanks to Andy for finding and reporting this.

Why upgrade? DanT explains SGE from 5.x through 6.2 and beyond

Posted by chris Fri, 18 Jul 2008 18:58:52 GMT

Dan has posted a great overview of how Grid Engine has changed since the version 5.x days, couched in the context of answering the "Why should I upgrade SGE?" questions that often come up.

I won't even excerpt it, the full article is well worth a read:
http://blogs.sun.com/templedf/entry/why_upgrade

SGE and MPICH2 On Windows/Linux Heterogenous Systems

Posted by chris Mon, 14 Jul 2008 23:03:29 GMT

Thanks to Jacek Strzelczyk for the new Wiki page entitled "Install and configure Grid Engine in heterogenic environment on Linux and Windows with MPICH2" that was posted earlier this week.

Creating Hadoop PE under Grid Engine

Posted by chris Fri, 23 May 2008 14:13:24 GMT

Dan has found a great Sun blog post by Ravi Chandra Nallan on integrating Hadoop into SGE via a parallel environment.


SGE testbeds: Simulate mass numbers of exec hosts

Posted by chris Fri, 02 May 2008 12:42:00 GMT

An interesting message appeared on the developers list recently as a comment attached to Issue 2364. In it, Andreas explains the SIMULATE_EXECDS=true parameter, which allows unrestricted execution host creation by suppressing unknown host errors.

I can see this being very useful for testing SGE scheduler and policy configuration settings before implementing them on production systems.

From the comment:

This is a short HOWTO for the use of the cluster simulator:

(1) Start by installing a new SGE cluster as usual, but
install nothing more than the qmaster itself

(2) After successful installation use qconf -mconf to set

    SIMULATE_EXECDS=true

in the qmaster_params section of sge_conf(5). This causes the
suppression of the 'unknown' queue states.

(3) Make sure that "all.q" and any other queue that you
configure does not use any 'load_thresholds'. The cluster
simulator has no means of emulating load values. As a
result there will be no load values. For that reason
load_thresholds may not be used, as they would cause load
alarm queue states that prevent the scheduler from dispatching
jobs into your queues.

(4) Use qconf -ae|-Ae to create an arbitrary number of
simulated execution hosts. The hosts need not exist, as
qmaster won't try to send anything to them anyway, but the
hostnames must be resolvable.

Optionally:

(5) If you care for scheduler runtimes set

     PROFILE=true

in the params section of sched_conf(5) using qconf -msconf.

Now your simulated cluster is ready. You can send in
arbitrary numbers of jobs. Due to (2) and (3) the scheduler will
dispatch them and send corresponding orders to qmaster.
Qmaster will behave as if it were starting the jobs, but it
raises timers to ensure job state transitions happen as
usual. What won't work is interactive jobs (i.e. qrsh, qsh
etc.) and parallel jobs with control_slaves set to true in
sge_pe(5). A job's runtime can be controlled via the first job
argument. That means when

# qsub -b y /bin/sleep 5

is submitted, the job will finish after five seconds.
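
Step (4) gets tedious by hand once you want more than a few hosts, so here is a rough sketch of how the host definition files for "qconf -Ae" could be generated in bulk. The function name gen_sim_hosts, the "sim" hostname prefix and the output directory are my own placeholders, and the field list is what I would expect from host_conf(5); compare it against "qconf -se" output on your version before trusting it. The function only prints the qconf commands rather than running them.

```shell
#!/bin/sh
# Sketch: write one exec host definition file per simulated host and
# print (not run) the matching "qconf -Ae" command for each.
# Hypothetical names: gen_sim_hosts, the "sim" prefix, the output directory.
gen_sim_hosts() {
    count=$1           # number of simulated hosts to describe
    outdir=$2          # directory receiving one definition file per host
    prefix=${3:-sim}   # hostname prefix (assumption, pick your own)

    mkdir -p "$outdir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        host=$(printf '%s%03d' "$prefix" "$i")
        # Field names as expected from host_conf(5); everything optional
        # is left at NONE.
        cat > "$outdir/$host" <<EOF
hostname              $host
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
EOF
        # Dry run: emit the command so it can be reviewed first.
        echo "qconf -Ae $outdir/$host"
        i=$((i + 1))
    done
}
```

Something like "gen_sim_hosts 100 /tmp/simhosts | sh" would then add the hosts. Remember the HOWTO's caveat that every hostname must resolve, so the names probably need entries in /etc/hosts or DNS first.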

6.1 leak found; schedd_job_info is not your friend

Posted by chris Thu, 10 Apr 2008 15:07:00 GMT

Anyone interested in the memory leak that has been bothering some 6.1 users should check out the comments associated with Issue #2464:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464

Among the interesting things you'll see are:

  • A great example of motivated SGE users and developers working together to track down a hard-to-find problem
  • Interesting comments on the potentially "unfixable" (my word) nature of the schedd_job_info messages
  • A really cool workaround for getting job scheduler messages with schedd_job_info=FALSE

In a nutshell, there is a problem in the schedd_job_info framework that can cause massive resource utilization on the qmaster machine. This happens in particular on larger systems or places with large numbers of queue instances. This can also pop up on systems with jobs that are pending due to un-fulfillable resource requests. This explains why I saw the memory leak on my small testbed cluster -- I have a number of "pend forever" jobs in the queue for demonstration purposes.

The fix is to disable schedd_job_info. This is potentially problematic, though, as that feature is pretty much my go-to first step for troubleshooting job dispatch problems.

However, in a recent comment on this issue, Andreas added a tip for getting scheduling messages about a job in a way that puts far less load on the system AND does not require schedd_job_info=TRUE:

qalter -w v <job_id>
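
If you want to run that check across everything that is pending, a small wrapper along these lines might do. This is only a sketch: the function names are mine, it assumes the usual "qstat -s p" layout with the job ID in the first column, and it assumes qalter takes the job ID as its final argument. It prints the commands instead of running them.

```shell
#!/bin/sh
# pending_job_ids: read "qstat -s p" output on stdin and print each
# job ID once (array jobs show up on several lines).
pending_job_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -un
}

# validate_pending: print one "qalter -w v" command per pending job.
# Dry run only; pipe the result to sh to actually execute it.
validate_pending() {
    pending_job_ids | while read -r id; do
        echo "qalter -w v $id"
    done
}
```

Usage would be "qstat -s p | validate_pending" to review the list, then append "| sh" to run it.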

Remember though that comments found in a bug report are not "gospel" so don't read this as news that schedd_job_info is forever broken or going away. Expect to see this and other issues discussed as part of the SGE Roadmap. You are attending the May 2008 SGE Workshop, right?

Grid Engine and Apple OS X Launchd

Posted by chris Tue, 04 Mar 2008 16:20:42 GMT

This is a follow-up post relating to the new Apple framework for starting, stopping and managing persistent daemons and services called "launchd". The issue of Grid Engine interoperability with the launchd framework has already been covered in a gridengine.info Wiki article.

The news to report is that my coworker Bill Van Etten stumbled upon the SGE environment variable "SGE_ND" and realized that it could be useful for Apple launchd integration, because launchd really hates daemons that fork off ASAP upon startup. With "SGE_ND" set to true, the daemons don't fork and can be better managed by launchd.

The new launchd scripts are discussed and available for download here:
http://blog.bioteam.net/2008/03/04/apple-os-x-105-launchd-scripts-for-grid-engine/

Feel free to use these scripts or simply refer to them when customizing your own. As always, feedback and comments would be appreciated. BioTeam remains committed to making sure SGE remains an excellent choice for use on OS X based systems.
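
For anyone who just wants to see the shape of the idea, here is a minimal sketch of a launchd job definition with SGE_ND set. It is not the BioTeam script. The label, the SGE_ROOT path, the cell name and the architecture directory are all placeholders that I made up for illustration, and a real setup may also need the qmaster and execd port variables if they are not in /etc/services.

```shell
#!/bin/sh
# write_execd_plist: print a launchd job definition for sge_execd.
# Hypothetical values: the label, default SGE_ROOT, cell name and
# architecture directory are placeholders, not taken from the real scripts.
write_execd_plist() {
    sge_root=${1:-/common/sge}
    sge_cell=${2:-default}
    arch=${3:-darwin-x86}
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>net.example.sge.execd</string>
    <key>ProgramArguments</key>
    <array>
        <string>$sge_root/bin/$arch/sge_execd</string>
    </array>
    <key>EnvironmentVariables</key>
    <dict>
        <key>SGE_ROOT</key>
        <string>$sge_root</string>
        <key>SGE_CELL</key>
        <string>$sge_cell</string>
        <key>SGE_ND</key>
        <string>true</string>
    </dict>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
EOF
}
```

The important part is the SGE_ND entry under EnvironmentVariables: the daemon stays in the foreground, so launchd sees one long-lived process instead of a parent that exits right after forking.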

qstat kung fu

Posted by chris Mon, 04 Feb 2008 22:45:36 GMT

A user posted to the list looking for an efficient way of probing the designated output directory of active (pending or running) jobs.

Once again, Reuti comes up with a nice suggestion, this time employing a shell one-liner that pipes the output of a wildcard "qstat -j '*'" query through awk:

$ qstat -j "*" | awk ' /^job_number:/ { job_number=$2 } /^sge_o_workdir:/  \
{ print job_number, $2 } '

And like the original poster mentioned, I also had no idea that wildcards could be used with the "-j" option to qstat. Thanks Reuti!

tight MPICH2 integration broken with mpich2-1.0.6p1

Posted by chris Fri, 25 Jan 2008 13:53:56 GMT

If you are interested in tightly integrated MPICH2 environments, keep an eye on this mailing list thread. It seems that a current release (mpich2-1.0.6p1) has changed behavior in a way that breaks the existing methods for tight integration as documented in the HOWTO.

Older versions of the mpich2 code (version 1.0.4p1) still seem to integrate without error.

Why I love classic spooling

Posted by chris Thu, 24 Jan 2008 17:33:14 GMT

I had a fascinating SGE troubleshooting situation this morning. At first it started off as a normal "why does SGE refuse to start?" issue after a system OS update. The initial errors are very similar to the standard sorts of errors one sees when firewalls, DNS or hostname issues are breaking things:

mbgxsrv1:~ root# /common/sge/default/common/sgemaster start
   starting sge_qmaster
   starting sge_schedd
error: commlib error: got read error (closing "mbgxsrv1.xxx.xxx.xxx/qmaster/1")
error: commlib error: can't connect to service (Connection refused)
error: getting configuration: unable to contact qmaster using port 701 on host "mbgxsrv1.xxx.xxx.xxx"
error: can't get configuration from qmaster -- backgrounding
mbgxsrv1:~ root#

It turns out the root cause was much more interesting. A couple of critical SGE spool files had been turned into binary gibberish, possibly caused by a SAN reboot but we are not quite sure. On startup, the qmaster was unable to read in critical configuration data and would bomb out with errors.

This is where the organization's use of classic spooling saved the day. Working deep within the SGE spool directory, I was able to manually fix, replace and repair a couple of files, including the "qmaster/cqueues/all.q" file and the "qmaster/hostgroups/@allhosts" file. The fix took a few minutes to effect and SGE started up instantly and without error.
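
Finding the damaged files is half the battle. One rough way to hunt for them is to flag any file in the text-based parts of the spool that contains bytes outside plain ASCII. The sketch below does that with tr and wc. The function name is my own invention, and I would only point it at directories such as cqueues and hostgroups, since other parts of the spool may legitimately hold non-text data.

```shell
#!/bin/sh
# list_non_ascii_files: print every regular file under a directory that
# contains bytes other than tab, newline, carriage return or printable
# ASCII. Hypothetical helper, not part of Grid Engine.
list_non_ascii_files() {
    dir=$1
    find "$dir" -type f | while IFS= read -r f; do
        # Delete all "normal" text bytes and count whatever is left over.
        bad=$(LC_ALL=C tr -d '\11\12\15\40-\176' < "$f" | wc -c | tr -d ' ')
        if [ "$bad" -gt 0 ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, "list_non_ascii_files qmaster/cqueues" run from the spool directory would list only the queue files that no longer look like text.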

What does classic spooling have to do with this? Glad you asked! Had this site been running SGE in the default "berkeley spooling" mode, the files that I was able to quickly find and fix in place would have been locked inside a binary BDB-formatted database, inaccessible and unfixable without deep knowledge of the Berkeley DB command-line and troubleshooting tools. Had this been a berkeley-based spooling system, it would have been faster to simply wipe the SGE install and perform a new one from scratch.

This is why I'm a strong proponent of classic mode spooling. When berkeley-db spooling is used, you are giving up the beauty, utility and accessibility of ASCII text formatted state and spool files in exchange for "performance" that most users will never notice or realize (those of you that run tens of thousands of jobs per day will disagree but I'm talking about averages here ...).

My general rule now is to use classic mode spooling by default on clusters smaller than 32 nodes in size and on any cluster where I know the daily job throughput is not going to be extremely high. In general I think most users should start with classic mode spooling and only move to Berkeley-DB based spooling when they are comfortable enough with the system to (a) handle a reinstall and (b) actually gain from the performance that berkeley-db spooling offers.

Read on for more details on this particular incident ...

Gory Details

This is the qmaster messages file entry that clued us into a possible configuration state problem:

01/24/2008 09:50:20|qmaster|mbgxsrv1|W|conf_version not found on reading spool file
01/24/2008 09:50:20|qmaster|mbgxsrv1|W|only a single value is allowed for configuration attribute "Your"
01/24/2008 09:50:21|qmaster|mbgxsrv1|W|conf_version not found on reading spool file
01/24/2008 09:50:21|qmaster|mbgxsrv1|W|only a single value is allowed for configuration attribute "qtype"
01/24/2008 09:50:22|qmaster|mbgxsrv1|E|missing configuration attribute "group_name"
01/24/2008 09:50:22|qmaster|mbgxsrv1|C|!!!!!!!!!! lGetHost(): got NULL element for HGRP_name !!!!!!!!!!

This is what the spool file "qmaster/cqueues/all.q" looked like when I tried to view it in my terminal window:

This is what the queue configuration should have looked like:

qname              all.q
hostlist           @computeNodes
seq_no             0
load_thresholds    np_load_avg=1.75
suspend_thresholds NONE
nsuspend           1
suspend_interval   00:05:00
priority           0
min_cpu_interval   00:05:00
processors         UNDEFINED
qtype              BATCH INTERACTIVE
ckpt_list          NONE
pe_list            make mpich
rerun              FALSE
slots              2
tmpdir             /tmp
shell              /bin/csh
prolog             NONE

 ... snip ...

The "fix" consisted of these steps:

  1. Replace the corrupt @allhosts file by copying the known-good @testNodes file in its place
  2. Manually edit the "new" @allhosts file to properly set the name and group members
  3. Replace the corrupt all.q cqueues file by overwriting it with a copy of the test.q file which was not corrupted
  4. Manually edit the new "all.q" file to properly set the qname and other parameters

After those minor hand edits with the Unix copy command and a text editor, Grid Engine started up fine. Overall, the fix took about 10 minutes to implement once we had identified the two corrupt files.
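
The copy-and-rename part of those steps could be scripted roughly as follows. This is a sketch rather than what I actually typed: the function name and the backup directory are hypothetical, it only rewrites the name attribute (qname or group_name), and the remaining attributes such as the host list still need a hand edit. Stop the qmaster first, and make sure the restored file ends up owned by the SGE admin user.

```shell
#!/bin/sh
# restore_from_sibling: replace a corrupt spool file with a copy of a
# known-good sibling, rewriting only the attribute that holds its name.
# Hypothetical helper; run it only while sge_qmaster is stopped.
restore_from_sibling() {
    good=$1      # known-good file, e.g. qmaster/cqueues/test.q
    bad=$2       # corrupt file, e.g. qmaster/cqueues/all.q
    key=$3       # name attribute: qname or group_name
    newname=$4   # name the restored object should carry
    backup=$5    # directory outside the spool tree for the corrupt copy

    # Keep the corrupt file for inspection, but move it out of the spool
    # in case qmaster reads every file it finds there.
    mkdir -p "$backup" || return 1
    mv "$bad" "$backup/" || return 1

    # Copy the good file, replacing only the line that sets the name.
    awk -v key="$key" -v val="$newname" \
        '$1 == key { printf "%-18s %s\n", key, val; next } { print }' \
        "$good" > "$bad"
}
```

With that in place, steps 1 and 3 become "restore_from_sibling hostgroups/@testNodes hostgroups/@allhosts group_name @allhosts /tmp/sge-corrupt" and "restore_from_sibling cqueues/test.q cqueues/all.q qname all.q /tmp/sge-corrupt", followed by the manual edits from steps 2 and 4.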
