I had a fascinating SGE troubleshooting situation this morning. It started off as a normal “why does SGE refuse to start?” issue after an OS update. The initial errors were very similar to the standard errors one sees when firewall, DNS or hostname issues are breaking things:
mbgxsrv1:~ root# /common/sge/default/common/sgemaster start
   starting sge_qmaster
   starting sge_schedd
error: commlib error: got read error (closing "mbgxsrv1.xxx.xxx.xxx/qmaster/1")
error: commlib error: can't connect to service (Connection refused)
error: getting configuration: unable to contact qmaster using port 701 on host "mbgxsrv1.xxx.xxx.xxx"
error: can't get configuration from qmaster -- backgrounding
mbgxsrv1:~ root#
It turns out the root cause was much more interesting. A couple of critical SGE spool files had been turned into binary gibberish, possibly by a SAN reboot, but we are not quite sure. On startup, the qmaster was unable to read in critical configuration data and would bomb out with errors.
This is where the organization's use of classic spooling saved the day. Working deep within the SGE spool directory, I was able to manually fix, replace and repair a couple of files, including the “qmaster/cqueues/all.q” file and the “qmaster/hostgroups/@allhosts” file. The fix took a few minutes to effect and SGE started up instantly and without error.
What does classic spooling have to do with this? Glad you asked! Had this site been running SGE in the default “berkeley spooling” mode, the files that I was able to quickly find and fix in place would have been locked inside a binary BDB-formatted database, inaccessible and unfixable without deep knowledge of the Berkeley DB command-line and troubleshooting tools. On a berkeley-based spooling system it would have been faster to simply wipe the SGE install and perform a new one from scratch.
This is why I’m a strong proponent of classic mode spooling. When berkeley-db spooling is used, you are giving up the beauty, utility and accessibility of ASCII text formatted state and spool files in exchange for “performance” that most users will never notice or realize (those of you that run tens of thousands of jobs per day will disagree but I’m talking about averages here …).
My general rule now is to use classic mode spooling by default on clusters smaller than 32 nodes in size and on any cluster where I know the daily job throughput is not going to be extremely high. In general I think most users should start with classic mode spooling and only move to Berkeley-DB based spooling when they are comfortable enough with the system to (a) handle a reinstall and (b) actually gain from the performance that berkeley-db spooling offers.
Read on for more details on this particular incident …
This is the qmaster messages file entry that clued us into a possible configuration state problem:
01/24/2008 09:50:20|qmaster|mbgxsrv1|W|conf_version not found on reading spool file
01/24/2008 09:50:20|qmaster|mbgxsrv1|W|only a single value is allowed for configuration attribute "Your"
01/24/2008 09:50:21|qmaster|mbgxsrv1|W|conf_version not found on reading spool file
01/24/2008 09:50:21|qmaster|mbgxsrv1|W|only a single value is allowed for configuration attribute "qtype"
01/24/2008 09:50:22|qmaster|mbgxsrv1|E|missing configuration attribute "group_name"
01/24/2008 09:50:22|qmaster|mbgxsrv1|C|!!!!!!!!!! lGetHost(): got NULL element for HGRP_name !!!!!!!!!!
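Entries like these are easy to miss in a long messages file. As a rough sketch (not part of the original troubleshooting), the pipe-delimited layout seen above can be split into fields and filtered by severity. The field names and the reading of `W`, `E` and `C` as warning, error and critical are my assumptions based on the excerpt, and `parse_messages` and `serious_entries` are hypothetical helper names.

```python
from typing import NamedTuple


class MessageEntry(NamedTuple):
    # Field names are assumptions inferred from the excerpt above.
    timestamp: str
    daemon: str
    host: str
    level: str
    text: str


def parse_messages(lines):
    """Yield a MessageEntry for each well-formed line; skip anything else."""
    for line in lines:
        # Split on the first four pipes only, so pipes inside the text survive.
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) == 5:
            yield MessageEntry(*parts)


def serious_entries(lines, levels=("E", "C")):
    """Return only the entries whose level is in `levels`."""
    return [entry for entry in parse_messages(lines) if entry.level in levels]


sample = [
    '01/24/2008 09:50:20|qmaster|mbgxsrv1|W|conf_version not found on reading spool file',
    '01/24/2008 09:50:22|qmaster|mbgxsrv1|E|missing configuration attribute "group_name"',
    '01/24/2008 09:50:22|qmaster|mbgxsrv1|C|!!!!!!!!!! lGetHost(): got NULL element for HGRP_name !!!!!!!!!!',
]

for entry in serious_entries(sample):
    print(entry.level, entry.text)
```

In practice the `sample` list would be replaced by an open file handle on the qmaster messages file.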
When I tried to view the spool file “qmaster/cqueues/all.q” in my terminal window, all I got was binary gibberish.
This is what the queue configuration should have looked like:
name all.q
hostlist @computeNodes
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpich
rerun FALSE
slots 2
tmpdir /tmp
shell /bin/csh
prolog NONE
... snip ...
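Because classic spool files are supposed to be plain ASCII like the listing above, a quick scan can flag files that have turned into binary data without opening each one by hand. The following is a minimal sketch under that assumption; `looks_like_text` and `find_suspect_files` are hypothetical helpers, and the check is deliberately crude (any byte outside printable ASCII marks the file as suspect), so a hit needs a human look before anything is replaced.

```python
import os
import string

# Every byte value considered "normal" for an ASCII spool file:
# letters, digits, punctuation and whitespace.
PRINTABLE = set(string.printable.encode("ascii"))


def looks_like_text(data: bytes) -> bool:
    """Return True if every byte is printable ASCII or whitespace."""
    return all(byte in PRINTABLE for byte in data)


def find_suspect_files(spool_dir):
    """Walk spool_dir and return the paths of files containing non-text bytes."""
    suspects = []
    for root, _dirs, files in os.walk(spool_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as handle:
                if not looks_like_text(handle.read()):
                    suspects.append(path)
    return sorted(suspects)
```

Pointing `find_suspect_files` at the qmaster spool directory would list candidates such as a damaged “cqueues/all.q”. It cannot detect a file that is still ASCII but has the wrong contents.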
The "fix" consisted of these steps:
- Replace the corrupt @allhosts file by copying the known-good @testNodes file in its place
- Manually edit the "new" @allhosts file to properly set the name and group members
- Replace the corrupt all.q cqueues file by overwriting it with a copy of the test.q file which was not corrupted
- Manually edit the new "all.q" file to properly set the qname and other parameters
After those minor hand-edits, done with nothing more than the unix copy command and a text editor, Grid Engine was able to start up fine. Overall the fix took about 10 minutes to implement once we had identified the 2 corrupt files.
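For anyone who would rather script the copy-and-edit steps than do them by hand, here is a rough sketch of the same idea. It is not what we actually ran. The helper `rebuild_from_template` is hypothetical, the attribute names passed in (`group_name`, `hostlist`, `qname`) are assumptions about what needs changing, and the host names in the usage note are made up. The corrupt file is saved outside the spool tree on purpose, so that the qmaster does not try to read the backup as another queue or hostgroup. Only do this while the qmaster is stopped.

```python
import shutil
from pathlib import Path


def rebuild_from_template(template, target, overrides, backup_dir):
    """Recreate `target` from a known-good `template` spool file.

    Lines whose first word is a key in `overrides` are rewritten with the
    new value; every other line is copied unchanged.
    """
    template, target, backup_dir = Path(template), Path(target), Path(backup_dir)

    # Keep the corrupt original for later inspection, outside the spool tree.
    if target.exists():
        backup_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(target, backup_dir / target.name)

    rebuilt = []
    in_replaced_value = False
    for line in template.read_text().splitlines():
        if in_replaced_value:
            # Drop backslash-continued lines belonging to a replaced value.
            in_replaced_value = line.rstrip().endswith("\\")
            continue
        fields = line.split(None, 1)
        if fields and fields[0] in overrides:
            rebuilt.append(f"{fields[0]} {overrides[fields[0]]}")
            in_replaced_value = line.rstrip().endswith("\\")
        else:
            rebuilt.append(line)
    target.write_text("\n".join(rebuilt) + "\n")


# Hypothetical usage, run from inside the qmaster spool directory:
# rebuild_from_template(
#     "hostgroups/@testNodes", "hostgroups/@allhosts",
#     {"group_name": "@allhosts", "hostlist": "node01 node02"},
#     "/root/sge-corrupt-backup",
# )
```

The same call with “cqueues/test.q” as the template and “cqueues/all.q” as the target covers the queue file. The remaining queue parameters still deserve a manual read-through afterwards.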