Open Source Grid & Cluster Conference Photostream
I'll post links to talk slides shortly. Meanwhile, a photo stream from the event can be found here:
http://flickr.com/groups/opensourcegridcluster/
SGE 6.2 beta binaries are available for testing
I'm not going to waste time copying the release announcement into a blog post. The full announcement can be read here:
http://gridengine.sunsource.net/servlets/ReadMsg?list=announce&msgNo=94
Lots of significant changes in the product itself. I also love the migration of manuals and docs to the new http://wikis.sun.com/display/GridEngine site.
Please remember that the reason for this beta release is to allow you to test 6.2 before it officially goes out the door in final form. The more people we have working on and stress-testing 6.2, the less chance there is of an inconvenient or unexpected upgrade issue, bug or glitch. The developers have good testbed environments and testsuites but they can't simulate all the different ways and methods that we use (and abuse!) SGE to get work done. Help make the 6.2 release a big success by testing now and providing feedback.
Testing flickr screencast hosting
I'm doing a short "Intro to SGE" tutorial today as part of the Univa ClusterExpress tutorial that is happening this week at the http://www.opensourcegridcluster.org/ conference.
As part of general paranoia I recorded some screencasts of trivial SGE command line usage to play at my talk if my demo system becomes unavailable. Just for the heck of it I uploaded some of these screencasts to my Flickr photostream and then added them to the conference group photo pool.
Please let me know what you think by leaving a comment here or dropping me an email. I don't think the quality is all that great, as the text is often hard to read. I may go back to producing screencasts with Camtasia Studio and hosting them over at http://www.screencast.com (if you see the videos in the lower sidebar, those are done in Camtasia and hosted at screencast.com).
SGE XML output getting some needed attention
For people like myself who are interested in (or, say, dependent on) the XML output features of Grid Engine, it's been a lonely time. This area of Grid Engine was not really getting much love, attention or bug fixes until recently.
Happy to report that this seems to have changed. If you are at all interested in using SGE data in XML form then you may want to:
- Pay attention to this mailing list thread
- Watch this SGE Wiki page
Kudos to Michael Pospisil from the Sun Microsystems SGE developer team in Prague for soliciting and listening to community input -- looks like the change may be bigger than simple bug fixes and output normalization. There is some talk about making XML output more usable to the end-users instead of the current design where XML output is largely a straight representation of internal SGE Cull lists and data structures.
Roland: things that affect job deletion time
In this interesting users-list thread, Roland provides some nice comments on the various things that can affect the time it takes to delete a Grid Engine job.
Specifically mentioned is a new hash implementation slated for the upcoming 6.2 release that dramatically improves things.
From Roland's post:
...for GE 6.2 I've analyzed the hotspots deleting jobs and what I've found is:
1) the time deleting a job increases with the amount of pending jobs in the cluster and the amount of queue instances. The reason for this is the messages list for schedd_job_info. Every message in the qstat -j output is one list element and below this element are the job id references stored inheriting this message. At job deletion time qmaster has to loop over the whole list of messages and loop over all references to removes right one. As a matter of fact this does not scale, and for 6.2 I've added a hash access to the reference id that decreased the job deletion time in large clusters heavily. Sadly I don't remember the exact numbers.
To verify this you can disable schedd_job_info in the scheduler config and then delete your jobs.
2) The job script and the job itself needs to be removed from the database. This time depends if you use berkeleydb or classic spooling and if you spool on local storage or on a NFS share. As faster your access to the storage is as faster you can delete the jobs.
If disabling schedd_job_info doesn't help in your case you might be hit by this point.
3) With 6.1u3 we've introduced the parameters gdi_timeout and gdi_retries to tune this behaviour. But that's anyway more a workaround than a real solution.
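To make Roland's first point a bit more concrete, here is a toy Python model of the difference between looping over every message and every job reference versus looking the job up through a hash. This is not SGE code: the function names and the list/set layout are invented for illustration, and the real qmaster data structures are certainly different.

```python
from collections import defaultdict


def build_messages(n_messages, jobs_per_message):
    """Toy stand-in for the scheduler message list (hypothetical layout):
    each message carries a list of the job ids that reference it."""
    return [list(range(jobs_per_message)) for _ in range(n_messages)]


def delete_job_linear(messages, job_id):
    """Loop over all messages and all references, counting comparisons."""
    steps = 0
    for refs in messages:
        for i, ref in enumerate(refs):
            steps += 1
            if ref == job_id:
                del refs[i]
                break
    return steps


def build_index(message_sets):
    """Hash from job id to the positions of the messages that mention it."""
    index = defaultdict(list)
    for pos, refs in enumerate(message_sets):
        for ref in refs:
            index[ref].append(pos)
    return index


def delete_job_hashed(message_sets, index, job_id):
    """Visit only the messages the index says reference this job."""
    steps = 0
    for pos in index.pop(job_id, []):
        message_sets[pos].discard(job_id)
        steps += 1
    return steps
```

With 100 messages that each reference 50 jobs, removing the last job costs 5000 comparisons in the linear version and 100 set removals in the hashed one. The gap widens as pending jobs and queue instances grow, which matches the scaling problem described above.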
Keeping single slot jobs off of certain nodes
In this thread, Paul asks:
"I'm looking at finding a way to either limit single-slot jobs, or requiring all jobs in a given queue to be running in a pe. Specifically, I have some SMP nodes, that I'd rather not waste on single thread, and also keep the single thread jobs off of the infiniband connected nodes. I have gigE small cpu count nodes for this task."
Dan replied with another example of clever use of the new SGE Resource Quota syntax within SGE 6.1 and later:
You can use resource quota sets to restrict non-PE jobs to certain queues or hosts.
limit pes !* hosts @smp to slots=0
Slick!
Think I'm going to like the new Sun wiki
One of the more interesting things (to me at least!) in the recent news about the SGE 6.2 beta was the word that all documentation and manuals would be moving to a new home at http://wikis.sun.com.
I registered a user account a few days ago and hit the site today to see if any SGE stuff had made it over. The screenshot below is what I found. It's nice to see smart tech people have a sense of humor. The downtime must have been short, as the site was back to normal when I refreshed the browser.
SGE 6.2 goes beta next week (your help needed)
SGE 6.2 is being released in Beta form next week and the developers are asking for people to make some time if possible to fully test out the beta snapshot of the latest major SGE point release.
Andy's full note can be found here (well worth reading in full ...):
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=24426
In my mind, I'm most excited about the following:
- Advance Reservations & array job inter-dependencies
- The scheduler is now a thread within the qmaster!
- The JVM running within the qmaster
- SGE moving all docs into wiki form!
SGE testbeds: Simulate mass numbers of exec hosts
Interesting message on the developers list recently as a comment attached to Issue 2364. In it, Andreas explains the use of the SIMULATE_EXECDS=true parameter, which allows unrestricted execution host creation (by suppressing unknown host errors).
I can see this as being very useful for testing SGE scheduler and policy configuration settings before implementing them on production systems.
From the comment:
This is a short HOWTO for the use of the cluster simulator:
(1) Start with installing a new SGE cluster as used, but install not more than the qmaster itself.
(2) After successful installation use qconf -mconf to set SIMULATE_EXECDS=true in qmaster_params section of sge_conf(5). This causes the suppression of the 'unknown' queue states.
(3) Make sure the "all.q" and any other queue that you configure does not use any 'load_thresholds'. Cluster simulator has no means to anyhow emulate load values. As a result there will be no load values. For that reason load_thresholds may not be used as it would cause load alarm queue states that prevent scheduler from dispatching jobs into your queues.
(4) Use qconf -ae|-Ae to create arbitrary number of simulated execution hosts. The hosts needs not exist as qmaster anyways won't try to send anything to it, but the hostname must be resolvable.
Optionally:
(5) If you care for scheduler runtimes set PROFILE=true in the params section of sched_conf(5) using qconf -msconf.
Now your simulated cluster is ready. You can send in arbitrary numbers of jobs. Due to (2) and (3) scheduler will dispatch them and send corresponding orders to qmaster. Qmaster will behave as if it would start the jobs, but it raise timers to ensure job state transitions are passed as used.
What won't work is interactive jobs (i.e. qrsh, qsh etc.) and parallel jobs with control_slaves set to true in sge_pe(5).
Jobs' runtime can be controlled via the first job argument. That means when
# qsub -b y /bin/sleep 5
is submitted, the job will finish after five seconds.
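If you want to feed a simulated cluster a mix of job lengths, a small helper like the following Python sketch could generate the submit commands. It only builds command lines modelled on the `qsub -b y /bin/sleep 5` example from the HOWTO and does not run anything. The function name and the choice of runtimes are my own assumptions.

```python
import shlex


def sleep_job_commands(runtimes_seconds):
    """Return one qsub argument list per requested runtime.

    Each command copies the quoted example `qsub -b y /bin/sleep N`, where
    the simulator is said to read the runtime from the first job argument.
    """
    commands = []
    for seconds in runtimes_seconds:
        if seconds < 0:
            raise ValueError("runtime must not be negative")
        commands.append(["qsub", "-b", "y", "/bin/sleep", str(seconds)])
    return commands


def as_shell_lines(commands):
    """Render the argument lists as shell-quoted lines for a script or log."""
    return [shlex.join(argv) for argv in commands]
```

Calling `as_shell_lines(sleep_job_commands([5, 30, 120]))` gives three lines you can paste into a shell on the simulated qmaster. Keeping the commands as argument lists also makes it easy to hand them to `subprocess.run` later if you decide to automate the submission.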
RHEL5.2/Centos5 kernel update may cause problems
This is a heads up for RedHat Enterprise Linux (RHEL) users as well as for users (like myself) of the various Centos variants.
There is a recent patch for RHEL that changes the inode data structure exposed to NFS clients from 32 bits to 64 bits in size. The basic summary of this issue is that many applications may not handle this change gracefully (there is already one such report involving the SGE Linux binaries).
RHEL and modern Centos users should probably pay attention to this issue (by subscribing as CC: contacts):
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2543
A RedHat bug report discussing the issue in more detail is here:
"Large inode number patch breaks applications"
https://bugzilla.redhat.com/show_bug.cgi?id=241348
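One quick way to see whether a given NFS mount is already handing out inode numbers that no longer fit in 32 bits is to look at what the client reports. The Python sketch below does only that. It does not reproduce the SGE problem or tell you whether a particular binary copes with large inodes, and the function names are made up for this example.

```python
import os

# Largest value that still fits in an unsigned 32-bit inode field.
MAX_32BIT_INODE = 2**32 - 1


def exceeds_32bit(inode_number):
    """True when the inode number needs more than 32 bits."""
    return inode_number > MAX_32BIT_INODE


def large_inode_entries(directory):
    """List (path, inode) pairs in `directory` whose inode exceeds 32 bits.

    Only the directory's immediate entries are checked, not subdirectories.
    """
    flagged = []
    with os.scandir(directory) as entries:
        for entry in entries:
            inode = entry.inode()
            if exceeds_32bit(inode):
                flagged.append((entry.path, inode))
    return flagged
```

Pointing `large_inode_entries()` at a directory on the NFS share used for SGE spooling or binaries would show whether any entries are affected. An empty list only means nothing in that one directory currently has a large inode number.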
mpiblast, SGE and MPICH2 integration
Matthias Neder has posted a quick summary of a tight MPICH2 integration that successfully handles his mpiblast application.
The summarized solution can be found here:
http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=24204
6.2 Bringing Significant Improvements to Cluster Queue Matching
Interesting writeup from Andreas Haas reproduced in full below ...
I thought this could be of interest for those who care for dispatching times. This maintrunk check-in
http://gridengine.sunsource.net/servlets/ReadMsg?list=cvs&msgNo=9814
will improve the matching times for set-ups where queue resource limits such as -l h_rt or -l h_vmem are criterion whether a job gets into a queue or not.
Before the above change we had an exponential growth of dispatching times
04/15/2008 11:48:01|schedu|es-ergb01-01|P|PROF: job dispatching took 0.030 s (20 fast, 0 comp, 0 pe, 0 res)
04/15/2008 11:48:23|schedu|es-ergb01-01|P|PROF: job dispatching took 0.130 s (40 fast, 0 comp, 0 pe, 0 res)
04/15/2008 11:48:50|schedu|es-ergb01-01|P|PROF: job dispatching took 0.630 s (80 fast, 0 comp, 0 pe, 0 res)
04/15/2008 11:49:26|schedu|es-ergb01-01|P|PROF: job dispatching took 3.210 s (160 fast, 0 comp, 0 pe, 0 res)
now growth is linear
04/15/2008 11:53:54|schedu|es-ergb01-01|P|PROF: job dispatching took 0.000 s (20 fast, 0 comp, 0 pe, 0 res)
04/15/2008 11:54:17|schedu|es-ergb01-01|P|PROF: job dispatching took 0.020 s (40 fast, 0 comp, 0 pe, 0 res)
04/15/2008 11:54:44|schedu|es-ergb01-01|P|PROF: job dispatching took 0.050 s (80 fast, 0 comp, 0 pe, 0 res)
04/15/2008 11:55:16|schedu|es-ergb01-01|P|PROF: job dispatching took 0.070 s (160 fast, 0 comp, 0 pe, 0 res)
also note this maintrunk check-in
http://gridengine.sunsource.net/servlets/ReadMsg?list=cvs&msgNo=9713
when profiling is enabled in sched_conf(5) like this
:
params PROFILE=true
:
the actual cause for exponential/linear dispatching times becomes fairly obvious: Without the above improvement the scheduler did check each single queue instance also in cases when the entire cluster queue was not suited.
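If you collect PROF lines like the ones quoted above from your own scheduler, a few lines of Python are enough to compare dispatch cost per job across runs. This is a rough sketch: the regular expression is written against the exact line format shown in this post, and the function names are my own.

```python
import re

# Matches the "job dispatching took <seconds> s (<n> fast" part of a PROF line.
PROF_RE = re.compile(r"PROF: job dispatching took (\d+\.\d+) s \((\d+) fast")

BEFORE = [
    "04/15/2008 11:48:01|schedu|es-ergb01-01|P|PROF: job dispatching took 0.030 s (20 fast, 0 comp, 0 pe, 0 res)",
    "04/15/2008 11:48:23|schedu|es-ergb01-01|P|PROF: job dispatching took 0.130 s (40 fast, 0 comp, 0 pe, 0 res)",
    "04/15/2008 11:48:50|schedu|es-ergb01-01|P|PROF: job dispatching took 0.630 s (80 fast, 0 comp, 0 pe, 0 res)",
    "04/15/2008 11:49:26|schedu|es-ergb01-01|P|PROF: job dispatching took 3.210 s (160 fast, 0 comp, 0 pe, 0 res)",
]

AFTER = [
    "04/15/2008 11:53:54|schedu|es-ergb01-01|P|PROF: job dispatching took 0.000 s (20 fast, 0 comp, 0 pe, 0 res)",
    "04/15/2008 11:54:17|schedu|es-ergb01-01|P|PROF: job dispatching took 0.020 s (40 fast, 0 comp, 0 pe, 0 res)",
    "04/15/2008 11:54:44|schedu|es-ergb01-01|P|PROF: job dispatching took 0.050 s (80 fast, 0 comp, 0 pe, 0 res)",
    "04/15/2008 11:55:16|schedu|es-ergb01-01|P|PROF: job dispatching took 0.070 s (160 fast, 0 comp, 0 pe, 0 res)",
]


def parse_dispatch(lines):
    """Return (jobs_dispatched, seconds) pairs for every matching PROF line."""
    samples = []
    for line in lines:
        match = PROF_RE.search(line)
        if match:
            samples.append((int(match.group(2)), float(match.group(1))))
    return samples


def seconds_per_job(samples):
    """Dispatch time divided by the number of jobs dispatched in that run."""
    return [seconds / jobs for jobs, seconds in samples]
```

Run over the two lists, the "before" per-job cost climbs with every doubling of the job count, while the "after" cost stays below a millisecond per job throughout.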
6.1 leak found; schedd_job_info is not your friend
Anyone interested in the memory leak that has been bothering some 6.1 users should check out the comments associated with Issue #2464:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464
Among the interesting things you'll see are:
- A great example of motivated SGE users and developers working together to track down a hard to find problem
- Interesting comments on the potential "unfixible" (my words) nature of the schedd_job_info messages
- A really cool workaround for getting job scheduler messages with schedd_job_info=FALSE
In a nutshell, there is a problem in the schedd_job_info framework that can cause massive resource utilization on the qmaster machine. This happens in particular on larger systems or places with large numbers of queue instances. This can also pop up on systems with jobs that are pending due to un-fulfillable resource requests. This explains why I saw the memory leak on my small testbed cluster -- I have a number of "pend forever" jobs in the queue for demonstration purposes.
The fix is to disable schedd_job_info. This is potentially problematic though as that feature is pretty much my goto-first action for troubleshooting job dispatch problems.
However, in a recent update comment to this issue, Andreas added a possible tip for getting scheduling messages about a job in a way that puts far less load on the system AND does not require schedd_job_info=TRUE:
qalter -w v
Remember though that comments found in a bug report are not "gospel" so don't read this as news that schedd_job_info is forever broken or going away. Expect to see this and other issues discussed as part of the SGE Roadmap. You are attending the May 2008 SGE Workshop, right?
Release 6.1u4 is out
Congratulations to the SGE developer team!
Big news today -- 6.1u4 was just announced, and it will hopefully address some persistent issues people have been having with the previous releases. The plaintext list of fixed issues can be found here:
http://gridengine.sunsource.net/project/gridengine/61patches.txt
The full announcement is here:
http://gridengine.sunsource.net/news/GE61u4-announce.html
I've been unable to keep 6.1u3 running consistently on a small test system, probably due to the same memory leak others have been reporting. There is a chance that a subtle leak still exists or at least has not been fully tracked down in 6.1u4 but multiple people are working diligently on this. Best bet is to monitor the users mailing list to see the feedback.
Summer 2008 SGE Training Workshops
Hi folks. It's done. I've made a personal and financial commitment to organize a regularly occurring series of Grid Engine Training Workshops, starting in the Cambridge, Massachusetts area. This is a Darwinian test to see if my thoughts about the size and needs of the Grid Engine community are true. With existing SGE training opportunities only scheduled 1-2 times per year, the size of the potential audience for these sorts of events is still an open question.
Course details and updates will always be here:
http://blog.bioteam.net/category/training/
Obviously my employer has a bit of a commercial/profit motive here but I've been the one pushing to make this happen. Consider this a bit of market research to see if BioTeam should invest in growing the number of staff capable of providing SGE related training, professional services and support to the community at large.
I welcome your comments and feedback and would appreciate any assistance in spreading the word about these events.
Thanks! -- Chris.

