Building Open Grid Scheduler 2011.11 on CentOS or RedHat 6

Posted by – January 12, 2012

Quick link back to a blog post I just wrote on behalf of my corporate overlords…

I described, pretty exhaustively, the step-by-step process for building Grid Engine 2011.11 from source on CentOS 6.2 or RHEL 6.2. This method builds almost all of SGE without any of the “disable just about everything” shortcuts.

Short link:
http://biote.am/6y

Long link:
http://bioteam.net/2012/01/building-open-grid-scheduler-on-centos-rhel-6-2/

Feedback welcome.

Old versions of gridengine and new versions of Linux

Posted by – January 11, 2012

In a mailing list thread, a user asks whether an older version of SGE will remain functional through Linux OS updates. Rayson replies with one of the more concise summaries I’ve seen of “what could cause problems” when older versions of Grid Engine meet the newest Linux distros and kernel versions:

My main concern is that I want to keep the OS up to date. We are
currently on RHEL 5.3 and at very least I want to update it to the
latest in the 5 stream. So if I can update the OS without worry of
breaking the current Grid Engine install, I will be happy.

RHEL 5.7 is mostly binary compatible with 5.3, the missing things are
the bug fixes.

SGE does not use secret system calls or hooks to get information from
the kernel. SGE parses the /proc filesystem to fetch most of the
low-level system information on Linux (on AIX it is a different story,
but GE 2011.11 fixed this problem as well on AIX), and if your SGE
6.2u5 installation is working on your machines, then it will (or may
be I should say “should”) continue to work after the OS upgrade.

A few things that can break SGE: Linux 3.0 (the kernel version string
difference breaks an SGE shell script), AMD Opteron 6100 series
(Magny-Cours) or 6200 series (Bulldozer/Interlagos …) – if you use
core binding, or memory expansion.

Note that all those issues are fixed in later versions of SGE, but if
you are not doing any of the above upgrades, then SGE 6.2u5 should run
happily.
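To illustrate the kind of breakage a kernel version string can cause, here is a minimal sketch of a shell check that only recognizes 2.x kernels. The function names and patterns are hypothetical and are not copied from SGE's own scripts; they only show why a string such as `3.0.0` can fall through a `case` statement written when every supported Linux kernel was 2.4 or 2.6.

```shell
#!/bin/sh
# Hypothetical illustration, not SGE's actual code: a version check
# written when all supported Linux kernels reported 2.4.x or 2.6.x.
classify_kernel_old() {
    case "$1" in
        2.[46].*) echo "supported" ;;
        *)        echo "unsupported" ;;
    esac
}

# A more tolerant variant that also accepts 3.x and later major versions.
classify_kernel_new() {
    case "$1" in
        2.[46].*|[3-9].*|[1-9][0-9].*) echo "supported" ;;
        *)                             echo "unsupported" ;;
    esac
}

classify_kernel_old "2.6.32-220.el6.x86_64"   # prints: supported
classify_kernel_old "3.0.0-12-generic"        # prints: unsupported
classify_kernel_new "3.0.0-12-generic"        # prints: supported
```

On a live host the argument would typically be `"$(uname -r)"`.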

Pulling Project name out of an epilog script

Posted by – January 11, 2012

A user asked how to get a job’s Project name into the context of an epilog script, and Reuti nails it with another nice code snippet. I sorta love epilog stuff that scrapes the contents of $SGE_JOB_SPOOL_DIR; there is lots of good info in there…

And there is no special variable for the project?

Aha, you are referring to the project which was assigned already to this job at submission time. Sorry, got it in the wrong way.

There is no direct variable, but you can check:

$ PRJ=$(grep "^acct_project=" "$SGE_JOB_SPOOL_DIR/config")
$ echo "${PRJ#*=}"

inside the epilog.
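As a slightly fuller sketch of the same idea, the lookup can be wrapped in a small function that an epilog script calls. The function name `job_project` and the log message are made up for illustration; the only assumption taken from the thread is that the job's spool directory holds a `config` file with an `acct_project=` line.

```shell
#!/bin/sh
# Sketch of an epilog helper. Assumes "$1/config" contains a line of the
# form "acct_project=<name>", as in Reuti's snippet.
job_project() {
    config="$1/config"
    [ -r "$config" ] || return 1
    # Print the value after "acct_project=" (first match only).
    sed -n 's/^acct_project=//p' "$config" | head -n 1
}

# Inside a real epilog, SGE_JOB_SPOOL_DIR is provided by the execution host.
if [ -n "${SGE_JOB_SPOOL_DIR:-}" ]; then
    project=$(job_project "$SGE_JOB_SPOOL_DIR") || project="unknown"
    echo "job ${JOB_ID:-?} ran under project: $project"
fi
```

Outside of a running job the final block does nothing, so the function can be tried by hand against any directory that contains a `config` file.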

Reconciling exclusive host access with the SGE accounting file

Posted by – January 11, 2012

In this mailing list thread there is an interesting discussion about how to correctly account for all of the Job Slots that are blocked off or made inaccessible by users submitting “exclusive” jobs that take over an entire execution host. Stuff like this becomes more and more important in the era of commodity boxes with core counts climbing beyond 48 …

I defined an exclusive tag in complex resources so that users can request “-l excl=1”.
Such users lock the free slots of hosts that have excl=true added to their consumable resources.

1) In this case, how can I get accounting for all of the slots that a user causes to be locked on hosts?

The accounting file does not record where the job ran; there is no “pe_hostfile”-like record, so the blocked slots are not recorded either. If you need this, you can use a global (or queue) prolog that runs under, e.g., the sgeadmin account by prefixing it with the account name (see `man queue_conf` or `man sge_conf`) and copies the “pe_hostfile”, which holds the slot distribution, to a central place. Whether exclusive access was requested is already recorded in the accounting file (like all -l resource requests).

With this information, you can compute the real blocked cores then.

– Reuti
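A rough sketch of the prolog idea follows. It assumes that `PE_HOSTFILE` and `JOB_ID` are visible in the prolog's environment and that the second column of each pe_hostfile line is the slot count for that host. The function names, the `ARCHIVE_DIR` variable and its default path are hypothetical.

```shell
#!/bin/sh
# Sketch of a prolog helper that archives a job's pe_hostfile.
# Arguments: <pe_hostfile path> <job id> <archive directory>
archive_pe_hostfile() {
    hostfile="$1"; job_id="$2"; archive_dir="$3"
    [ -r "$hostfile" ] || return 0          # no hostfile: nothing to copy
    mkdir -p "$archive_dir" || return 1
    cp "$hostfile" "$archive_dir/pe_hostfile.$job_id"
}

# Sum the slots granted across all hosts in one archived hostfile.
# Assumes column 2 of every line is the slot count for that host.
granted_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# ARCHIVE_DIR is a hypothetical central location writable by the
# account the prolog runs under.
if [ -n "${PE_HOSTFILE:-}" ] && [ -n "${JOB_ID:-}" ]; then
    archive_pe_hostfile "$PE_HOSTFILE" "$JOB_ID" "${ARCHIVE_DIR:-/tmp/pe_hostfiles}"
fi
```

Combining the archived host list with each host's core count and the `excl` request in the accounting file is then enough to work out how many cores a job really blocked.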

multiple queues without host oversubscription

Posted by – January 11, 2012

Another good tip from the mailing list:

Is there a way to assign node to multiple queues but not make the
node over subscribed? Are there any tips or reference site with the
information?

As Reuti said, one option is an RQS (resource quota set). I stole this
one from somewhere (my cluster has hosts with differing numbers of CPU
cores):


{
    name         limit_slots_to_cores_rqs
    description  Prevents core oversubscription across queues.
    enabled      TRUE
    limit        hosts {*} to slots=$num_proc
}
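One way to load a rule set like this, sketched under the assumption that you have manager rights and `qconf` on your PATH, is to write it to a file and add it with `qconf -Arqs`. The helper name `write_rqs_file` and the temporary file path are made up for illustration.

```shell
#!/bin/sh
# Write the quota set shown above to a file so it can be loaded with qconf.
# The quoted 'EOF' keeps $num_proc literal instead of expanding it.
write_rqs_file() {
    cat > "$1" <<'EOF'
{
    name         limit_slots_to_cores_rqs
    description  Prevents core oversubscription across queues.
    enabled      TRUE
    limit        hosts {*} to slots=$num_proc
}
EOF
}

rqs_file="${TMPDIR:-/tmp}/limit_slots_to_cores_rqs.$$"
write_rqs_file "$rqs_file"

# Only talk to the qmaster when qconf is actually available.
if command -v qconf >/dev/null 2>&1; then
    qconf -Arqs "$rqs_file"                 # add the rule set from the file
    qconf -srqs limit_slots_to_cores_rqs    # print it back to verify
fi
rm -f "$rqs_file"
```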

sharetree usage data via command line

Posted by – January 11, 2012

Yet another SGE feature I had no idea about!
From the mailing list:

Getting share tree usage from the command line?

The short answer is that a “sge_share_mon” binary actually exists. I plan on checking this out; it would be nice if it had an XML output option as well.

Web GUIs for Grid Engine

Posted by – January 11, 2012

In this thread on “Web GUIs” the following SGE front ends are discussed (besides xml-qstat).

Gaussian & LINDA integration with GridEngine

Posted by – January 11, 2012

Reuti answers a question about integrating the LINDA parallel environment used by Gaussian:

We’ve purchased a license for Gaussian with support for parallelism
via Linda. A quick google doesn’t show up any tight integrations for
Linda/SGE.
i)Does anyone have a working tight integration config for Linda? Or
even a Gaussian specific one?

This has come up several times on the list. You can use the one for G03-D.01:

http://markmail.org/message/sgg3dhngpx75i45u

The name you specify in linda8.2/opteron-linux/bin/linda_rsh to replace the full path is just a name. It could even be foobar and you create a link called foobar by start_proc_args in $TMPDIR. The final communication method is the one defined in SGE’s qrsh_startup/qrsh_daemon. It’s just rsh to make it short.

– Reuti

Long Overdue

Posted by – November 18, 2011

You are looking at the same old blog with a new foundation. After years of dealing with the hassle of the Typo blogging engine I’ve finally migrated the site over to the WordPress blogging platform. The blog does not look much different yet as the ‘scribish’ theme is available on both platforms.

The move to WordPress will stop the Ruby and Rails-related memory leaks that were causing this site to fall over all the time. I’m keeping a careful eye on the error logs so I can find and catch any permalink, RSS or XML related errors. Hopefully with enough work I can put in rewrite or redirect rules so that any link that worked on the “old” gridengine.info will continue to work. If you notice anything odd or want to highlight something that is broken please feel free to drop me a line at ‘dag@sonsorol.org’.

Bring the Pain (SGE 6.2u5+ on Mac OS X 10.6.7)

Posted by – March 30, 2011

{crossposting to bioteam & gridengine.info }

The entire point of this post is to show a modern, free (non-Oracle) version of Grid Engine running on a 64-bit Mac OS X system running OS X 10.6.7…

The screencaps are from “Grid Engine 8.0.0alpha” taken from the Univa-controlled fork hosted over at github.

This is a side effect of my efforts to get closer to the SGE source and get back into the swing of being able to (relatively) easily build binaries from the source tree when needed.

Eventually my personal “courtesy binaries” for Grid Engine will all end up here:

http://biote.am/sge

Realistically speaking I think BioTeam will try to maintain binaries for 32-bit and 64-bit versions of Linux as well as 64-bit versions for the shrinking world of Mac OS X based cluster systems running Grid Engine.

Running Grid Engine on a Mac is pretty much a losing proposition in 2011 now that Apple has clearly made the switch to becoming a consumer electronics company. As an enterprise operating system, server or compute node it’s a choice now only for legacy systems and perhaps engineering nerds like myself who use Mac laptops as workstations and like to have our “tools” running locally…

This particular OS X build actually went pretty smoothly. I managed to get all components to compile, including the X11 ‘qmon’ GUI and the Java-based GUI installer. I had to build with “-no-secure” though, which removes the SSL-based “CSP Secure Mode” installation option. I’ll keep trying to get the integrated SSL stuff to work but it’s not a dealbreaker; in many years I think I’ve only found one or two sites that actually operate SGE with certificate-enabled CSP mode.

OS X binaries for Grid Engine have not been uploaded to the http://biote.am/sge site yet, as I want to test on a different Mac than the build host just to make sure nothing will bomb out due to missing dependencies.

All the usual problems of running SGE on 10.6+ based systems apply though. I had to use the method described in my 2010 blog post to actually get the daemons running.

[Screencap: OS X SGE 1]

[Screencap: OS X SGE 2]