Category: Installation

Installing on Mac OS X

Posted by – February 7, 2010

Over at this link:


http://blog.bioteam.net/2010/02/07/grid-engine-6-2-on-mac-os-x/

… I’ve posted an article and an accompanying 7-minute recorded screencast showing how to manually install SGE 6.2u5 on a Mac OS X Server system. The test system in the video was running 10.5.8, but the same methods are known to work on Snow Leopard systems as well.

Windows 2008 R2, SAMBA PDC and “HOST_NOT_RESOLVABLE”

Posted by – December 22, 2009

This is a quick mailing list hit to mention that for Windows users experiencing HOST_NOT_RESOLVABLE errors due to domain binding issues, the Windows registry key:



HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Tcpip\Parameters\NVDomain


… might be a route to resolving the issue.

Full mailing list thread is here.

SGE 6.2 and Windows XP Writeup

Posted by – May 4, 2009

Abraham Agay from the Hebrew University of Jerusalem has posted a wiki page with detailed notes taken during the process of installing Grid Engine 6.2u2 and configuring a Windows XP Professional execution host via installation of SFU 3.5.

Abraham’s notes are here:

http://shum.huji.ac.il/~agay/sge/blog.cgi?notes

… I’m going to try to follow these instructions (and take screenshots) via my virtualized CentOS 5.3 and Windows XP guest VMs that live on my laptop. Stay tuned.

SGE 6.2 Screencasts

Posted by – March 30, 2009

Lubomir has a few screencasts up on his blog. The first one covers the process of preparing things for the new Java-based GUI installer and the second covers the GUI installation process itself.

Screencast: live install of SGE6.2 beta

Posted by – May 16, 2008

Truthfully speaking, after taking an overnight flight back to Boston from California I really was in poor shape to actually get any real work done today.

I’ve recorded my experience installing the fresh release of SGE 6.2beta on my laptop. The video screencast itself is hosted over at a BioTeam site — it’s only fair because BioTeam is paying for the hosting costs as well as the screencast recording software!

The video screencast is linked off of this blog post:

http://blog.bioteam.net/2008/05/16/sge-62beta-unboxing-screencast/

I am still on the fence about whether this screencast stuff is actually useful. Maybe it’s all just web-2.0 style style-over-substance crap. Comments are appreciated and will help me figure out how much effort to put into video content vs. straight-up blog or technical writing.

Installing Grid Engine on Windows systems

Posted by – January 14, 2008

In the past few months there has been a marked rise in the number of people seeking assistance with getting Grid Engine installed and running on Windows systems (of course, Windows systems can only be execution hosts and clients). I’ve long been meaning to experiment with this myself with an eye towards documenting the process. Fortunately though, Beat Rubischon posted an excellent writeup online (covering both 32-bit and 64-bit EM64T Windows) along with a note to the SGE list that included this great qhost output:

[root@master ~]# qhost
HOSTNAME           ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
--------------------------------------------------------------------------
global             -               -     -       -       -       -       -
master             lx24-amd64      8  0.93    2.0G  422.0M    2.0G  136.0K
winnode01          win32-EM64T     2  0.10  383.4M  220.2M  929.6M  183.7M
winnode02          win32-x86       2  0.37  255.5M  148.8M  618.1M  140.0M

Beat’s PDF formatted writeup along with a tarball of changes he had to make can be found here:

http://www.0x1b.ch/misc/papers/sge/

Installing ARCo on x64 Linux with Blackdown Java JVM

Posted by – December 7, 2006

Todd was trying to install the Accounting and Reporting Console (ARCo) for an Opteron cluster, and got the error message:

Java setup
----------
We need at least java 1.4.1

Please enter the path to your java installation [] >> /opt/j2re1.4.2

ERROR: This java version does not support 64-bit native libraries,

       The use of libdrmaa.so from the lx24-x86 binaries would be 
       possible, but the packages are not installed.

       Please install a 64-Bit java version or the N1GE 32-bit
       binary packages for the architecture lx24-x86!

The fix is to hack the “inst_dbwriter” script to remove the “-d64” flag which is not supported by Blackdown Java.
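For anyone who would rather script that edit than do it by hand, here is a minimal sketch. It assumes the flag shows up as a standalone “-d64” token; the sample line is hypothetical and is not copied from the real inst_dbwriter script, so check what your copy actually contains first.

```python
import re

def strip_flag(script_text: str, flag: str = "-d64") -> str:
    """Remove every standalone occurrence of `flag` from a script's text."""
    # Only match the flag when it is delimited by whitespace or a line edge,
    # so that a longer option such as "-d64bit" is left untouched.
    pattern = re.compile(r"(?<!\S)" + re.escape(flag) + r"(?!\S)[ \t]*")
    return pattern.sub("", script_text)

# Hypothetical line for illustration; the real script may look different.
sample = "$JAVA_HOME/bin/java -d64 -version\n"
print(strip_flag(sample), end="")
```

Keep a backup copy of the original script before writing the modified text back to disk.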

SGE gets registered IANA port numbers

Posted by – September 19, 2006

Looks like we should “officially” be using ports 6444 and 6445 now. Expect to see these values propagate into an /etc/services file near you.

From http://www.iana.org/assignments/port-numbers:

sge_qmaster	6444/tcp   Grid Engine Qmaster Service
sge_qmaster	6444/udp   Grid Engine Qmaster Service
sge_execd	6445/tcp   Grid Engine Execution Service
sge_execd	6445/udp   Grid Engine Execution Service

Discovered when Andreas posted Issue 2099. Thanks Andreas!
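If you want to check whether a given services file already carries these entries, a small parser is enough. This is only a sketch: parse_services is a made-up helper name, and the sample text simply repeats the IANA lines quoted above.

```python
def parse_services(text: str) -> dict:
    """Map (service, protocol) -> port from /etc/services-style lines."""
    ports = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        port, proto = fields[1].split("/", 1)
        if port.isdigit():
            ports[(fields[0], proto)] = int(port)
    return ports

IANA_LINES = """\
sge_qmaster 6444/tcp   Grid Engine Qmaster Service
sge_qmaster 6444/udp   Grid Engine Qmaster Service
sge_execd   6445/tcp   Grid Engine Execution Service
sge_execd   6445/udp   Grid Engine Execution Service
"""

ports = parse_services(IANA_LINES)
print(ports[("sge_qmaster", "tcp")], ports[("sge_execd", "tcp")])
```

To check a real machine, pass in the contents of its services file instead of IANA_LINES and look for the same four keys.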

Things to think about before installing Grid Engine

Posted by – September 29, 2005

If you are reading this post, you should also be familiar with the official Grid Engine 6 Installation Guide.

In the official install guide, there is a section called “Before you install the Software”. It gives a nice tabular view of the decisions you’ll have to make during the install and explains each option briefly.

Disclaimer: This is shaping up to be one of those “Chris is injecting lots of his personal opinions into what should be a straightforward technical document…” posts. This is not an official document, it’s just something you found on the internet written by some guy you probably don’t even know, OK? Drop me a line to correct any mistakes I’ve made or to make me aware of something that I’ve totally missed.

Things you need to care about


hostnames and DNS



Grid Engine is pretty sensitive to hostnames and name resolution issues. Grid Engine likes DNS and it likes to do both forward (query by hostname) and reverse (query by IP address) DNS queries. Bad things will happen if the reverse lookup does not sync with the forward lookup. Even worse things happen when your /etc/hosts file(s) contain mistakes or typos.

Tip for Apple people

Just because Mac OS X lets you put spaces and funky capitalization into your hostname does not mean that this is a good thing to do. Your qmaster machine does not need to be called “j0ez fuNky Xserve”. Feel free to do whatever you want with the computer name as it applies to “bonjour” network sharing, but keep the core system hostname something reasonable. Grid Engine and other Unix-ish bits under the hood of your OS X system will thank you for it. Actually, now that I’m on this topic, use the same conservative naming approach for XRaid storage arrays and local disk partitions.

A good test after unpacking the Grid Engine distribution files (but before beginning the installation process) is to run some of the utility binaries to see what Grid Engine will “see” concerning your local environment. They can be found in the “utilbin/” directory. In particular you want to run the utilities called “gethostname” and “gethostbyaddr” and make sure that they are reporting good information.

Here is an example run on a test machine. The Linux “hostname” command is run, then the SGE utilities gethostname and gethostbyaddr are run to confirm that everything is consistent all the way through:

[root@dcore-amd sge-6s2u1]# hostname
dcore-amd.sonsorol.net

[root@dcore-amd sge-6s2u1]# /opt/sge-6s2u1/utilbin/lx26-amd64/gethostname 
Hostname: dcore-amd.sonsorol.net
Aliases:  dcore-amd 
Host Address(es): 66.92.70.152 

[root@dcore-amd sge-6s2u1]# /opt/sge-6s2u1/utilbin/lx26-amd64/gethostbyaddr 66.92.70.152
Hostname: dcore-amd.sonsorol.net
Aliases:  dcore-amd 
Host Address(es): 66.92.70.152 
[root@dcore-amd sge-6s2u1]# 

It is not required, but certainly easier, if your qmaster machine or cluster “portal” has a valid DNS entry. Your IT organization will know how to do this. Make sure they give you a static IP address as well!
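The same forward-then-reverse sanity check can be roughed out in a few lines of Python if the SGE utilities are not unpacked yet. Treat this as a sketch, not a replacement for gethostname and gethostbyaddr: check_name_resolution is a made-up helper, and the stand-in resolvers below just replay the example host from the transcript so the demo runs without touching DNS.

```python
import socket

def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Forward-resolve hostname, reverse-resolve the result, compare the names."""
    try:
        address = forward(hostname)
        reverse_name, aliases, _ = reverse(address)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        return False, f"lookup failed: {exc}"
    names = {reverse_name.lower(), *(alias.lower() for alias in aliases)}
    if hostname.lower() in names:
        return True, f"{hostname} <-> {address}"
    return False, f"{hostname} -> {address} -> {reverse_name}"

# Stand-in resolvers built from the example host above (no network needed).
fake_forward = {"dcore-amd.sonsorol.net": "66.92.70.152"}.__getitem__

def fake_reverse(address):
    return ("dcore-amd.sonsorol.net", ["dcore-amd"], [address])

print(check_name_resolution("dcore-amd.sonsorol.net", fake_forward, fake_reverse))
```

On a real machine, call check_name_resolution(socket.getfqdn()) with the default resolvers and run it on the qmaster and on a couple of compute nodes.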

Consistent username, UID and GID values

Regardless of how you plan to do user authentication (LDAP, NIS, NetInfo or local files) the key requirement is consistency. Make sure that all your users exist on all the nodes and that each user has unique and consistent UID/GID values.
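A quick way to spot inconsistencies is to collect passwd-style listings from each node and compare them. The sketch below assumes you have already gathered that text somehow (for example the output of “getent passwd” on each node); the node names and accounts are invented for the example.

```python
def parse_passwd(text):
    """Return {username: (uid, gid)} from passwd(5)-style text."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue
        entries[fields[0]] = (int(fields[2]), int(fields[3]))
    return entries

def find_mismatches(passwd_by_node):
    """Report users missing on some node or carrying different UID/GID values."""
    parsed = {node: parse_passwd(text) for node, text in passwd_by_node.items()}
    all_users = set().union(*(entries.keys() for entries in parsed.values()))
    problems = {}
    for user in sorted(all_users):
        seen = {node: entries.get(user) for node, entries in parsed.items()}
        if len(set(seen.values())) > 1:  # None marks a node where the user is missing
            problems[user] = seen
    return problems

# Hypothetical nodes and accounts, for illustration only.
nodes = {
    "node01": "alice:x:1001:1001::/home/alice:/bin/bash\n"
              "bob:x:1002:1002::/home/bob:/bin/bash\n",
    "node02": "alice:x:1001:1001::/home/alice:/bin/bash\n"
              "bob:x:1005:1002::/home/bob:/bin/bash\n",
}
print(find_mismatches(nodes))
```

An empty result means every user found on any node exists everywhere with the same UID and GID.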

Shared filesystem options

If you plan to install into a shared NFS filesystem, make sure the server is not exporting the filesystem with options that block the root user or remap the root UID value to a non-privileged value. Grid Engine can run as a non-root user but it needs to be started by root. There are also setuid binaries in the distribution that will break if root-squashing is enabled. Most people run shared NFS cluster filesystems over a private network subnet or VLAN, making issues of NFS security less of a concern.
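On a Linux NFS server, root squashing is the default unless an export explicitly says “no_root_squash”, so it is worth checking the export line for the filesystem that will hold $SGE_ROOT. The sketch below assumes Linux exports(5) syntax; other NFS servers use different option names, and the path and subnet are placeholders.

```python
def export_squashes_root(export_line):
    """True if any client on a Linux exports(5)-style line gets root remapped."""
    fields = export_line.split("#", 1)[0].split()
    clients = fields[1:]
    if not clients:
        return True  # no client list: defaults apply, and the default is root_squash
    for client in clients:
        _, _, option_text = client.partition("(")
        options = option_text.rstrip(")").split(",")
        if "no_root_squash" not in options:
            return True
    return False

# Placeholder path and subnet, for illustration only.
print(export_squashes_root("/opt/sge 10.0.0.0/24(rw,sync)"))
print(export_squashes_root("/opt/sge 10.0.0.0/24(rw,sync,no_root_squash)"))
```

This only looks at the one option; it does not try to model every squashing setting an export can carry.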

Classic Spooling vs. Berkeley-DB Spooling

This may deserve a post/article all by itself. The hard thing about this decision is that the spooling method is one of the few Grid Engine things that you CAN’T change without doing a complete reinstallation. The official documentation makes it clear that Berkeley-based spooling gives better “performance” but it does not explain the downsides in enough detail. It also does not make it clear that many people (especially those with small clusters) generally will not notice a difference between the two spooling methods.

The argument for using Berkeley spooling is pretty clear – it’s what the developers are concentrating on for future development and it is faster than classic mode. The downside to Berkeley spooling is somewhat understated – when you choose Berkeley spooling you are also giving up a key fault tolerance feature of Grid Engine. Currently, Berkeley spooling limits you to storing your spooldir either on a local non-NFS filesystem or on a remote spooling server. Sadly, only one remote RPC spooling server is allowed, so the RPC host becomes a potential single point of failure. The RPC argument is a bit of a stretch though, as eventually this will be fixed, and one of the reasons for choosing Berkeley databases in the first place is so the Grid Engine developers could leverage the Berkeley community and codebase for things like database failover and remote replication. Not reinventing the wheel is a good thing. The real hassle with Berkeley spooling in my mind is losing the wonderful plaintext ASCII configuration and state files that can be so easily read, backed up, understood and even (in emergencies!) directly edited by hand.

The simple truth for me is that the benefit of having “faster” spooling is not worth having all my critical state and configuration data stored in binary form. Your requirements could be completely different. For instance, if you foresee having to run “qsub” to submit 150 jobs per second to the grid, then you probably want Berkeley spooling, as this interesting mailing list thread points out.

My recommendation is this – if you are just starting out with Grid Engine, use classic spooling. If your cluster has fewer than 20 nodes, use classic spooling. Once you have the system up and running for a while you’ll easily be able to tell if your standard sorts of workload and workflows are being affected by spool performance. By that time, you’ll be comfortable enough with Grid Engine that you’ll have no trouble backing up your configuration and reinstalling with Berkeley spooling enabled.

Things you don’t really need to stress out about


Grid Engine ‘Cells’

Don’t worry about cells at install time. Don’t worry about cells ever. All cells allow one to do is “run multiple grid engine instances off of the same set of binaries”. Wow. This could be a holdover from the “Codine” product days when disk was expensive and fileserver space was a rare commodity. Or possibly it is still useful now when one only has access to a single high performance shared storage system. I’m not knocking the cell concept as much as I’m trying to make the point that most people will never make use of more than one cell. The people who need cells know who they are, and the rest of us can continue on using “default”.

Think of the cell simply as “the directory inside my $SGE_ROOT where the system stores all my site specific and unique stuff”. The default installation suggestion of using “default” as the cell name works perfectly fine and it just means that the path to your site-specific startup files etc. is going to be $SGE_ROOT/default/…

Shadow hosts, scheduler tuning profile

These things are easily configured after installation. No need to stress out about them or make any set decisions. The same goes for just about every question you are asked during installation. The only thing you really can’t change after install is the spooling method. If you don’t know (or don’t care) about a particular install-time issue, just accept what the installation script offers as a default. It can’t hurt and you can easily change it later.

Things I wish someone had told me about


The automatic install scripts are not worth dealing with on small clusters

The problem with the automatic install scripts is that the template file must be 100% correct (I never get it correct the first half-dozen attempts) and that when things go wrong, they go wrong almost silently. There is no good debug output except for the messages that may or may not get logged to /tmp on the compute node. Your best bet for dealing with automated install script issues is to edit the inst_sge script to change the first line from “#!/bin/sh” to “#!/bin/sh -x” so that it runs with verbose debug output. The next best thing is to ask someone who has a working template file to share. You can then edit this template to match your specific needs and it may actually work the first or third time you try to run it.
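The shebang change is a one-line edit, but if you rebuild clusters often it can be handy to script it. A minimal sketch, assuming the script really does start with a plain “#!/bin/sh” line (enable_shell_trace is a made-up helper name, and the sample script body is invented):

```python
def enable_shell_trace(script_text):
    """Return script_text with '-x' added to a plain '#!/bin/sh' first line."""
    first, newline, rest = script_text.partition("\n")
    if first.strip() == "#!/bin/sh":
        first = "#!/bin/sh -x"
    # Any other first line (including one that already has -x) is left alone.
    return first + newline + rest

# Invented two-line script, just to show the effect.
print(enable_shell_trace("#!/bin/sh\necho installing\n"), end="")
```

Work on a copy of inst_sge and put the original back once the template file behaves.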

For clusters smaller than 30 nodes in size (where I already have passwordless SSH access set up) it is actually quicker for me to manually log into each node and invoke the “./install_execd” script by hand. For larger systems, or systems where I want to have an automated cluster setup/teardown process I have a cache of “known good” template files that I can modify to suit the local setup.

The host_aliases file

When Grid Engine first starts up, the qmaster node writes what it believes to be its hostname to a file located in $SGE_ROOT/(cell)/common/act_qmaster. In many cases, the hostname that gets written to the act_qmaster file is the fully qualified public hostname of your cluster master node. In many of the same cases, this public hostname MAY NOT be the same as what your compute nodes use to speak with the same machine.

Imagine the following scenario: A cluster “portal” node with 2 network cards. One card is connected to the internet or an institutional network so that people can actually connect to the system. A DNS lookup on the machine’s hostname will return this “public” IP address. The other network card has a different IP address and is connected to a private cluster network where all the compute nodes can be found. The problem occurs whenever a sge_execd daemon starts up on a compute node. The first thing sge_execd will do is read the act_qmaster file in order to learn which host it needs to connect to and register with. However, this presents a problem because “act_qmaster” contains the PUBLIC hostname of the qmaster machine. The sge_execd daemon is going to do a DNS lookup on the public hostname and will then try to connect to the IP address associated with that public hostname, rather than the private address that the compute nodes can actually reach.

In simple terms, the SGE host_aliases file allows you to remap the hostname or IP that your compute nodes are going to try to connect to when trying to join the cluster or speak to the qmaster process. This is particularly useful on machines with more than one IP address and hostname.
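To make the remapping concrete, here is a small sketch of how such a file can be read. It assumes the layout described in the host_aliases(5) man page as I understand it: one line per host, with the name to use first and the names to be remapped after it. The hostnames are invented, and this illustrates the idea only; it is not Grid Engine’s own lookup code.

```python
def parse_host_aliases(text):
    """Return {alias: primary_name}; the first name on each line is the primary."""
    mapping = {}
    for line in text.splitlines():
        names = line.split("#", 1)[0].split()
        if len(names) < 2:
            continue
        primary = names[0]
        for alias in names[1:]:
            mapping[alias.lower()] = primary
    return mapping

def resolve_alias(name, mapping):
    """Return the primary name for an alias, or the name itself if unmapped."""
    return mapping.get(name.lower(), name)

# Invented hostnames: the private name comes first, the public name is the alias.
HOST_ALIASES = "portal-private portal.example.org\n"

aliases = parse_host_aliases(HOST_ALIASES)
print(resolve_alias("portal.example.org", aliases))
```

With a line like that in place, the public name found in act_qmaster is treated as the private name, which the compute nodes can resolve to an address on the cluster network.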

The sge_aliases file

I use this far less frequently than host_aliases but it still comes in handy. Simply put, this file allows path aliasing.

One example of where this comes in handy is on Apple Mac OS X clusters that use externally attached RAID storage arrays. From a user’s perspective both within the command-line environment and the GUI, their home directory is something like “/Users/username”, but when Grid Engine checks this path it sees something along the lines of “/Volumes/XRAID/Users/username” and this path mismatch can cause problems when trying to run jobs. The sge_aliases file makes it easy to tell Grid Engine that the path “/Users/*” is functionally the same thing as the path “/Volumes/XRAID/Users/*”.
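As a rough illustration of what path aliasing does, the sketch below applies rules in the four-column layout that, as far as I recall, the sge_aliases(5) man page describes (source path, submit host, execution host, replacement, with “*” matching any host). Applying only the first matching rule is an assumption of this sketch, not a statement about Grid Engine’s exact behavior.

```python
def parse_sge_aliases(text):
    """Return a list of (src_path, submit_host, exec_host, replacement) rules."""
    rules = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 4:
            rules.append(tuple(fields))
    return rules

def apply_path_alias(path, submit_host, exec_host, rules):
    """Rewrite the leading part of path using the first matching rule."""
    for src, sub, exe, replacement in rules:
        hosts_match = sub in ("*", submit_host) and exe in ("*", exec_host)
        if hosts_match and path.startswith(src):
            return replacement + path[len(src):]
    return path

# Rule for the XRAID example above; the host names in the call are invented.
SGE_ALIASES = "/Volumes/XRAID/Users/ * * /Users/\n"

rules = parse_sge_aliases(SGE_ALIASES)
print(apply_path_alias("/Volumes/XRAID/Users/username/project", "portal", "node01", rules))
```

The trailing slashes on both paths keep the rule from matching unrelated directories that merely share a prefix.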