SGE testbeds: Simulate mass numbers of exec hosts

Posted by chris Fri, 02 May 2008 12:42:00 GMT

An interesting message appeared on the developers list recently as a comment attached to Issue 2364. In it, Andreas explains the SIMULATE_EXECDS=true parameter, which lets you create any number of execution hosts, whether or not they really exist, by suppressing the 'unknown' queue states.

I can see this as being very useful for testing SGE scheduler and policy configuration settings before implementing them on production systems.

From the comment:

This is a short HOWTO for the use of the cluster simulator:

(1) Start by installing a new SGE cluster as usual, but
install nothing more than the qmaster itself.

(2) After successful installation use qconf -mconf to set

    SIMULATE_EXECDS=true

in the qmaster_params section of sge_conf(5). This
suppresses the 'unknown' queue states.

(3) Make sure that "all.q" and any other queue you
configure does not use any 'load_thresholds'. The cluster
simulator has no means of emulating load values, so there
will be none. For that reason load_thresholds must not be
used, as they would cause load alarm queue states that
prevent the scheduler from dispatching jobs into your
queues.

(4) Use qconf -ae|-Ae to create an arbitrary number of
simulated execution hosts. The hosts need not exist, as
qmaster won't try to send anything to them anyway, but the
hostnames must be resolvable.
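
Step (4) gets tedious by hand for more than a few hosts. Here is a rough sketch of my own (not part of Andreas' HOWTO) that writes one host definition file per simulated host and feeds it to qconf -Ae. The function names, the "simhost" prefix and the DRY_RUN switch are made up for illustration. The field list is what I would expect from host_conf(5), so compare it with the qconf -se output of a real host on your version before relying on it. By default the script only prints the qconf commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: generate host definition files for simulated exec hosts.
# Assumption: the field list below matches host_conf(5) on your SGE version.
# Every generated host name must already be resolvable (e.g. via /etc/hosts).

gen_host_file() {
    # $1 = host name, $2 = output file
    cat > "$2" <<EOF
hostname              $1
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
EOF
}

add_sim_hosts() {
    # $1 = name prefix (hypothetical), $2 = number of hosts, $3 = work dir
    i=1
    while [ "$i" -le "$2" ]; do
        host=$(printf '%s%03d' "$1" "$i")
        gen_host_file "$host" "$3/$host.conf"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only show what would be executed.
            echo "would run: qconf -Ae $3/$host.conf"
        else
            qconf -Ae "$3/$host.conf"
        fi
        i=$((i + 1))
    done
}
```

For example, after creating a scratch directory, "add_sim_hosts simhost 100 /tmp/simhosts" would prepare simhost001 through simhost100.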

Optionally:

(5) If you care for scheduler runtimes set

     PROFILE=true

in the params section of sched_conf(5) using qconf -msconf.
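
Before submitting anything it may be worth checking that steps (2), (3) and (5) actually took effect. The two helpers below are my own sketch, with made-up names, and are not part of the HOWTO. They read qconf output on stdin and look for an attribute line, which means they can be tried on saved output without a running qmaster. They assume each attribute sits on a single line and do not handle backslash-continued lines.

```shell
#!/bin/sh
# Sketch of a pre-flight check for the simulator settings.
# Both helpers read "attribute   value" lines on stdin.

# Succeeds if the line for attribute $1 contains the string $2.
attr_contains() {
    awk -v key="$1" -v want="$2" '
        $1 == key { if (index($0, want) > 0) found = 1 }
        END { exit(found ? 0 : 1) }'
}

# Succeeds if attribute $1 has exactly the single value $2.
attr_equals() {
    awk -v key="$1" -v want="$2" '
        $1 == key && NF == 2 && $2 == want { found = 1 }
        END { exit(found ? 0 : 1) }'
}
```

With a running qmaster, the checks would look like "qconf -sconf | attr_contains qmaster_params SIMULATE_EXECDS=true", "qconf -sq all.q | attr_equals load_thresholds NONE" and "qconf -ssconf | attr_contains params PROFILE=true". If the queue check fails, the attribute can be changed with qconf -mq all.q.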

Now your simulated cluster is ready. You can send in
arbitrary numbers of jobs. Due to (2) and (3) the scheduler
will dispatch them and send the corresponding orders to
qmaster. Qmaster will behave as if it were starting the
jobs, but it only sets timers to ensure that job state
transitions happen as usual. What won't work are
interactive jobs (i.e. qrsh, qsh etc.) and parallel jobs
with control_slaves set to true in sge_pe(5). A job's
runtime can be controlled via the first job argument. That
means when

# qsub -b y /bin/sleep 5

is submitted, the job will finish after five seconds.
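
To put some load on the scheduler you will want more than one job. A small sketch along those lines follows. The function name and the DRY_RUN switch are my own invention, and the only qsub invocation used is the one from the comment above. By default it just prints the commands; set DRY_RUN=0 to submit for real.

```shell
#!/bin/sh
# Sketch: submit a batch of simulated jobs. The simulated runtime is taken
# from the first job argument, as described in the HOWTO.

submit_sim_jobs() {
    # $1 = number of jobs, $2 = simulated runtime in seconds
    n=1
    while [ "$n" -le "$1" ]; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only show what would be submitted.
            echo "would run: qsub -b y /bin/sleep $2"
        else
            qsub -b y /bin/sleep "$2"
        fi
        n=$((n + 1))
    done
}
```

For instance, "DRY_RUN=0 submit_sim_jobs 1000 60" would queue a thousand jobs that each "run" for a minute, and qstat should then show them moving through the usual states.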