Roland: things that affect job deletion time

Posted by chris Mon, 12 May 2008 14:45:49 GMT

In this interesting users-list thread, Roland provides some nice comments on the various things that can affect the time it takes to delete a Grid Engine job.

Of particular note is a new hash-based lookup slated for the upcoming 6.2 release, which Roland says greatly reduces job deletion time in large clusters.

From Roland's post:

...for GE 6.2 I've analyzed the hotspots deleting jobs and what I've found is:

1) The time needed to delete a job increases with the number of pending jobs in the cluster and the number of queue instances. The reason for this is the message list kept for schedd_job_info. Every message in the qstat -j output is one list element, and below that element are stored the IDs of the jobs that share the message. At job deletion time, qmaster has to loop over the whole list of messages and over all references to remove the right one. This does not scale, so for 6.2 I've added hash access to the reference ID, which greatly decreased the job deletion time in large clusters. Sadly, I don't remember the exact numbers.

To verify this you can disable schedd_job_info in the scheduler config and then delete your jobs.

2) The job script and the job itself need to be removed from the database. How long this takes depends on whether you use berkeleydb or classic spooling, and on whether you spool to local storage or to an NFS share. The faster your access to the storage, the faster you can delete jobs.

If disabling schedd_job_info doesn't help in your case, you may be affected by this point instead.

3) In 6.1u3 we introduced the parameters gdi_timeout and gdi_retries to tune this behaviour, but that is more a workaround than a real solution.
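
As an aside to Roland's point 1, here is a toy Python sketch of why a hash lookup helps. It is not Grid Engine code: the data layout, the function and class names, and the message text are all assumptions made for illustration. It models each scheduler message as a list of job IDs, then compares a full scan against a per-job index that records which messages reference a job.

```python
from collections import defaultdict


def delete_job_linear(messages, job_id):
    """Slow path: scan every message and every reference for job_id.

    `messages` maps a message text to a list of job IDs (hypothetical layout).
    Returns the number of references inspected.
    """
    visited = 0
    for refs in messages.values():
        for i, ref in enumerate(refs):
            visited += 1
            if ref == job_id:
                del refs[i]  # safe: we stop iterating this list right away
                break
    return visited


class IndexedMessages:
    """Same data plus a hash index from job ID to the messages that hold it."""

    def __init__(self):
        self.refs_by_message = defaultdict(set)
        self.messages_by_job = defaultdict(set)

    def add(self, message, job_id):
        self.refs_by_message[message].add(job_id)
        self.messages_by_job[job_id].add(message)

    def delete_job(self, job_id):
        """Fast path: touch only the messages that reference job_id."""
        touched = 0
        for message in self.messages_by_job.pop(job_id, ()):
            self.refs_by_message[message].discard(job_id)
            touched += 1
        return touched


if __name__ == "__main__":
    n_messages, n_jobs = 50, 1000  # made-up sizes, not measurements

    # Hypothetical message text, one message per queue instance.
    plain = {f"cannot run in queue instance q{m}": list(range(n_jobs))
             for m in range(n_messages)}
    indexed = IndexedMessages()
    for message, refs in plain.items():
        for job_id in refs:
            indexed.add(message, job_id)

    last_job = n_jobs - 1
    print(delete_job_linear(plain, last_job))  # 50000 references inspected
    print(indexed.delete_job(last_job))        # 50 messages touched
```

In this model the scan cost grows with messages times pending jobs, while the indexed delete grows only with the number of messages that mention the job. That matches the scaling behaviour Roland describes, though the real qmaster data structures may differ.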