One of the shortcomings of Grid Engine 6.0 (as compared to 5.3) is that a gap exists between when a job finishes and when the job shows up in the accounting tool. The result is that for a few seconds after a job ends, it is as though the job never existed. Many people have complained about it, and now it is fixed. With the next release of Grid Engine, that gap can be closed.
A little background. One of the big changes going from 5.3 to 6.0 was making the qmaster multi-threaded. In 5.3, the qmaster was one big loop. With 6.0, the qmaster now runs in about 14 different threads. Among those threads is a timed event thread, which is used to do things on a periodic basis. Many tasks that were handled in the main loop in 5.3 moved into the timed event thread in 6.0. One of the things that moved over was the writing of accounting and reporting data to disk. In 5.3, the data was written as soon as it was available. In 6.0, to improve performance, the data is buffered before being written. The buffer period is controlled by the flush_time parameter of the reporting_params in the global host configuration. The minimum value for this setting is 1 second.
The problem comes from the fact that two separate buffers are being written at the same time. One is the accounting information buffer, which is what qacct uses to find historical job data. The other is the reporting information buffer, which is used by ARCo to create a utility computing database. Because the reporting buffer produces massive amounts of information, setting flush_time to any small value in a normal system results in an overloaded qmaster, so flush_time has to stay fairly large. And since both buffers are only flushed every flush_time seconds, there’s a gap between when the accounting information for a finished job enters the buffer and when it is written to the accounting file.
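To make the size of that gap concrete, here is a minimal Python sketch. It is only a model, not qmaster code: it assumes flushes happen at exact multiples of flush_time counted from time zero, and the function name visible_at is made up for illustration.

```python
import math


def visible_at(finish_time: int, flush_time: int) -> int:
    """Return the time (in seconds) at which a record buffered at
    finish_time would reach the accounting file.

    Simplifying assumption: flushes fire at 0, flush_time,
    2 * flush_time, ... A flush_time of 0 models "no buffering".
    """
    if flush_time <= 0:
        return finish_time
    # Round up to the next flush boundary.
    return math.ceil(finish_time / flush_time) * flush_time


if __name__ == "__main__":
    # Hypothetical numbers: a job ends at t=16s with a 15s flush interval.
    finish, interval = 16, 15
    print("job visible after", visible_at(finish, interval) - finish, "seconds")
```

Under this model a job that ends just after a flush waits almost a full flush_time before qacct can see it, which is the gap people were running into.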
To fix this problem, we split the flush event into two: one for accounting and one for reporting. Now, it’s possible to set the accounting_flush_time, which controls the accounting flush interval, separately from the flush_time, which controls the reporting flush interval. In addition, if the accounting_flush_time is set to “00:00:00”, accounting information will not be buffered at all. It is instead written directly to the accounting file. (OK. In reality, it’s still written to the buffer, but the buffer is immediately flushed.) To maintain backward compatibility, the accounting_flush_time parameter is optional. If it is not set, the accounting flush interval will be set to the flush_time, i.e. the accounting and reporting buffers will be flushed at the same time, just like with the original 6.0 behavior.
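The fallback rule can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the actual qmaster implementation: the helper names are hypothetical, and it assumes reporting_params is a space-separated list of key=value pairs with times written as HH:MM:SS and flush_time always present.

```python
from typing import Tuple


def parse_hms(value: str) -> int:
    """Convert an HH:MM:SS string into seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def flush_intervals(reporting_params: str) -> Tuple[int, int]:
    """Return (accounting_interval, reporting_interval) in seconds.

    If accounting_flush_time is absent, the accounting interval falls
    back to flush_time, matching the original 6.0 behavior. A result
    of 0 for accounting means "flush immediately".
    """
    params = dict(item.split("=", 1) for item in reporting_params.split())
    reporting = parse_hms(params["flush_time"])
    if "accounting_flush_time" in params:
        accounting = parse_hms(params["accounting_flush_time"])
    else:
        accounting = reporting
    return accounting, reporting


if __name__ == "__main__":
    # Example values only; use whatever your own configuration contains.
    old_style = "accounting=true reporting=true flush_time=00:00:15"
    new_style = old_style + " accounting_flush_time=00:00:00"
    print(flush_intervals(old_style))
    print(flush_intervals(new_style))
```

With the old-style string both intervals come out the same, and with accounting_flush_time=00:00:00 added the accounting interval drops to zero while the reporting interval is left alone.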
To get this fix, you either have to download the latest version of the maintrunk or s2 branch of the Grid Engine source, or you have to wait for the next release. You then must set the accounting_flush_time parameter in the reporting_params of the global host config. (See qconf -mconf global.)