[galaxy-dev] Possible Bug in SGE Runner when recovering jobs
gordon at cshl.edu
Fri Aug 14 13:39:29 EDT 2009
I'm not sure whether this is a bug or a problem with my configuration, but:
1. universe_wsgi.ini has:
track_jobs_in_database = True
enable_job_recovery = True
2. A tool (which uses the SGE runner) is in 'queued' state:
psql# select id,tool_id,state from job where state = 'queued';
 id |     tool_id     | state
----+-----------------+--------
 39 | Show beginning1 | queued
3. Starting galaxy with "sh run.sh" gives:
Traceback (most recent call last):
File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/web/buildapp.py", line 61, in app_factory
app = UniverseApplication( global_conf = global_conf, **kwargs )
File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/app.py", line 64, in __init__
self.job_manager = jobs.JobManager( self )
File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 35, in __init__
self.job_queue = JobQueue( app, self.dispatcher )
File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 112, in __init__
File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 137, in __check_jobs_at_startup
self.dispatcher.recover( job, job_wrapper )
File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 661, in recover
self.job_runners[runner_name].recover( job, job_wrapper )
File "/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/runners/sge.py", line 358, in recover
sge_job_state.old_state = DRMAA.Session.QUEUED
AttributeError: type object 'Session' has no attribute 'QUEUED'
The immediate cause is that the DRMAA Session class has no "QUEUED" state.
The closest match is the "QUEUED_ACTIVE" state, as is evident from both sge.py (lines 18-28) and DRMAA.py (line 363).
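A minimal sketch of the kind of change this suggests, assuming the constant name is the only problem. The Session class below is a stand-in so the snippet runs without the DRMAA module (its values are placeholders, only the names matter), and recovered_old_state is a hypothetical helper for illustration, not something that exists in sge.py:

```python
class Session:
    """Stand-in for DRMAA.Session; the values are placeholders."""
    QUEUED_ACTIVE = "queued_active"
    RUNNING = "running"


def recovered_old_state(galaxy_state, session_cls=Session):
    """Map the state stored in Galaxy's job table to a DRMAA state name.

    Hypothetical helper: it only illustrates using QUEUED_ACTIVE where
    the traceback shows the nonexistent QUEUED being referenced.
    """
    if galaxy_state == "queued":
        return session_cls.QUEUED_ACTIVE
    if galaxy_state == "running":
        return session_cls.RUNNING
    raise ValueError("cannot recover a job in state %r" % (galaxy_state,))


# The attribute named in the traceback is absent; the replacement is present.
assert not hasattr(Session, "QUEUED")
assert hasattr(Session, "QUEUED_ACTIVE")
```

In sge.py itself this would presumably amount to assigning DRMAA.Session.QUEUED_ACTIVE to sge_job_state.old_state on line 358, but I have not verified that nothing else depends on it.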
On a related note, I'm not quite sure how job recovery works with SGE:
Let's say a job is queued with qsub, then Galaxy is stopped.
The database still says "queued", but in the meantime the SGE cluster might run the job, and even complete it.
So the job can in fact be 'running', 'ok', or 'error', while Galaxy still thinks it is queued.
What will happen when Galaxy is restarted?
Is the job automatically restarted, or is its new state queried from SGE?