Recover from "stuck" states on compute manager start-up
If a compute manager stops or fails during certain operations, the instance is left stuck with a transitional task_state. Ideally, during compute manager start-up, we would identify instances in these states and transition them to a logical stable state.
Blueprint information
- Status:
- Started
- Approver:
- Dan Smith
- Priority:
- Medium
- Drafter:
- None
- Direction:
- Needs approval
- Assignee:
- David McNally
- Definition:
- Review
- Series goal:
- None
- Implementation:
- Needs Code Review
- Milestone target:
- next
- Started by
- Dan Smith
- Completed by
Related branches
Related bugs
Bug #1228804: Need raise exception when performing action on instance if compute service is down | Expired |
Sprints
Whiteboard
Moved to -next as this has two -2s and no responses --dansmith
Sponsors: John Garbutt and Dan Smith
(taken from https:/
Cleaning up "Stuck" instance state
What do you mean by "Stuck" ?
"Stuck" state in this context occurs when an action fails to complete in the compute manager.
Typically seen on failure / restart
Why do you care ?
In some cases, as state gates actions, it stops you from being able to move forwards
Relying on the user to clean up is a real pain when you want to migrate an instance
It's confusing for the users (which means we have to spend time diagnosing and helping to fix it)
Isn't this all going to be fixed by the task manager / clean-shutdown ?
Probably - but there are some even quicker wins that also help towards that, and some issues that
are also going to be relevant to task manager.
Basic premise: The one time you know there is no running thread in the compute manager is during start-up.
At that point there are some task states that can be safely cleared / re-processed. The tricky thing is to
disambiguate between an action which has started and failed to complete, and an action which is actually still
on the message queue (given that the compute manager may have been down for some time)
A bit of history:
We tried to address all of these and disambiguate the "still queued" case by recording the task_state seen on the compute manager at the
start of the action, but that was (rightly) blocked because it involved more DB access and is going to be fixed by task manager.
We are now re-working some easier cases that don't need the disambiguation.
https:/
Easy cases:
Deleting: It's always safe to go ahead and rerun the delete.
Building: Can always be put into an error state. If the message was still on the queue, instance.host won't have been set
Image_pending_
Powering Off: re-run the power off. If the VM is already off, or the request is in the queue this is a no-op.
Powering On: re-run the power on. If the VM is already on, or the request is in the queue this is a no-op.
All accepted as worth doing - submit as separate patches
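The easy cases above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the Nova implementation: the `Instance` class, the task_state strings, the action names and `plan_recovery` are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical task_state strings used only for this sketch.
DELETING = "deleting"
BUILDING = "building"
POWERING_OFF = "powering-off"
POWERING_ON = "powering-on"


@dataclass
class Instance:
    """Minimal stand-in for an instance record (not the Nova object)."""
    uuid: str
    task_state: Optional[str]
    host: Optional[str] = None


def plan_recovery(instance: Instance, this_host: str) -> str:
    """Return the start-up action for one instance (action names are illustrative)."""
    if instance.host != this_host:
        # instance.host is not set to this host, e.g. a build request that is
        # still on the queue: this compute manager never started work on it.
        return "skip"
    if instance.task_state == DELETING:
        return "rerun_delete"      # always safe to rerun the delete
    if instance.task_state == BUILDING:
        return "set_error"         # a started build can always go to error
    if instance.task_state == POWERING_OFF:
        return "rerun_power_off"   # no-op if the VM is already off
    if instance.task_state == POWERING_ON:
        return "rerun_power_on"    # no-op if the VM is already on
    return "leave"                 # not one of the easy cases
```

Running this only at start-up matters: it is the one point where no other thread in the compute manager can be working on the instance.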
Harder cases:
Image_snapshot: (Set in API) - could be cleared on start-up and re-asserted on the compute manager at the start
of snapshot to cover the case of a still queued request
Rebooting:
If the VM isn't running - reboot it (risk is a second reboot)
If the VM is running - just clear the status (risk is a user needs to make another reboot)
Accepted to add additional task_state value to be set on compute manager to disambiguate the queued vs started case
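The rebooting rule above, before any extra task_state is added, amounts to a single check on the VM's power state. A minimal sketch, with a hypothetical function name and illustrative action strings:

```python
def plan_reboot_recovery(vm_running: bool) -> str:
    """Decide what to do with an instance found in a rebooting task_state.

    vm_running is assumed to come from the hypervisor's view of the VM.
    """
    if not vm_running:
        # The reboot may have stopped the VM and never restarted it.
        # Risk: if the request is still queued, the VM is rebooted twice.
        return "reboot"
    # The VM is up, so just clear the task_state.
    # Risk: if the reboot never ran, the user has to request it again.
    return "clear_task_state"
```

Both branches carry a residual risk, which is why the extra task_state set on the compute manager was accepted as the way to tell queued and started requests apart.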
Even harder:
Rebuilding: Would be nice to be able to treat this like Building and go to an error state, but we can't use instance.host to
disambiguate. We could do something here if we add an extra task state (Rebuild_started) that is set immediately on the
compute manager. Could use the same approach to remove the risk of missed / additional reboots.
As above
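A rough sketch of how an extra task state could separate a queued rebuild from a started one. The string values and the function are assumptions for illustration; only the name Rebuild_started comes from the notes above.

```python
REBUILDING = "rebuilding"            # assumed value, set by the API
REBUILD_STARTED = "rebuild_started"  # assumed value, set first thing on the compute manager


def plan_rebuild_recovery(task_state: str) -> str:
    """Return the start-up action for an instance in a rebuild-related state."""
    if task_state == REBUILD_STARTED:
        # The compute manager had begun the rebuild and did not finish it,
        # so it can be treated like Building and put into an error state.
        return "set_error"
    if task_state == REBUILDING:
        # Only the API has touched the instance; the request may still be
        # on the message queue and will be processed normally.
        return "leave_queued"
    return "leave"
```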
Gerrit topic: https:/
Addressed by: https:/
Recover from IMAGE-* state on compute manager start-up
Gerrit topic: https:/
Addressed by: https:/
Recover from build state on compute manager start-up
Addressed by: https:/
Make compute manager _init_instance use native objects
Addressed by: https:/
Recover from REBOOT-* state on compute manager start-up
[parthipan] how about task_state 'migrating'?
Still depends on https:/
Gerrit topic: https:/
Addressed by: https:/
Cleanup 'deleting' instances on restart
Addressed by: https:/
Recover from POWERING-* state on compute manager start-up
Addressed by: https:/
Clean IMAGE_SNAPSHOT_
===============
Remaining patches
===============
Addressed by: https:/
Recover from POWERING-* state on compute manager start-up
Addressed by: https:/
Recover from REBOOT-* state on compute manager start-up
I would love this to merge, since it's so close, promoting to high --johnthetubaguy
Unapproved - please re-submit via nova-spec --johnthetubagy (20th March 2014)
Addressed by: https:/
Recover from POWERING-* state on compute manager start-up