Cluster healing
Cluster nodes can become unresponsive or unusable because of lost connectivity, crashed daemons, hardware faults and many other reasons. When this happens, the COE (container orchestration engine) marks them as unusable or 'not ready' and reschedules workloads elsewhere, leaving the cluster at reduced capacity.
Magnum could handle cluster recovery (healing) by triggering node replacement or recovery whenever an issue is found. This could be done in two ways:
* triggered by the user, via an openstack coe cluster heal <cluster-id> command
* triggered by a periodic task, monitoring the state of the clusters
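The two triggers above could share a single healing routine. The following is a minimal sketch of that idea, not Magnum's implementation: the names Node, find_unhealthy, heal_cluster and periodic_heal, as well as the replace_node callback, are hypothetical and exist only for illustration. It assumes the COE exposes a per-node readiness flag.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical view of a cluster node as reported by the COE."""
    name: str
    ready: bool  # False when the COE marks the node 'not ready'


def find_unhealthy(nodes: List[Node]) -> List[str]:
    """Return the names of nodes the COE reports as not ready."""
    return [node.name for node in nodes if not node.ready]


def heal_cluster(nodes: List[Node],
                 replace_node: Callable[[str], None]) -> List[str]:
    """Replace every not-ready node and return the names acted upon.

    A user-issued heal command would call this once for one cluster.
    replace_node is a placeholder for whatever recovery action is chosen.
    """
    unhealthy = find_unhealthy(nodes)
    for name in unhealthy:
        replace_node(name)
    return unhealthy


def periodic_heal(clusters: Dict[str, List[Node]],
                  replace_node: Callable[[str], None]) -> Dict[str, List[str]]:
    """One tick of a periodic task: run the same routine on every cluster."""
    return {cluster_id: heal_cluster(nodes, replace_node)
            for cluster_id, nodes in clusters.items()}
```

Keeping the detection and replacement logic in one function means the user-triggered and periodic paths differ only in who calls it and how often.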
Blueprint information
- Status: Not started
- Approver: Spyros Trigazis
- Priority: Medium
- Drafter: Ricardo Rocha
- Direction: Approved
- Assignee: Ricardo Rocha
- Definition: Approved
- Series goal: Accepted for rocky
- Implementation: Unknown
- Milestone target: rocky-final
- Started by
- Completed by
Whiteboard
Gerrit topic: https:/
Addressed by: https:/
Add Cluster Healing specification
Addressed by: https:/
Add health_status and health_
Gerrit topic: https:/
Addressed by: https:/
Add health_status and health_