Applier service failure management
This blueprint proposes enhancing the Watcher service to actively monitor the health of all running watcher-applier services. This will improve the handling of both unexpected applier failures and horizontal scaling, ensuring that ActionPlans are not left stranded when an applier crashes or is scaled down, which currently leaves ActionPlans stuck indefinitely in ONGOING or PENDING status. The primary goal is to maintain system resilience and consistency by automatically acting on the stranded workload even if the applier assigned to an ActionPlan has failed.
The implementation of this blueprint consists of a solution similar to the existing one for the decision-engine (see existing improvement activities in https:/
- The service monitor will detect when an applier service is FAILED, based on the same parameters and algorithm used for the decision-engine.
- When an applier is marked as FAILED, the monitor will find every unfinished ActionPlan assigned to it and:
  - ActionPlans in ONGOING state will be cancelled, and a message will be added to the status_message field of the ActionPlan.
  - ActionPlans in PENDING state will be unassigned (the hostname field will be emptied) and a new launch_action_plan RPC message will be sent. That way, any available applier can pick up the plan and execute it.
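The handling of stranded ActionPlans described above can be sketched as follows. This is a minimal illustration only, not the Watcher implementation: the `ActionPlan` dataclass, the `handle_failed_applier` function, the wording of the status message, and the `launch_action_plan` callable (standing in for the RPC call) are all assumptions made for the example. Only the state names, the `hostname` and `status_message` fields, and the `launch_action_plan` message name come from the blueprint text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# State names taken from the blueprint text; CANCELLED is assumed to be
# the resulting state of a cancelled ActionPlan.
ONGOING = "ONGOING"
PENDING = "PENDING"
CANCELLED = "CANCELLED"


@dataclass
class ActionPlan:
    """Hypothetical stand-in for an ActionPlan record."""
    uuid: str
    state: str
    hostname: Optional[str]
    status_message: Optional[str] = None


def handle_failed_applier(
    failed_host: str,
    action_plans: Iterable[ActionPlan],
    launch_action_plan: Callable[[str], None],
) -> List[str]:
    """Act on ActionPlans stranded on a failed applier.

    Returns the UUIDs of the plans that were re-submitted.
    """
    relaunched = []
    for plan in action_plans:
        if plan.hostname != failed_host:
            continue  # assigned to a different (or no) applier
        if plan.state == ONGOING:
            # Execution was interrupted: cancel and record the reason.
            plan.state = CANCELLED
            plan.status_message = (
                f"Cancelled because applier {failed_host} failed"
            )
        elif plan.state == PENDING:
            # Not started yet: unassign and re-submit so that any
            # available applier can pick it up.
            plan.hostname = None
            launch_action_plan(plan.uuid)
            relaunched.append(plan.uuid)
    return relaunched
```

In this sketch, plans already in a finished state are left untouched, and the caller is expected to persist the modified plans; how the real service stores and filters ActionPlans is outside the scope of the example.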
This logic will be implemented in a new service monitor running in the watcher-applier service, which will reuse as much code and logic as possible from the existing service monitor for the decision_engine.
This patch has no impact on the Watcher APIs or config options.
Blueprint information
- Status:
- Not started
- Approver:
- Sean Mooney
- Priority:
- Undefined
- Drafter:
- Alfredo Moralejo
- Direction:
- Approved
- Assignee:
- Alfredo Moralejo
- Definition:
- Approved
- Series goal:
- None
- Implementation:
- Not started
- Milestone target:
- None
- Started by
- Completed by
