Workflow error analysis
Now:
When a workflow fails, it can be hard to find the root cause quickly.
From the CLI, the only way to do this without creating a new execution is to run a sequence of commands like:
* 'mistral task-list <workflow execution id>' and see which tasks are in ERROR
* for each failed task execution run 'mistral action-
* for each failed action run 'mistral action-
* for each failed task execution of type Workflow, find the sub-workflow execution ID, and go back to the first bullet.
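The manual procedure above is a recursive walk over task executions. A minimal sketch of that walk follows, assuming the executions have already been fetched into memory. The ActionExecution and TaskExecution classes, the sub_workflow_execution_id field and the collect_errors function are hypothetical names used only for illustration; they are not the Mistral client API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ActionExecution:
    # Hypothetical, simplified stand-in for an action execution record.
    name: str
    state: str
    output: str = ""


@dataclass
class TaskExecution:
    # Hypothetical, simplified stand-in for a task execution record.
    name: str
    state: str
    actions: List[ActionExecution] = field(default_factory=list)
    # Set only for tasks of type Workflow: ID of the sub-workflow execution.
    sub_workflow_execution_id: Optional[str] = None


def collect_errors(
    execution_id: str,
    tasks_by_execution: Dict[str, List[TaskExecution]],
    path: Tuple[str, ...] = (),
) -> List[Tuple[str, str, str]]:
    """Return (task path, action name, output) for every failed action."""
    errors = []
    for task in tasks_by_execution.get(execution_id, []):
        if task.state != "ERROR":
            continue
        task_path = path + (task.name,)
        # Failed actions that belong directly to this task.
        for action in task.actions:
            if action.state == "ERROR":
                errors.append((" > ".join(task_path), action.name, action.output))
        # Failed task of type Workflow: go back to the first step for the child.
        if task.sub_workflow_execution_id is not None:
            errors.extend(
                collect_errors(
                    task.sub_workflow_execution_id, tasks_by_execution, task_path
                )
            )
    return errors


# Invented sample data: a parent workflow whose "deploy" task runs a sub-workflow.
EXAMPLE = {
    "parent": [
        TaskExecution("prepare", "SUCCESS", [ActionExecution("std.echo", "SUCCESS")]),
        TaskExecution("deploy", "ERROR", sub_workflow_execution_id="child"),
    ],
    "child": [
        TaskExecution(
            "call_api", "ERROR", [ActionExecution("std.http", "ERROR", "invalid URL")]
        ),
    ],
}

for task_path, action_name, output in collect_errors("parent", EXAMPLE):
    print(f"{task_path}: {action_name} failed: {output}")
```

Done by hand from the CLI, each level of this recursion costs at least one command per failed task, which is why the procedure becomes slow for deeply nested workflows.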
It is also possible to create and execute a workflow that uses "publish" to expose all tasks and all sub-workflow tasks recursively (optionally filtered to tasks in the error state). Example: http://
The goal:
Mistral should provide a single command that shows a report on failed actions and how they affected the entire workflow execution. This report should also account for nested workflows.
Solution ideas/steps:
* Write a spec
* It could be implemented on the client side or the server side. The latter is faster because it avoids making many REST requests.
Testing:
* Functional tests that simulate workflow failures and verify that the report is correct.
Error examples:
* YAQL expression failed: http://
* HTTP action failed because of an invalid URL: http://
Notes:
* One of the current problems is the clarity of error information. It is not easy to understand what the precise error is even when we can see it.
* Idea: separate the actual error info from contextual information (e.g. stack trace)
* Idea: give an option to report inbound context and outbound context for each task
* Idea: use some sort of classification for all possible errors
* Idea: have a separate REST API endpoint to build reports on the current status of the execution and/or error analysis
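Two of the ideas above, separating the error message from its context and classifying errors, can be sketched together. This is only an illustration under loud assumptions: the ErrorKind categories, the ErrorInfo class, the split_error function and the keyword-based classification are all hypothetical, and the sample error text is invented rather than taken from real Mistral output.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    # Hypothetical classification; a real one would be defined in the spec.
    EXPRESSION = "expression"
    ACTION = "action"
    UNKNOWN = "unknown"


@dataclass
class ErrorInfo:
    kind: ErrorKind
    message: str  # the actual error, one line
    context: str  # everything else, e.g. a stack trace


def split_error(raw: str) -> ErrorInfo:
    """Treat the first line as the error and the rest as context."""
    lines = raw.strip().splitlines()
    message = lines[0] if lines else ""
    context = "\n".join(lines[1:])

    # Naive keyword matching, used here only to show the shape of the idea.
    lowered = message.lower()
    if "yaql" in lowered:
        kind = ErrorKind.EXPRESSION
    elif "http" in lowered or "url" in lowered:
        kind = ErrorKind.ACTION
    else:
        kind = ErrorKind.UNKNOWN
    return ErrorInfo(kind, message, context)


# Invented sample input, not real Mistral output.
raw_error = (
    "YAQL expression failed: unknown function\n"
    "Traceback (most recent call last):\n"
    '  File "x.py", line 1'
)
info = split_error(raw_error)
print(info.kind.value, "|", info.message)
```

A report could then show only the message and kind by default and reveal the context on request.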
Decisions:
* Write a spec first
* Add a new endpoint to generate "Workflow error analysis" reports. The same endpoint can also generate a report on the current progress of a workflow that has not necessarily failed. It can be used, for example, by a UI to track the current state of an execution.
Blueprint information
- Status: Complete
- Approver: Renat Akhmerov
- Priority: High
- Drafter: Renat Akhmerov
- Direction: Approved
- Assignee: Renat Akhmerov
- Definition: Approved
- Series goal: Accepted for stein
- Implementation: Implemented
- Milestone target: stein-3
- Started by: Renat Akhmerov
- Completed by: Renat Akhmerov
Related bugs
Bug #1643840: How to check where are the bugs when the execution status is error (status: Invalid)
Whiteboard
https:/
spec: https:/
patches related to it:
https:/
https:/
Gerrit topic: https:/
Addressed by: https:/
WIP: add a workflow execution report endpoint