Monitoring and managing device health in LAVA
We have begun running automated daily "health" tests on the devices in LAVA. These jobs use known-good images that have previously passed. The goal of this testing is to expose issues with the hardware, infrastructure, or LAVA itself that could prevent test jobs from passing when they normally should. The jobs are intentionally quick-running so that they interfere with other jobs as little as possible. We should consider better ways of visualizing the results so that we can more easily see when something looks unhealthy, and how LAVA could eventually respond by offlining boards once we can be sure that failures in the health checks equate to a problem with the device itself.
Some of the things we should discuss are:
1. Health check UI
Spring has been working on this. We should review what he has so far and talk about what we'd like to see here to help us more easily track the health of machines.
2. Automatic detection and response to problems
Once we are comfortable that these jobs will ONLY fail when there is a real problem, we should make sure we have the pieces in place to automatically offline the board and notify the team that something needs to be looked at.
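As a starting point for discussion, the offline-and-notify behaviour described in item 2 could be sketched roughly as below. This is only an illustration, not existing LAVA code: the `HealthMonitor` class, its `threshold` parameter, the `notify` callback, and the device name are all hypothetical. The consecutive-failure threshold in particular is an assumption; setting it to 1 would offline a board on its first failed health check.

```python
from typing import Callable, Dict, List


class HealthMonitor:
    """Hypothetical helper: tracks consecutive health-check failures per device."""

    def __init__(self, threshold: int = 3,
                 notify: Callable[[str], None] = print) -> None:
        self.threshold = threshold          # assumed policy, not from LAVA
        self.notify = notify                # e.g. send mail or post to IRC
        self.failures: Dict[str, int] = {}  # consecutive failures per device
        self.offline: List[str] = []        # devices taken out of rotation

    def record(self, device: str, passed: bool) -> None:
        """Record one health-check result and offline the device if needed."""
        if passed:
            # A pass resets the counter; it does not bring a board back
            # online, since someone should look at it first.
            self.failures[device] = 0
            return
        self.failures[device] = self.failures.get(device, 0) + 1
        count = self.failures[device]
        if count >= self.threshold and device not in self.offline:
            self.offline.append(device)
            self.notify(f"{device} offlined after {count} failed health checks")
```

For example, `HealthMonitor(threshold=2).record("panda01", False)` called twice in a row would add `panda01` to `offline` and send a single notification; further failures on an already-offline board would not notify again.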
Blueprint information
- Status: Not started
- Approver: Paul Larson
- Priority: Undefined
- Drafter: Spring Zhang
- Direction: Needs approval
- Assignee: Spring Zhang
- Definition: Discussion
- Series goal: None
- Implementation: Unknown
- Milestone target: None
- Started by
- Completed by