Improving Reliability of Software RAID
mdadm currently has a gaggle of open bugs, and every cycle the RAID ISO tests turn up new and interesting ones. It seems we are doing something wrong with Software RAID. Some proposed solutions are collected here: https:/
Blueprint information
- Status:
- Started
- Approver:
- Steve Langasek
- Priority:
- High
- Drafter:
- Dimitri John Ledkov
- Direction:
- Approved
- Assignee:
- Dimitri John Ledkov
- Definition:
- Approved
- Series goal:
- Accepted for raring
- Implementation:
- Started
- Milestone target:
- ubuntu-13.04
- Started by
- Steve Langasek
- Completed by
Related branches
Related bugs
Whiteboard
Past Points:
[kees] Collect historical work done on improving raid: TODO
[kees] Write detailed specification of mdadm initramfs requirements: TODO
[kees] Write detailed specification of mdadm post-initramfs requirements: TODO
Test RAID over LVM and LVM over RAID: TODO
Notes from etherpad:
there are a lot of bugs
- put together a tree of failure conditions
- map the intention of how to deal with it
- check existing code against intention, fix deltas
- suggest/recommend smart monitoring in servers
- Look into automated testing of ALL supported RAID modes
- Test case for LVM over RAID
- Investigate and test booting without mdadm.conf
- Investigate not autostarting certain arrays
- Interface with upstream for feedback (invite to UDS-P)
- Fully document (maybe in conjunction with upstream?) and review existing documentation around software RAID debugging and general maintenance.
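One way to make the automated-testing item above concrete is to exercise an array on loop devices, where a "drive failure" can be injected safely. A minimal sketch, not an implementation plan — it assumes mdadm and loop-device support are present, and the /dev/md/testmd name and /tmp image paths are placeholders:

```shell
#!/bin/sh
# Create a RAID1 array on loop devices and walk it through a
# fail/remove/re-add cycle (sketch; needs root, mdadm, loop devices).
if [ "$(id -u)" -ne 0 ] || ! command -v mdadm >/dev/null 2>&1 \
        || [ ! -e /dev/loop-control ]; then
    result=SKIP                            # environment cannot run the test
else
    truncate -s 64M /tmp/r0.img /tmp/r1.img
    d0=$(losetup --show -f /tmp/r0.img)
    d1=$(losetup --show -f /tmp/r1.img)
    mdadm --create /dev/md/testmd --level=1 --raid-devices=2 --run "$d0" "$d1"
    mdadm /dev/md/testmd --fail "$d0"      # failure mode: failed drive at runtime
    mdadm /dev/md/testmd --remove "$d0"    # failure mode: removed drive
    mdadm /dev/md/testmd --add "$d0"       # out-of-sync drive re-added; resyncs
    mdadm --stop /dev/md/testmd
    losetup -d "$d0" "$d1"
    rm -f /tmp/r0.img /tmp/r1.img
    result=OK
fi
echo "$result"
```

The same loop-device scaffolding would extend to the other supported levels (--level=0/5/6/10) by varying --level and the number of images.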
Information about disks and arrays is spread across multiple locations, which can be combined to work out what is where and what has happened, for example:
** ll /dev/disk/by-id/
** mdadm --detail /dev/md127
** cat /proc/mdstat
** messages from the kernel reference "ata*.*" ports, with no easy way to trace them back to a "real" /dev/sd* device
** lshw
** how to physically identify which drive is which? Often by running "dd if=/dev/sd* of=/dev/null" (where * is the failed drive) and watching which drive's activity light goes solidly active.
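For the "ata*.*" problem specifically, a first step is just pulling the offending port names out of kernel-log text; mapping them to /dev/sd* can then go through the /sys/block/<dev>/device symlink, whose resolved path contains the matching ataN component on libata systems. A sketch (the sample log line is made up for illustration):

```shell
# List the ata ports that reported errors in kernel-log text.
ata_errors() { grep -o 'ata[0-9][0-9]*\.[0-9][0-9]*' | sort -u; }

sample='ata3.00: failed command: READ FPDMA QUEUED'
ports=$(printf '%s\n' "$sample" | ata_errors)
echo "$ports"    # -> ata3.00
```

From there, `readlink -f /sys/block/sda/device` typically yields a sysfs path containing the corresponding ataN element, which closes the loop back to a /dev/sd* name.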
- The upstream documentation https:/
Intentions
- Preserve Data Integrity
- Detect Known Failure Modes as early as possible
- Allow the system to run and reboot fine even if partial hardware failures have occurred.
- Provide Options for how to handle failure modes:
- The BOOT_DEGRADED=false option allows admins to safeguard against mdadm bugs and to do recovery manually, but should default to true.
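For reference, on Ubuntu this knob is an initramfs setting; a sketch of what enabling degraded boot looks like, assuming the conf.d layout used by initramfs-tools:

```shell
# /etc/initramfs-tools/conf.d/mdadm
# Boot even if an array comes up degraded (the default proposed above).
BOOT_DEGRADED=true
```

After changing it, the initramfs has to be regenerated (sudo update-initramfs -u) for the setting to take effect at boot.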
Failure Modes
- Degraded array at boot
- Failed drive at runtime
- Removed drive at runtime (where metadata is intact)
- Adding out-of-sync drive
- Adding failed drive
- Drives producing corrupt reads without failure
- Old RAID configuration resurrection
- LVM starting up on mirror halves
- Hardware is fail_ing_ (SMART details)
- can we link the SMART data to some kind of user reporting?
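Several of these modes ultimately surface in /proc/mdstat, so detection of the first one (degraded array) can be driven from that file. A sketch, here fed a made-up sample instead of the live file:

```shell
# Print the names of degraded arrays: an '_' in the [UU...] member
# map on an array's status line means a missing or failed member.
degraded_arrays() {
    awk '/^md/ { name = $1 }
         $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }'
}

sample='md0 : active raid1 sdb1[1]
      10476544 blocks super 1.2 [2/1] [U_]'
printf '%s\n' "$sample" | degraded_arrays    # -> md0
```

In real use this would read the live file (`degraded_arrays < /proc/mdstat`) from a monitoring hook rather than a sample string.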
Validate the behavior of drivers/
- does the driver notice a yanked drive?
- does the driver notice a failed drive?
- how does the driver react to a new drive getting inserted?
** New drive being inserted with alternate raid config
** Some controllers are hot swap, some not, how to identify?
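When a bay cannot be physically yanked, the kernel's SCSI sysfs interface can fake removal and trigger a rescan. Two helper sketches — the device and host names in the example are placeholders, and this should only ever be pointed at a disposable test machine:

```shell
# "Yank" a drive in software: detach it from the SCSI layer.
yank_drive()  { echo 1 > "/sys/block/$1/device/delete"; }

# Ask a SCSI host to rescan all channels/targets/LUNs for (re)inserted drives.
rescan_host() { echo '- - -' > "/sys/class/scsi_host/$1/scan"; }

# Example (requires root and real hardware):
#   yank_drive sdb
#   rescan_host host0
```

This gives a software-only way to answer the "does the driver notice a yanked drive?" questions above, independent of whether the controller supports hot swap.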
Debugging failure cases (user side?)
- logic to align dm/ata information as expressed in dmesg etc with the /dev/sdx devices that mdadm knows about
Links:
https:/
drussell 2011-05-17: Added more content to the etherpad session post UDS...
slangasek 2011-10-31: this has been reproposed for a session at UDS-P, but I don't think any of the facts have changed. Why is another session needed for this?
cbyrum 2011-10-31: Agreed Steve, this just needs to get done, nothing has changed.
drussell - 2011-10-31: Absolutely agreed... so how do we focus on getting this done?
dmitrij.ledkov 2012-05-18: Adding foundations-
foundations-
Work Items
Work items:
create (Ubuntu|Upstream) RAID Architecture Specification: INPROGRESS
update existing (i.e. out-of-date) RAID documentation: TODO
identify and document all failure conditions: TODO
investigate if foundations-
test existing handling of failure modes for RAID: TODO
establish automated RAID testing and RAID failure condition testing: BLOCKED
review blueprint linked bugs for likelihood of fixing this cycle: DONE
get input from kees on the whole topic of Reliable Raid: DONE
backport critical reliability fixes to 12.04 LTS: DONE
Dependency tree