Dead disk during expansion

Failed expansion + dead disk = lost data

A few weeks ago I replaced one of the disks in my ReadyNAS NVX with a larger one. The expansion process seemed to complete successfully… This morning one of my disks went bad…maybe due to a brief power outage…The NAS appeared to just be in a weird state, so I powered it off cleanly, restarted it, and told it to do another scan.

Now the NAS is telling me that disk #1… is “spare”, not part of the RAID array. It says disk #3, the one that appeared to fail this morning, is just gone. Since there are two “failed” disks, the array is “dead” and my data is gone.

Is there anything I can do at this point?

Expansion is a fragile process. All the disks of the original set must be in perfect shape before expansion. Expansion (also called reshape, because the array geometry is changed) requires every sector on every disk to be first read and then written to.

Normally, the expansion process survives a power outage. It certainly survives a normal shutdown or a UPS-initiated shutdown: a smart UPS can tell the NAS that power has been lost, and the NAS then shuts itself down without any human intervention. Sudden power cuts are more of a problem, but the damage, if any, is usually well contained.

However, a drive failure during expansion makes a rebuild tricky. Theoretically, the RAID is still redundant, because the reshape algorithm is designed to maintain redundancy throughout the process. In practice, once a drive fails, accounting for what is where and how to recompute data from the parity suddenly gets complicated. Any further failure results in a half-reshaped array, which is a mess to fix and certainly beyond the abilities of automatic recovery.
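To make the parity point concrete, here is a minimal sketch in Python, assuming a simple single-parity (RAID 5 style) layout in which the parity block is the XOR of the data blocks in a stripe. The block values and the `xor_blocks` helper are made up for illustration; this is not the ReadyNAS implementation, and it ignores the extra bookkeeping a reshape needs to track which stripes already use the new geometry.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, one per data disk (illustrative values only).
data = [b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(data)

# One disk lost: XOR of the surviving data blocks and the parity restores it.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[lost]

# Two disks lost: the survivor and the parity yield only the XOR of the two
# missing blocks, not the blocks themselves, so the stripe is unrecoverable.
mixed = xor_blocks([data[0], parity])
assert mixed == xor_blocks([data[1], data[2]])
```

The recovery step only works if the code knows exactly which blocks belong to the same stripe. In a half-reshaped array that depends on how far the reshape got, which is why a second failure is so hard to untangle.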

What can be done to minimize the chance of failure?

  1. Think twice about whether you need the expansion. The traditional way, used before on-line expansion existed, is to back the data up, verify the backup, destroy the original array, build a new array, and copy the data back. This method still works.
  2. Have a backup before expanding the array. Once you have a backup, there is no requirement to destroy the original array; you can still use the expansion capability, and if something goes wrong, you have the backup.
  3. If the data is not that valuable and the risk of losing it is deemed acceptable, at least check the SMART status of all the array disks and run an extended test on them (if your NAS allows that).
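The SMART check in the last item can be scripted. The sketch below assumes shell access to a Linux-based box with smartmontools installed, which many NAS units do not offer; in that case use the vendor's web interface instead. It parses the attribute table printed by `smartctl -A` and flags three attributes that are commonly treated as early warnings. The choice of attributes, the function names, and the device path are assumptions for illustration, not a vendor recommendation.

```python
import subprocess

# Attributes commonly treated as early warnings of a failing disk (an assumed
# short list; vendors differ in which attributes they report).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found

def check_disk(device):
    """Run `smartctl -A` on a device (hypothetical path such as /dev/sda) and report warnings."""
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return suspicious_attributes(result.stdout)
```

Calling `check_disk("/dev/sda")` for each array member before an expansion returns an empty dictionary when none of the watched counters is nonzero. An extended self-test can be started separately with `smartctl -t long` on the same device; it runs in the background and can take hours on a large disk.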