Fixing an array with mdadm goes wrong

Now this is complex.

I have an mdadm-created RAID5 array consisting of 4 discs. One of the discs was dropping out, so I decided to replace it. Somehow, this went terribly wrong and I foolishly succeeded in marking two of the (wrong) drives as faulty, and then re-adding them as spare.

Now the array is (logically) no longer able to start:

mdadm: Not enough devices to start the array.

Degraded and can’t create RAID,auto stop RAID [md1]

As I don’t want to ruin the maybe small chance I have left to rescue my data…

This sure is complicated. Obviously, if you fail two array members, RAID5 goes down. Worse yet, once this happens, it stays down. You can’t tell it to accept the spares back in a normal way. Theoretically, some more fiddling with mdadm can force the array back into shape, but I doubt it is safe in a DIY environment. If your unit is still under warranty (this particular case was with Thecus), then by all means open a ticket and ask them to fix the issue – they are pretty good with mdadm. If the case is beyond Linux repair, fall back on our Home NAS Recovery – we are pretty good too.
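The arithmetic behind this is plain XOR. A minimal Python sketch (toy 4-byte chunks and made-up data, not real mdadm structures) shows why one missing RAID5 member is recoverable and two are not:

```python
# Toy RAID5 stripe: three data chunks plus one XOR parity chunk.

def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Three data chunks and their parity, as RAID5 would store them.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_bytes([d0, d1, d2])

# One member lost: XOR of the three survivors recreates the missing chunk.
recovered = xor_bytes([d0, d2, parity])
assert recovered == d1

# Two members lost: one parity equation, two unknowns -- nothing can be
# recomputed, which is exactly the state of the failed array above.
```

With two chunks gone, the single parity equation has two unknowns, so no amount of normal assembly can bring the data back.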

Dead disk during expansion

Failed expansion + dead disk = lost data

A few weeks ago I replaced one of the disks in my ReadyNAS NVX with a larger one. The expansion process seemed to complete successfully… This morning one of my disks went bad…maybe due to a brief power outage…The NAS appeared to just be in a weird state, so I powered it off cleanly, restarted it, and told it to do another scan.

Now the NAS is telling me that disk #1… is “spare”, not part of the RAID array. It says disk #3 — the one that appeared to fail this morning — is just gone. Since there are two “failed” disks, the array is “dead” and my data is gone.

Is there anything I can do at this point?

Expansion is a fragile process. All the disks of the original set must be in perfect shape before expansion. Expansion (also called reshape, because the array geometry is changed) requires every sector on every disk to be first read and then written to.

Normally, the expansion process survives a power outage. It certainly survives a normal shutdown or a UPS-initiated shutdown. A smart UPS can tell the NAS that power is lost, and the NAS then shuts itself down without any human intervention. This is not a problem at all. Sudden power cuts are more of a problem, but the damage, if any, is usually well contained.

However, a drive failure during expansion makes a rebuild tricky. Theoretically, the RAID is still redundant, because the reshape algorithm is designed to maintain redundancy throughout the process. In practice, once the drive fails, accounting for what is where and how to recompute data from the parity suddenly gets complicated. Any further failure results in a half-reshaped array which is a mess to fix and certainly beyond the abilities of automatic recovery.
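To see why a reshape touches everything, consider a toy chunk map (plain striping; the parity rotation that real RAID5 adds is omitted for clarity). Changing the member count relocates almost every chunk, so a failure midway leaves the array split between two geometries:

```python
# Toy layout: which (disk, stripe) a logical chunk lands on in an
# n-member stripe set. Real RAID5 also rotates parity; omitted here.
def chunk_location(chunk, n_disks):
    return (chunk % n_disks, chunk // n_disks)

before = [chunk_location(c, 3) for c in range(12)]  # old 3-disk geometry
after = [chunk_location(c, 4) for c in range(12)]   # new 4-disk geometry

# Count how many of the 12 chunks end up in a different place:
moved = sum(1 for b, a in zip(before, after) if b != a)
print(moved, "of 12 chunks relocate")  # 9 of 12 chunks relocate
```

A drive failure at, say, the halfway point means half the chunks follow the old map and half the new one, which is the "half-reshaped mess" described above.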

What can be done to minimize the chance of failure?

  1. Think twice about whether you need the expansion. The traditional way, used before on-line expansion existed, is to back the data up, verify the backup, destroy the original array, build a new array, and copy the data back. This method still works.
  2. Have a backup before expanding the array. Once you have a backup, there is no need to destroy the original array; you still have the expansion capability, and if something goes wrong, you have the backup.
  3. If the data is not that valuable and a risk of losing it is deemed acceptable, make sure you check SMART status on all the array disks and do an extended test of the disks (if your NAS allows that).

Maxtor NAS

Maxtor was long ago acquired by Seagate, but you can still come by their NAS. This is one of the older cases:

I have a 1TB Maxtor drive attached to my network. It has recently failed and I can no longer access it. However, it is still shown on the network list in Windows Explorer, …. I can ping the drive and get a return. … important that I can get my files back. Does anyone have any ideas on some software that might be able to access it?

Sure, I have. Home NAS Recovery works with Maxtors too. In the end, the unit is the equivalent of a modern Seagate NAS, albeit with only one disk.

It may be interesting to note that the poster refers to the NAS as a 1TB drive. This invites confusion with a regular external drive, but an external drive is attached to the PC, not to the network, and you cannot ping an external drive.

Single disk LaCie Cloudbox

How does one recover it in case of failure?

…I own a Lacie Cloudbox [which] just stopped working suddenly. Doesn’t seem to be a physical disc problem, more like file system … It uses RAID (single disc) … If there’s someone here that would be into walking me through the steps involved to mount this drive in Mac, Windows, or Linux, that would be amazing.

The first thing to note is that a single-disk unit, with no provision to install a second drive, does not need RAID. Despite that, most NAS vendors use the same firmware across the entire product lineup. This has the side effect of making single-disk models unnecessarily complex: there will still be multiple partitions, and instead of a simple data partition, an md-raid JBOD will be used.

Now, let’s move on to the actual problem at hand. If a disk fails in a single-disk unit, it is a job for a skilled technician, no way around it. If it is a filesystem issue, this is a job for recovery software (like our www.nas-recovery.software).

Theoretically, one may want to try to access the data with Linux, but that is not likely to have any effect. The NAS uses Linux internally; if Linux were able to read the data, there would be no need for recovery. Recovery is required precisely because Linux can no longer access the filesystem. While with a failed RAID some clever jiggling with mdadm parameters can (and often does) solve the problem, a filesystem offers far fewer parameters to fiddle with. A single-disk unit cannot have a RAID-level problem, because there is no redundancy to lose, so we go straight to the filesystem level.

Iomega StorCenter ix2-200 device fails when powered

This is about an Iomega/Lenovo device, the Iomega StorCenter ix2-200:

When I power on the ix2 the ‘!’ light blinks red. I can’t connect to it…, under ‘dashboard’ everything seems as usual except the pie chart showing space usage is not there. … when I click on Users or Shared Storage it says Disks Not Ready. The selected function is not available due to the state of the disks.

… what to do from here?

Although it is not specified anywhere in the post, further discussion suggests there are two disks in the NAS. Another crucial bit of information missing is the RAID level. There are three possibilities with two disks:

  • RAID1, when two disks are identical;
  • JBOD, when the data is first stored on one disk, then once the first disk fills up, the second one is used;
  • RAID0, when the data is interleaved between two disks.
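The three layouts are easy to tell apart on paper. A toy Python sketch (4-byte chunks, made-up data) of how the same 16 bytes land on two disks:

```python
# Toy illustration of the three possible two-disk layouts.
data = b"AAAABBBBCCCCDDDD"
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

raid1 = (data, data)                                      # each disk: full copy
jbod = (b"".join(chunks[:2]), b"".join(chunks[2:]))       # fill disk 1, then disk 2
raid0 = (b"".join(chunks[0::2]), b"".join(chunks[1::2]))  # alternate chunks

print(raid0)  # (b'AAAACCCC', b'BBBBDDDD')
```

The recovery consequences differ sharply: RAID1 survives one dead disk outright, JBOD loses only the files on the dead disk, and RAID0 loses every other chunk of everything.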

A red exclamation sign on a StorCenter indicates either a non-recoverable disk failure or some kind of severe logical failure.

The first thing to try, as rightly suggested in follow-ups to the original post, is booting with just one disk. This works if the array is RAID1. Both disks must be tried in turn because, even with RAID1, there is an even chance of leaving the bad disk in on the first attempt.

If the array is RAID0 or JBOD, and the drive has indeed failed, the drive must be repaired first. If there is no mechanical problem, but rather some logical issue, we can help you with RAID0, but not with a JBOD.

 

Lenovo PX6-300D

This describes the behavior of the PX6, a 6-bay Lenovo NAS, under multiple disk failures:

I have px6-300D nas with 3TB X 6 drives. I configured it with Raid 5. Few Days back it was showing a message The amount of free space on your ‘Shares’ volume is below 5% of capacity. and asked to overwrite Drive 6…Then i contacted customer care they told that your few drives (3 or 4) has failed. … and go with some data recovery solution provide… If its NAS with raid protection my data must be protected. I really need my data back.

RAID protection is great, but it has its limits. It does not protect against anything other than disk failure, and RAID5 only protects against a single disk failure. If multiple disks fail, down it goes.

Reconstructing restarts: at 45%, it starts over from 0.

That is what it looks like when implemented by Lenovo. Other vendors will have different indications, but the end result is the same: the array cannot be rebuilt. Short of packing the disks off to a data recovery service, what else can be done?

  1. The cheapest option is to remove all the disks from the NAS, clone them onto a set of new disks of the same capacity, and put the clones back. The NAS will hopefully pick up the copies and complete the rebuild successfully.
  2. If the rebuild does not go through, our Home NAS Recovery software can in all likelihood do the job.

 

Is it possible to pull each disk?

This case brings up the same age-old question: will it work to pull the drives and read them one at a time?

I recieved an email from the NAS that drive 1 had failed. Then I saw the following message:

4 new drives with existing data have been added to your Iomega StorCenter device. The Iomega StorCenter device failed and some data loss may have occurred.

And on the LCD screen there is a prompt for permission to overwrite each disk which I have not done because I need to preserve my data…I’ve tried everything that I know how to do. Does anyone have any receommendations? Is it possible to pull each drive and and capture the data from each by connecting it to another pc?

Well, talk about some data loss.

I will make my usual recommendation of the Home NAS Recovery software. But there is one caveat – Home NAS Recovery requires all disks to be connected at the same time. You cannot capture data from each disk separately; you need to capture data from all the disks together.

Basic Synology case

This is the simplest possible example of recovery, the unit being an unidentified Synology.

Hard drive in my synology nas crashed yesterday. The disk was installed as a basic disc without protection. … storage manager … showed the disk as “not initialized”. I took out the disk and connected it to my windows computer to try recovering the files. I used “Ext2 volume manager” to see the hdd and it shows me 3 partitions … EXT3, SWAP and RAW. On the EXT3 there are some files but not the one I had saved on the hard drive.

How do I find the files that I had saved on the hdd?

And how do I know that my hard drive really is broken? The synology storage manager is not able to finish S.M.A.R.T Test.

As I do fairly often, I will address the last question first. If the SMART test cannot be completed, the drive is broken.

As far as partitions go, the first one is most often firmware. SWAP is just what it says on the tin, and RAW is either data (probably maintained by md-raid, hence not recognized) or something broken. In any case, the first EXT3 partition is useless for recovery; you would only get back some Linux binaries, not your data. On the bright side, Home NAS Recovery can see through the md-raid structures, identify the partitions on its own, and read EXT quite well.
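One way to confirm that a RAW partition is actually an md-raid member is to look for the md superblock. For the version-1.2 on-disk format, the superblock sits 4 KiB into the member and starts with the magic value 0xa92b4efc (little-endian); a minimal check against a disk image might look like this:

```python
import struct

MD_MAGIC = 0xa92b4efc  # md-raid superblock magic value

def looks_like_md_v1_2(image: bytes) -> bool:
    """True if the image carries a version-1.2 md superblock at offset 4 KiB."""
    if len(image) < 4096 + 4:
        return False
    # The superblock begins with a little-endian 32-bit magic number.
    (magic,) = struct.unpack_from("<I", image, 4096)
    return magic == MD_MAGIC
```

In practice you would read the first few KiB of the partition on a Linux box and feed them to this check; a partition that Windows reports as RAW can still pass it, which tells you the data is wrapped in md-raid rather than destroyed.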

Moving disks between NASes

There is a frequent question I’d like to address: is it safe to move disk packs between identical NASes?

The full version may well go like this:

Last night my Ready NAS (RND4000) stopped working. It looks like a fault in the power supply…I would like to recover some…files.

One option I see is a friend of mine who owns the same system. Is it an option to shut down his NAS, remove/replace the installed disks with the disks from my NAS and copy the data? Or are any obstacles in the way?

All in all, it should work. Possible obstacles include

  1. Different NAS models. Obviously, the NASes must be of the same vendor, same product line, and in most cases models must be fully identical.
  2. Different firmware. Preferably, both NASes should run the same firmware version. However, once a unit fails, there is no way to determine exactly which firmware it was running at the time of the crash. In this situation, the recipient unit had better be patched to the latest firmware, which will typically accept disk packs created by older versions.
  3. Damage to the disks, either physical or logical. When the power supply blows, it may take the NAS with it; also the disks may or may not be damaged by electrical transients. If this happens, the replacement NAS is not going to work, obviously.

If you think none of this applies to your situation, you may give it a go.

Know your RAID level

One ReadyNAS owner seems to be confused about what RAID level is (and was) used on his ReadyNAS Duo (full story).

I have a readyNas Duo with 2x 1Tb disks in raid. … I had to reset the NAS, unfortunally i holded the reset button to long so the disks are wiped clean….

After the reset i upgraded to the latest firmware and let the disk sync.

… is my data lost, or are there way’s to recover the data?

i tried to recover my data with [RecoverMyFiles] …without success…I checked only one disk since i wanted to have the other one as an backup to be examined by a company … [and they] … think my disks were not in mirror but in striping mode…I can’t check this ofcorse but i never saw more totall space then the size of one disk.

This looks real bad. There are two things we know for sure:

  1. this is ReadyNAS Duo, and
  2. there are two physical drives.

and that’s all. There are four conflicting bits in the above quote relevant to the RAID level.

  1. [did] let the disk sync suggests RAID1. RAID0 does not need any kind of sync.
  2. wanted to have the other one as an backup indicates owner’s belief that the array is RAID1
  3. company … thinks … striping mode, that’s pretty straightforward.
  4. never saw more totall space than the size of one disk, which again points to RAID1.

Problem is, it is fairly easy to recover data from either a RAID0 or a RAID1. Home NAS Recovery, as one obvious example, can work either way, and does not even need to know the RAID level beforehand. However, if the initial array was RAID0, and after the reset the NAS switched to RAID1 mode and copied the contents of one disk onto the other, there is nothing left to recover.
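When the level is in doubt and you have full images of both disks, a quick (and rough) test is to compare samples from both: a RAID1 pair matches almost everywhere in the data area, a RAID0 pair does not. A hedged Python sketch, ignoring the metadata zones that differ even on healthy mirrors:

```python
# Rough heuristic: sample both images at regular offsets; if every sample
# matches, the pair was most likely mirrored (RAID1) rather than striped.
def probably_mirrored(img_a, img_b, sample_size=64, samples=8):
    span = min(len(img_a), len(img_b))
    step = max(1, span // samples)
    for off in range(0, span - sample_size, step):
        if img_a[off:off + sample_size] != img_b[off:off + sample_size]:
            return False
    return True
```

On real members this should be run against the data regions only; md metadata, event counters, and free space differ even on a working mirror, so treat it as a hint, not proof.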