2014-01-05

RAID Arrays and Disk Health

Consumer RAID arrays are scary. Most consumer drives have a specified URE (Unrecoverable Read Error) rate, normally quoted by the manufacturer as 1 unrecoverable error per 10^14 bits read. 10^14 bits works out to roughly 12.5TB, so for every ~12.5TB read there will likely be a time when you hit an unrecoverable error.
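As a quick sanity check, here is a minimal Python sketch of that conversion (the 1-in-10^14-bits figure is the assumed spec-sheet value, not a measured one):

# Convert a 1-in-10^14-bit URE spec into the expected amount of data
# read before hitting one unrecoverable error.
URE_RATE = 1e-14                      # errors per bit read (assumed spec value)

bits_per_error = 1 / URE_RATE         # 10^14 bits between errors on average
terabytes_per_error = bits_per_error / 8 / 1e12

print(f"Expected data read per URE: {terabytes_per_error:.1f} TB")   # ~12.5 TB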

The basic premise of a URE is that during a read, the data in a sector cannot be recovered. The drive will retry the read multiple times (even more so on consumer drives without TLER, Time-Limited Error Recovery) to try to get the data back from the sector. If the data still cannot be read, the RAID controller may reconstruct it from redundant data elsewhere on the array, and if there is no redundancy left it may fail the rebuild.

This can be scary business when you need to read a large amount of data off hard disks in order to rebuild an array. Let's say we just have a RAID1 with 2x4TB drives. If one drive fails for whatever reason, we have to put a new drive in its place in order to recover the array. To rebuild the data from the good disk onto the replacement we must re-read the entire good disk's contents (if the disk is full), meaning there is a real probability we will hit an unrecoverable error and be unable to rebuild properly. Using the simple linear estimate above, the chance of that happening while reading a full 4TB drive is roughly 4/12.5 = ~32%.
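Here is a minimal sketch of that estimate in Python (assuming a full 4TB drive and the 10^-14 URE rate; real-world numbers depend heavily on the drive and workload):

# Probability of hitting at least one URE while re-reading a full 4TB
# drive during a RAID1 rebuild.
URE_RATE = 1e-14                  # errors per bit read (assumed spec value)
DRIVE_BYTES = 4e12                # assumed 4TB drive, completely full
bits_read = DRIVE_BYTES * 8

# Simple linear estimate used in the text above.
linear_estimate = bits_read * URE_RATE

# Slightly more careful estimate: chance of at least one failed bit read.
at_least_one = 1 - (1 - URE_RATE) ** bits_read

print(f"linear estimate:  {linear_estimate:.0%}")   # ~32%
print(f"at least one URE: {at_least_one:.0%}")      # ~27%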

Hitting a URE during a rebuild can cause various issues, and in some cases may lead to catastrophic array failure. From what I have read this is more likely when rebuilding parity arrays (RAID5/6) than mirrors (RAID1) (I am not an expert, so take this with a grain of salt and do more research). This is due to RAID5 and 6 having to read every surviving disk and recalculate parity during a rebuild; if those parity calculations fail, there is a chance that the underlying array will fail.
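To illustrate why the parity case looks worse, here is a rough comparison (assuming 4TB drives, a 4-disk RAID5, full disks, and the same 10^-14 URE rate; it ignores correlated failures and controller behaviour, so treat it as a back-of-the-envelope sketch):

# Chance of at least one URE during a rebuild: a RAID1 rebuild re-reads
# one surviving disk, while a 4-disk RAID5 rebuild re-reads three.
URE_RATE = 1e-14                  # errors per bit read (assumed)
DRIVE_BYTES = 4e12                # assumed 4TB drives, full

def p_ure(bytes_read):
    """Probability of hitting at least one URE over bytes_read."""
    return 1 - (1 - URE_RATE) ** (bytes_read * 8)

print(f"RAID1 rebuild (read 1 disk):  {p_ure(DRIVE_BYTES):.0%}")      # ~27%
print(f"RAID5 rebuild (read 3 disks): {p_ure(3 * DRIVE_BYTES):.0%}")  # ~62%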

I have heard that during normal operation some RAID arrays repair UREs transparently using redundant data, so you won't get bit rot. Even so, it is recommended to monitor disks with S.M.A.R.T. data and keep a lookout for reallocated sectors, which can indicate a potentially failing drive. Using smartd (the smartmontools daemon) to schedule these checks is always a good idea too.
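As a rough illustration, here is a small Python sketch that shells out to smartctl (from smartmontools) and warns when the reallocated sector count is non-zero; the device path is an assumption, and the script needs enough privileges to read SMART data:

import subprocess

DEVICE = "/dev/sda"   # assumed device path; change for your system

def reallocated_sectors(device):
    """Return the raw Reallocated_Sector_Ct value reported by smartctl -A."""
    output = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=True
    ).stdout
    for line in output.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])   # raw value is the last column
    return None

count = reallocated_sectors(DEVICE)
if count is None:
    print(f"{DEVICE}: attribute not reported")
elif count > 0:
    print(f"{DEVICE}: WARNING, {count} reallocated sectors")
else:
    print(f"{DEVICE}: no reallocated sectors")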

So basically the overarching idea here is that it becomes less safe to implement overly complex arrays built from large disks. As with anything, reliability improves over time, bugs get fixed, and we eventually become able to sustain larger and more complex setups. But as a rule of thumb with RAID arrays, keeping it simple will likely prevent data loss and make managing your data a lot easier.

Drive Error Rate - http://arxiv.org/ftp/cs/papers/0701/0701166.pdf
Rebuilding RAID User Perspective - http://www.reddit.com/r/sysadmin/comments/1ue3yv/suggestions_for_upgrading_nas_drives/
Using SMART to predict disk failures - http://www.linuxjournal.com/article/6983
Case for not using RAID 5 - http://www.standalone-sysadmin.com/blog/2012/08/i-come-not-to-praise-raid-5/
SSD Error Rates - http://www.storagesearch.com/sandforce-art1.html 
