In this article, we discuss why RAID 1+0 (stripes of mirrors) is better than RAID 0+1 (mirror of stripes) for those who are building a storage array.
What is RAID, and why?
RAID (redundant array of independent disks) describes a system of arrangements to combine a bunch of disks into a single storage volume. The arrangements are codified into RAID levels. Here is a summary:
- RAID 0 (striped) essentially concatenates two or more disks. The stripe refers to the fact that blocks from each of the disks are interleaved. If any disk fails, the whole array fails and suffers data loss.
- RAID 1 (mirrored) puts the same data across all disks. It does not increase storage capacity or write performance, but reads can be distributed to each of the disks, thereby increasing read performance. It also allows continued operation until the last disk fails, which improves reliability against data loss (assuming that disk failures happen independently, see discussion below).
- RAID 2, 3, 4, 5 (parity) employ various parity schemes that let the array keep operating without data loss after one disk failure.
- RAID 6 (parity) uses two parity disks to allow up to two disk failures without data loss.
- RAID-Z (parity) is a parity scheme implemented by ZFS using dynamic stripe width. RAID-Z1 allows one disk to fail without data loss, RAID-Z2 two disks, and RAID-Z3 three disks.
Disk failures may be correlated by many factors (operating temperature, vibration, wear, age, and manufacturing defects shared by the same batch), so they are not fully independent. However, the failures can be somewhat decorrelated if we mix disks from different vendors or models. Each year, Backblaze publishes hard drive failure rates aggregated by vendor and model (e.g. 2023), which is an invaluable resource for deciding which vendor and model to use.
Why use nested RAID?
When a disk fails in a RAID array that is still within its failure tolerance, the failed disk can be replaced with a blank disk, and the data is resilvered onto the new disk. The lower bound on resilver time is the capacity of the disk divided by how fast we can write to it, so a larger disk takes longer to resilver. If disk failures are not fully independent, a longer resilver means a greater risk of data loss, because another disk may fail before resilvering completes.
By way of example, a 24TB disk at a maximum write speed of 280 MB/s will take about a day to resilver (rule of thumb: each 1 TB takes 1 hour). This makes a large disk somewhat unappealing due to the long resilver time. More realistically, we may want to use smaller disks so they can complete resilvering more quickly.
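The arithmetic behind that estimate can be sketched in a few lines of Python. The function name and the list of disk sizes are illustrative assumptions. The calculation assumes decimal units and a sustained sequential write at the quoted maximum speed, so it gives a lower bound rather than a prediction.

```python
def resilver_hours(capacity_tb: float, write_mb_per_s: float) -> float:
    """Lower bound on resilver time in hours for one full-disk sequential write."""
    capacity_bytes = capacity_tb * 1e12       # decimal terabytes, as drive vendors count
    bytes_per_second = write_mb_per_s * 1e6   # decimal megabytes per second
    return capacity_bytes / bytes_per_second / 3600


# Hypothetical disk sizes, all at the 280 MB/s figure used in the text.
for size_tb in (1, 4, 8, 16, 24):
    print(f"{size_tb:>2} TB: {resilver_hours(size_tb, 280):5.1f} h")
```

In practice a resilver on a busy array rarely sustains the maximum write speed, so real times are likely to be longer.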
Building a large array from smaller-capacity disks requires nesting the RAID arrangements. Here are some ways we can arrange the nesting:
- RAID 0+1 (a mirror of stripes) takes stripes of disks (RAID 0) and mirrors them (RAID 1). When a disk fails, it impacts the entire stripe, and the whole stripe needs to be resilvered.
- RAID 1+0 (a stripe of mirrors) takes a bunch of mirrored disks (RAID 1) and creates a stripe (RAID 0) out of the mirrors. When a disk fails, it impacts just the mirror. When the disk is replaced, only the mirror needs to be resilvered.
- RAID 5+0, RAID 6+0, etc. create inner parity RAID sets combined into an outer stripe (RAID 0).
Nestings are not created equal!
Looking qualitatively at the impact of a failure, RAID 1+0 has a smaller impact scope and is therefore better than RAID 0+1. But we can also analyze quantitatively the probability that a second failed disk causes total data loss.
- In RAID 0+1, suppose we have N total disks arranged into a 2-way mirror of two stripes with \( N / 2 \) disks each. The first failure already brought down one of the stripes. The second failed disk would bring down the whole RAID if it happens in the other stripe, which has approximately a 50% chance. Note that we cannot simply swap striped disks between the two sides of the mirror to avert the failure, since each disk's membership belongs to a specific RAID. You might be able to manually hack around that by hex-editing disk metadata, but that is a risky move for the truly desperate.
- In RAID 1+0, suppose we have N total disks arranged into a stripe across \( N / 2 \) 2-way mirrors. The first failure only degrades one of the mirrors. The second failed disk would bring down the whole RAID only if it is the other disk of the same mirror, which has approximately a \( 1 / N \) chance. The number of failed disks we can tolerate without data loss is a variation of the birthday paradox.
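The two estimates above can be checked by brute force. The Python sketch below enumerates every ordered pair of failed disks for a 2-way mirrored layout and counts the pairs that lose data. The function name and the disk numbering are illustrative assumptions, and every disk is assumed equally likely to fail. The exact values come out as \( (N/2) / (N-1) \) for RAID 0+1 and \( 1 / (N-1) \) for RAID 1+0, which approach the 50% and \( 1 / N \) figures as N grows.

```python
from fractions import Fraction
from itertools import permutations


def second_failure_loss(n: int, layout: str) -> Fraction:
    """Exact probability that the second failed disk causes data loss,
    for n disks (n even) with 2-way mirroring."""
    half = n // 2
    lost = total = 0
    for first, second in permutations(range(n), 2):
        total += 1
        if layout == "0+1":
            # Disks 0..half-1 form stripe A, the rest form stripe B.
            # Data is lost once both stripes contain a failed disk.
            lost += (first < half) != (second < half)
        elif layout == "1+0":
            # Disks 2i and 2i+1 form mirror i.
            # Data is lost once both disks of one mirror have failed.
            lost += first // 2 == second // 2
        else:
            raise ValueError(f"unknown layout: {layout}")
    return Fraction(lost, total)


for n in (4, 8, 16):
    print(n, second_failure_loss(n, "0+1"), second_failure_loss(n, "1+0"))
```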
Deep dive into aggregate probability of failure
Another way to analyze failure impact quantitatively is to see what happens if we build a RAID 0+1 or RAID 1+0 array using disks of the same average single disk failure rate. We compute the aggregate probability of failure that would suffer data loss, assuming that failures are independent.
- Suppose the average probability that a single disk fails is p.
- For RAID 0 with N striped disks, the probability that any of the N disks fails is \( P_0^N(p) = 1 - (1 - p)^N \), i.e. the complement of the event that none of the N disks fails.
- For RAID 1 with N mirrored disks, the probability that all N disks fail at the same time is \( P_1^N(p) = p^N \)
- For RAID 0+1, with a k-way mirror of stripes of \( N / k \) disks each, the failure probability is:
\[
\begin{aligned}
P_{0+1}^{k,N/k}(p) & = P_1^{k}(P_0^{N/k}(p)) \\
& = P_1^k( 1 - (1 - p)^{N/k} ) \\
& = ( 1 - (1 - p)^{N/k} )^k
\end{aligned}
\]
- For RAID 1+0, with a stripe across \( N / k \) k-way mirrors, the failure probability is:
\[
\begin{aligned}
P_{1+0}^{N/k,k}(p) & = P_0^{N/k}(P_1^k(p)) \\
& = P_0^{N/k}(p^k) \\
& = 1 - (1 - p^k)^{N/k}
\end{aligned}
\]
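The two closed forms can be evaluated directly. The following is a minimal Python sketch that composes them the same way as the derivation. The function names are illustrative, not from any library, and N is assumed to be divisible by k.

```python
def p_raid0(p: float, n: int) -> float:
    """Stripe of n members: fails if any member fails."""
    return 1 - (1 - p) ** n


def p_raid1(p: float, n: int) -> float:
    """Mirror of n members: fails only if every member fails."""
    return p ** n


def p_raid01(p: float, n: int, k: int) -> float:
    """k-way mirror of stripes with n/k disks each."""
    if n % k:
        raise ValueError("n must be divisible by k")
    return p_raid1(p_raid0(p, n // k), k)


def p_raid10(p: float, n: int, k: int) -> float:
    """Stripe across n/k mirrors with k disks each."""
    if n % k:
        raise ValueError("n must be divisible by k")
    return p_raid0(p_raid1(p, k), n // k)


p = 0.01  # assumed single-disk failure probability, roughly a 1% AFR
for n in (4, 8, 16):
    print(f"N={n:2d} k=2  0+1: {p_raid01(p, n, 2):.6f}  1+0: {p_raid10(p, n, 2):.6f}")
```

Composing the inner level's failure probability as the "disk" probability of the outer level is only valid under the independence assumption stated earlier.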
We can plot \( P_{0+1}(p) \) and \( P_{1+0}(p) \) and compare them against p, the probability of single disk failure representing the AFR (annualized failure rate). Here are some examples of p from Backblaze Drive Stats for 2023 picking the models with the largest drive count from each vendor:
| Manufacturer | Model | Drive Size | Drive Count | AFR |
|---|---|---|---|---|
| HGST | HUH721212ALE604 | 12TB | 13,144 | 0.95% |
| Seagate | ST16000NM001G | 16TB | 27,433 | 0.70% |
| Toshiba | MG07ACA14TA | 14TB | 37,913 | 1.12% |
| WDC | WUH721816ALE6L4 | 16TB | 21,607 | 0.30% |
In most cases, we are looking at p < 0.01. Some of the models have AFR > 10% but it is hard to tell how accurate the number is due to the small drive count for that model. Those models with a small average age can bias towards high early mortality due to the Bathtub Curve.
We generate the plots in gnuplot using the following commands:
gnuplot> N = 4
gnuplot> k = 2
gnuplot> set title sprintf("N = %g, k = %g", N, k)
gnuplot> plot [0:0.02] x title "p", (1 - (1-x)**(N/k))**k title "RAID 0+1", 1 - (1 - x**k) ** (N/k) title "RAID 1+0"
Number of disks N = 4 with mirror size k = 2 is the smallest possible nested RAID 0+1 or RAID 1+0. At p < 0.01, both RAID 0+1 and RAID 1+0 offer significant improvement over p.
At N = 8 while keeping the same mirror size k = 2, we see that both RAID 0+1 and RAID 1+0 still offer improvements over p. However, compared with N = 4, the RAID 1+0 failure rate doubles, and the RAID 0+1 failure rate nearly quadruples.
At N = 16, the trend of multiplying failure rate continues. Note that RAID 0+1 can now be less reliable than a single disk toward the upper end of the plotted range (near p = 0.02), while RAID 1+0 still offers about a 6x improvement over p there.
To get the failure rate under control, we need to increase the mirror size to k = 3. Even so, RAID 1+0 failure rate (very close to 0) is still orders of magnitude lower than RAID 0+1.
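A first-order approximation for small p helps explain the scaling seen in these plots. This is a sketch added for illustration, using \( (1 - p)^m \approx 1 - m p \), and it is only reasonable while \( N p / k \) is well below 1:

\[
\begin{aligned}
P_{0+1}^{k,N/k}(p) & = ( 1 - (1 - p)^{N/k} )^k \approx \left( \frac{N p}{k} \right)^k \\
P_{1+0}^{N/k,k}(p) & = 1 - (1 - p^k)^{N/k} \approx \frac{N}{k} \, p^k \\
\frac{P_{0+1}^{k,N/k}(p)}{P_{1+0}^{N/k,k}(p)} & \approx \left( \frac{N}{k} \right)^{k-1}
\end{aligned}
\]

With k = 2, doubling N doubles the RAID 1+0 estimate but quadruples the RAID 0+1 estimate, and at N = 16 RAID 0+1 is roughly 8 times more likely to fail than RAID 1+0. The RAID 0+1 estimate \( (N p / 2)^2 \) exceeds p once \( p > 4 / N^2 \), which is about 0.016 for N = 16.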
So from the quantitative point of view, RAID 1+0 is much less likely to suffer a total data loss than RAID 0+1.
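As a sanity check on the closed forms, the same quantities can be estimated with a small Monte Carlo simulation. The Python sketch below is illustrative only: the function name, disk numbering, seed, and the deliberately large p = 0.1 (chosen so that losses are frequent enough to measure) are all assumptions, and each disk is taken to fail independently.

```python
import random


def simulate(p: float, n: int, k: int, trials: int, seed: int = 0):
    """Estimate data-loss probability for RAID 0+1 and RAID 1+0 with
    n disks and k-way mirroring, assuming independent disk failures."""
    rng = random.Random(seed)
    width = n // k  # disks per stripe (0+1), or mirrors in the stripe (1+0)
    lost01 = lost10 = 0
    for _ in range(trials):
        failed = [rng.random() < p for _ in range(n)]
        # RAID 0+1: stripe j holds disks j*width .. (j+1)*width - 1.
        # Data is lost when every stripe contains at least one failed disk.
        if all(any(failed[j * width:(j + 1) * width]) for j in range(k)):
            lost01 += 1
        # RAID 1+0: mirror i holds disks i*k .. (i+1)*k - 1.
        # Data is lost when some mirror has all of its disks failed.
        if any(all(failed[i * k:(i + 1) * k]) for i in range(width)):
            lost10 += 1
    return lost01 / trials, lost10 / trials


p, n, k = 0.1, 8, 2
sim01, sim10 = simulate(p, n, k, trials=100_000)
exact01 = (1 - (1 - p) ** (n // k)) ** k
exact10 = 1 - (1 - p ** k) ** (n // k)
print(f"RAID 0+1: simulated {sim01:.4f}, closed form {exact01:.4f}")
print(f"RAID 1+0: simulated {sim10:.4f}, closed form {exact10:.4f}")
```

Both layouts are evaluated against the same simulated failure pattern in each trial, so the comparison is not affected by sampling differences between the two.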
Conclusion
As a rule of thumb, when nesting RAID levels, we want the striping to happen at the outermost layer because striping is the arrangement that accumulates failure rates. When we mirror at the inner layer, we reduce the failure rates by orders of magnitude, so the striping accumulates failure rates much more slowly.
This conclusion also applies to storage-pool-aware filesystems like ZFS. Traditionally, a RAID-Z set could not be grown by adding disks to it. The best practice is to always add disks as mirrored pairs. You can also add a new disk to mirror an existing disk in a pool. Stripes of mirrors offer the most flexibility to expand the storage pool in the future.
This is why RAID 1+0 is better.