-
David, excellent piece. Now I understand what they were saying. I only got parts of it during the discussion Tuesday -- it was easy to get lost in the complexities.
Posted By:Bert Latamore| Fri Jul 02, 2010 10:17
-
Interesting, but this just strikes me as the equivalent of a software shim: pushing the "dirty little secret" into a black box. From what I can see, there appear to be no mechanisms for monitoring the unit, and the impact of masked internal failures on things like performance isn't clear to me. Isn't this just a clusterable mini array in all but name?
Posted By:Alex| Tue Jul 06, 2010 05:51
-
Well, the truth is that Seagate sought to "move up the stack" and capture higher margins with the "brick" module, and storage vendors didn't buy it. It would also have eliminated an area that some of them were already innovating in - using cheap commodity hard drives with better "controller" (or filesystem) technology above them.
Also, the advent of flash drives meant that customers needed to alter the mix of flash and spinning rust to better meet the application load, and a hardwired brick configuration just didn't offer the breadth of QoS needed.
Posted By:Mark Carlson| Tue Jul 06, 2010 08:20
-
Continuing Mark's thoughts (disclosure: I am an EMC employee):
Not *ALL* vendors' products suffer from the "dirty secrets" that Lary and Sicola assert plague the industry. In fact, many array vendors have addressed the undeniable reality that Disk Drives Fail within their architectures - some perhaps even better than the Lary_Sicola "brick".
For example:
* Symmetrix DMX and VMAX both mount their drives horizontally, not vertically (although I'm not necessarily convinced it makes that much of a difference);
* All of EMC's drives use specially designed drive carriers and Drive Array Enclosures (DAEs - roughly analogous to the L_S "brick") that have been specifically designed to eliminate the effects of Rotational Vibration Interference (RVI) - this works so well that DMX & VMAX allow for any mixture of Flash, Fibre or SATA drives to be installed within a DAE (the L_S "brick" does not allow for drive type intermix within a "brick");
* Symmetrix builds its RAID sets across separate back-end channels AND across multiple DAEs in part to eliminate the DA/Channels as a potential Single Point of Failure (SPOF) - (another reason for this is improved performance by distributing I/O across many channels)
* Heck, we even try to ensure that drives from the same supplier lot are randomly distributed in an array, just to avoid generational or "Monday blues" failures (that's a term from the auto manufacturing industry, for those of you too young to remember when ALL cars in the US were manufactured in Detroit :-)
* Symmetrix has multiple strategies for identifying drive failures BEFORE they occur - and it does so intentionally and with purpose. From decades of experience, we recognize error patterns and behaviours that are known precursors of failure, and we take proactive action to mitigate the impact of a drive failure. This can range from simply resetting/rebooting the drive to failing it and hot-sparing it well before it ceases to work (a rough sketch of such a policy follows this list).
* Whether taken out of service by the heuristics of the array, or due to an outright failure, EMC replaces the drives under warranty at no cost to the customer - just as we do for ANY component in the array (standard array warranty is 3 years).
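A rough sketch of what such a predictive-sparing policy might look like (the error categories and thresholds below are illustrative assumptions, not EMC's actual heuristics):

```python
# Illustrative proactive drive-sparing policy; categories and thresholds
# are hypothetical, not EMC's actual heuristics.
from dataclasses import dataclass

@dataclass
class DriveHealth:
    drive_id: str
    media_errors: int = 0        # recovered read/write errors
    timeouts: int = 0            # commands that needed a retry/reset
    reallocated_sectors: int = 0

def evaluate(drive: DriveHealth) -> str:
    """Return an action based on known failure precursors."""
    # Strong precursors of outright failure: spare proactively,
    # well before the drive ceases to work.
    if drive.reallocated_sectors > 50 or drive.media_errors > 200:
        return "fail-and-hot-spare"   # copy data to a spare, retire the drive
    # Transient misbehaviour: try the cheap fix first.
    if drive.timeouts > 5:
        return "reset-and-monitor"    # reset/reboot the drive, watch closely
    return "healthy"

print(evaluate(DriveHealth("d42", media_errors=350)))  # fail-and-hot-spare
```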
The warranty point is significant: Let's be 100% clear here - while I cannot speak for every storage vendor, there is ZERO incentive for EMC or its suppliers to provide products that fail...quite the contrary. To assert otherwise, as Lary and Sicola apparently have done, is truly disingenuous.
For reference, many of the enterprise-class disk drives EMC uses in its Symmetrix and CLARiiON arrays routinely demonstrate 1% AFR or better. The numbers of 2-13% seem more appropriate to the low-cost desktop PATA/SATA/SAS drives that Google routinely uses for its low-cost storage farms. Simply put, the drives (and drive infrastructure) that EMC uses aren't the same ones that start reporting I/O errors when you yell at them (ref. the Thumper demo on YouTube :-)
And the referenced CERN paper highlighted a different issue - the fact that there exists the possibility of silent data corruption (undetected bit rot) on disk drives. To mitigate this risk, both CLARiiON and VMAX generate a Data Integrity Field (DIF) for every block as it arrives in the array and store these fields out on the drives themselves. These extra bytes are used to validate the integrity of every read - not only that the data hasn't changed, but also that the data returned is indeed from the requested physical LBA (another common error on many spinning-rust disk drives). And if there is an error, the data can be recovered from the other drives (in separate DAEs), and the now-suspect drive can be analyzed/monitored for potential replacement.
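For the curious, the standard ANSI T10 DIF appends 8 bytes to each 512-byte block; a minimal sketch of that published layout follows (the vendor-specific on-disk details above are not claimed here, only the T10 format):

```python
# Standard 8-byte T10 DIF appended to each 512-byte block:
#   bytes 0-1: guard tag (CRC-16 over the data)
#   bytes 2-3: application tag
#   bytes 4-7: reference tag (typically the low 32 bits of the target LBA)
import struct

def crc16_t10dif(data: bytes) -> int:
    """CRC-16/T10-DIF: polynomial 0x8BB7, init 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def make_dif(block: bytes, lba: int, app_tag: int = 0) -> bytes:
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, lba & 0xFFFFFFFF)

def verify(block: bytes, lba: int, dif: bytes) -> bool:
    guard, _app, ref = struct.unpack(">HHI", dif)
    # A guard mismatch means corrupted data; a ref-tag mismatch means
    # good-looking data came back from the wrong physical LBA.
    return guard == crc16_t10dif(block) and ref == (lba & 0xFFFFFFFF)

blk = b"\x00" * 512
dif = make_dif(blk, lba=1234)
assert verify(blk, 1234, dif) and not verify(blk, 1235, dif)
```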
I also contest the article's asserted "issues" with SSDs - in fact, I think it demonstrates an incomplete understanding of the technology. To assert that SSDs will suffer similar failure rates overlooks the fact that they are much more similar to SDRAM than they are to spinning rust - and I've not seen anyone recently harping on the failure rates of DRAM (which goes totally unmirrored in the VAST majority of applications today - including the computer you probably use to pay your bills. The probability of an undetected bit-flip in that computer might frighten you back to handwritten checks, but that's for another day :). Virtually all viable SSD designs can tolerate the loss of an entire chip without losing or corrupting data - many (like the STEC drive EMC uses) can actually suffer multiple such failures. More importantly, these drives inform the controller of the degradation (through SMART, etc.), and in turn the array can take proactive preventative measures (as described above).
Finally, the probability math used "as an example" is accurate only if you assume that no attempt is made to rebuild the first failing drive. It thus predicts the probability that two drives will fail within a year, but NOT the probability that any data will be lost as a result. If the drives are appropriately protected and rebuilt in a timely manner, the probability of DATA LOSS remains infinitesimal in your example.
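A back-of-envelope sketch of that distinction (the 3% AFR and 24-hour rebuild window are illustrative assumptions):

```python
# Two-failures-in-a-year vs. data loss with timely rebuilds (mirrored pair).
# The AFR and rebuild window are illustrative assumptions.
afr = 0.03            # 3% annual failure rate per drive
rebuild_days = 1.0    # time to rebuild onto a hot spare

# The article's math: both drives of the pair fail sometime in the year.
p_both_in_year = afr ** 2                                  # 9.0e-04

# With rebuilds, data is lost only if the second drive fails *during*
# the rebuild window of the first (either drive can fail first).
p_data_loss = 2 * afr * (afr * rebuild_days / 365.0)       # ~4.9e-06

print(f"{p_both_in_year:.1e} vs {p_data_loss:.1e}")        # ~180x smaller
```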
So, all in all, the unveiling of this "dirty little secret" by Messrs Lary and Sicola could be considered self-serving, as they seek to promote their ISE as *the only* answer. But the secret isn't that "Drives Fail," or even that "Drives Fail More Often Than Drive Vendors Say."
The real secret is that not all storage vendors have invested in addressing these realities - and it should be no surprise that EMC has been addressing the realities of fallibility for more than two decades.
And hats off to Lary and Sicola and the product they are selling, but the Wikibon community should know that they aren't the only solution, much less the most complete and robust solution.
Posted By:the storage anarchist| Tue Jul 06, 2010 04:51
-
“The lady doth protest too much, methinks.” ''Hamlet'', Act 3, Scene 2, 220
Barry, I agree with almost everything you say, and still the premise holds. Let's start with the following statements:
*The unit of replacement within an array is a disk drive.
*The AFR for any mechanical device such as a disk drive is high.
*The recovery time for a disk failure is long and getting longer as disk drives become denser.
*The probability of two drives failing is a function of the square of the AFR, and the probability of three drives failing is a function of the cube of the AFR.
*The industry standard for array warranty is three years (and SSDs are no different)
*The average life of an array is 5 years
*Human comprehension and decision making around high-loss/low-probability events is extremely poor, both for IT professionals and vendor professionals.
*Change happens as the result of catastrophes
*Changing the unit of replacement from a drive to a brick with multiple disk drives is one way to address the issue and changes the probability of failure by at least an order of magnitude
*SSDs do not make any significant impact on the probabilities of catastrophic loss, because the vast majority of data will be held on SATA drives for the foreseeable future.
EMC has an excellent reputation for quality, and is doing a fine job in its high-end products to reduce the probabilities of failure of a single drive. But EMC’s ability to address the fundamentals of why disk drives fail is limited, as is its ability to “bend” fundamental technology curves. I also believe that many smart people in EMC understand the problem and are hard at work with solutions (there are many ways to skin a cat). I look forward to your Gladstonian inebriated exuberance when EMC announces it.
My bottom line premise remains supported by the statements above – the risk of catastrophic failure is increasing rapidly, and disk professionals in IT shops and at vendors are not facing up to that risk. You yourself are being disingenuous in declaring "the probability of DATA LOSS remains infinitesimal" – that statement is patently misleading.
Thanks for your contribution to this debate, Barry - it is a very important one.
Posted By:David Floyer| Wed Jul 07, 2010 12:41
-
David -
While I appreciate the value of debate, I contest the implied position of this paper that ONLY the "brick" addresses the issues raised. My intent is to demonstrate that this premise is patently misleading (to use your word).
More importantly, the accusation from Lary and Sicola that vendors chose not to purchase their invention from Seagate is an absolute mistruth. I was in the room when EMC explained our reasons to Seagate, and I can assure you that for at least one vendor the maintenance revenue had NOTHING to do with that decision. However, I will adhere to the Buyer/Supplier NDA and say no more about that matter.
The real "direty secret" is that embedding 15 non-removable disk drives and a RBOD disk controller inside a "brick" and selling it as the equivalent of a 10-drive (usable) device does not inherently nor automatically make it any more reliable than putting 15 drives into a DAE under the control of external controller hardware and software. Same number of components, each with same MTBF put together in a similarly intelligent manner will yeild the same projected AFR. In in fact, if EMC simply left failed drives in the arrays rather than replacing them, its recorded AFR would be lower (remember, when a device fails inside the "brick" it is still a failure, except that the failed part is not removed and thus not part of the AFR calculations).
With the "brick" the asserted availability comes at a cost: of performance and flexibility (in addition to the cost of the "spare" drives and RBOD controller); with the external DAE, the same (or better) availability is delivered without the sacrifices and with the added flexibility to mix and match drives to meet varied workloads.
And the brick itself becomes a unit of potential failure; losing a brick subjects the customer to a data loss 10x larger than the loss of an individual drive. While L&S will argue this is rare, they cannot prove it will never happen - so your argument about "catastrophic" loss potential applies to their solution as well.
That said, I don't know what your definition of "catastrophic" is, but the fact that today's drives are 10-20x as big as those of 10 years ago does not make their failure 10-20x as damaging. Losing 50GB of a database is no better than losing 500GB - you will have to recover the ENTIRE database either way. So I think you may be exaggerating the practical implications just a bit with the use of the word "catastrophic."
I will concede the bottom line point, though - if people are relying on RAID protection of disk and solid-state drives as their only insurance against data loss, then they are playing Russian Roulette with a Colt 45. No matter HOW reliable the implementation, drives WILL fail, and sooner or later backups WILL be required to recover from said failures. Too many unsuspecting shops run with no separate physical backup of their data. Remote replication, isolated/insulated backup-to-disk targets and even the Cloud are all critical components of a true BC/DR plan, and the "brick" provides none of these important infrastructure components.
My recommendation would thus differ from yours: Make sure you and your storage vendor have architected for optimal performance, flexibility and availability. Then spend 4x more time making sure your backups and BC/DR plans are comprehensive and functional. Murphy's Law cannot be subverted: you WILL lose data sooner or later, and you WILL need a reliable recovery plan.
Posted By:the storage anarchist| Wed Jul 07, 2010 08:09
-
I find myself with Barry on this one. As NetApp practices many (all?) of the same techniques employed by EMC to reduce the risk of disk failure impacting data availability and reliability, I really can't see the advantages to reducing and bundling up the AFR into what is, after all, no more than a black-boxed shelf. It doesn't improve the lifetime of a single drive, the inefficiencies are amazing, the performance variable and declining over the lifetime of the unit, and to be frank it's no more than a not-very-smart array that looks like a drive.
And again I'm with Barry on the insurance against data loss aspects; ISE isn't a sensible substitute for backup & recovery.
Disk professionals at vendors -- well, some -- are facing up to the individual disk risk by addressing it at the array level, not the disk level. As the AFR of drives increases (a debatable point), the AFR of a good quality array has gone down significantly, by any measure (per disk or per TB).
In fact, I would go as far as to suggest that the AFR of a good quality array far exceeds that of ISE, TB for TB.
Posted By:Alex| Wed Jul 07, 2010 08:46
-
Excellent points Barry. I think we agree on a lot - particularly the conclusion you put forth, which is definitely consistent with the action item I proposed. I think you said it better than I.
Interesting concept of what EMC's AFR would be if you just left the drives in place. Would like to explore that more.
The questions then become: 1) what is the best way to solve the problem - as Floyer and you say, there are other ways than a brick - and 2) what's the best way to achieve the right combination of reliability, cost and performance.
We'd be happy to host another session on this topic and invite a broader perspective than the narrow one put forth by L&S. This is a complicated topic and probably needs more discussion. Are you or your colleagues game?
Alex - we'd welcome your input as well. Barry shared EMC's rough AFRs - what is NetApp seeing in the field?
Posted By:David Vellante| Wed Jul 07, 2010 08:51
-
"In fact, I would go as far as to suggest that the AFR of a good quality array far exceeds that of ICE, TB for TB."
Okay...Let's get on another call and have that discussion. If you guys (Barry/Alex) are game I will ping L&S to see if they'll do a "Part Deux"
Are you guys in?
Posted By:David Vellante| Wed Jul 07, 2010 08:54
-
I love a challenge, but let me get some hard facts at my disposal first before I agree to this.
Also bear in mind that I'm not the expert on disks (but I'll argue stats until the cows come home). So I'm not party to disk AFRs that NetApp sees in the field; I'd need to do some digging, but I would be confident they would reflect Barry's numbers.
We also need a solid definition of what is meant by a failure; is it failure with data loss? (That appears to be the traditional measure of disk AFR.) Failure with downtime, but no data loss? Shelf failure? Single or clustered arrays?
Posted By:Alex| Wed Jul 07, 2010 09:11
-
A small correction; when I said "In fact, I would go as far as to suggest that the AFR of a good quality array far *exceeds* that of ISE, TB for TB," I actually intended to say the *reliability* of a good quality array etc...
Posted By:Alex| Wed Jul 07, 2010 09:14
-
David -
Today's edits are a constructive improvement to the document.
However, though I admit I don't know what Lary or Sicola actually said, repositioning their assertion that vendors saw their ISEs as a "competitive threat" instead of a "competitive threat to maintenance revenues" is still misleading IMHO. As the comments here show, it is probably more likely that vendors saw/see the ISE as an unnecessary cost-adder; since the ISEs cannot be guaranteed infallible, vendors would have to RAID *across* the ISEs as well, further increasing the unusable capacity ratio and driving up the $'s per usable GB.
Now, if that's what they actually said, fine. But I don't think all that really has any bearing on the article, other than to support the revenue and competitive interests of Xiotech. In fact, looking at the article now, it might have been more impactful and beneficial to the community had it not delved into anyone's mitigation approach at all.
Your new edits also imply that AFR data is being withheld by those of us who are now engaged. I stated that the AFR for EMC's drive fleet is around 1% - below that asserted in the article.
As for system-level AFR, I'll follow Alex's lead and ask that we first define what we mean by system-level AFR. I'll cast my vote that the AFR metric should include both Data Loss *AND* Data Unavailable events that are attributable to the storage array (e.g., exclude non-array impacts such as loss of power, broken FC cables, improperly configured switches, lack of alternate/multi-pathing SW, operator error, planned downtime and the like).
With a fair and proper definition, I'll agree with Alex that *ANY* quality array should EASILY deliver a better AFR than the 0.1% claimed for the ISE.
Seriously - isn't that what Five Nines (or better) of availability is all about?
Posted By:the storage anarchist| Wed Jul 07, 2010 04:56
-
Thanks again Barry...
1/ The full audio link is posted in the footnote of this article - give it a listen when you get a minute (or sixty).
2/ Tried to clarify the Seagate intention language. Seagate has publicly said it shed the unit because it didn't want to compete w/ its customers. Lack of traction could have something to do with the decision as well.
3/ Cited EMC's AFRs directly in the piece - thanks for sharing that. Would love to see similar data from NetApp and other vendors.
I don't really agree that the Xiotech solution shouldn't be put forward - it's an approach that should be examined as this article and its comments do.
I'm sure the Wikibon community would love to hear about other approaches to this problem...or have Sicola and Lary back to defend their assertions with a wider audience.
Ping me and I'll schedule it.
Posted By:David Vellante| Wed Jul 07, 2010 11:37
-
Barry/Alex
The importance of this Peer Incite is highlighting innovation. The premise for the innovation is that drive reliability is a problem (1% or greater per year per drive). The innovation is that the unit of replacement is a brick and not a drive. The only company delivering that innovation is Xiotech. The speakers have a rock-solid reputation in the industry for innovation.
The first question to ask is - is that innovation important? Let's just assume the vendor claims are absolutely true. Let's take a petabyte array with 1TB SATA drives.
The array with 1,000 individual drives and a 1% drive failure rate will have 10 drive failures every year.
The array with 100 bricks and a 0.1% brick failure rate will have 1 failure every ten years.
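A quick check of that arithmetic (taking the claimed rates at face value):

```python
# Floyer's comparison, taking the claimed rates at face value.
drives, drive_afr = 1000, 0.01   # 1 PB as 1,000 x 1 TB drives, 1% AFR
bricks, brick_afr = 100, 0.001   # same capacity as 100 bricks, 0.1% AFR

print(drives * drive_afr)   # 10.0 -> ~10 drive failures per year
print(bricks * brick_afr)   # 0.1  -> ~1 brick failure per ten years
```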
Is this innovation important? Nobody can doubt it.
Will EMC and NetApp have their version of a brick (maybe cinder block?) - very likely.
Is the brick (or cinder block) an absolute necessity today? No. Users can achieve the RPO and RTO requirements of today with their current arrays.
Are there other features within the array that are also important to achieving RPO and RTO? Absolutely.
Do EMC and NetApp have more of these features than Xiotech? Absolutely.
Did this Peer Incite illustrate important technology trends? Yes; current RAID technologies are or will be insufficient going forward, and future arrays are very likely to have a much larger unit of replacement.
Actions: Dave Vellante, the only changes you need to make to the pieces are to clarify the key innovation (brick as a replaceable unit) and the key benefits, and take out ascribed vendor motivations. Barry/Alex - you both belong to companies that have brought innovation after innovation to the marketplace. Luddite marketing damages that reputation. Be gracious to others in the industry that bring innovation. You both know that when this is introduced by your companies, you will be marketing the c*** out of it!
Posted By:David Floyer| Thu Jul 08, 2010 10:14
-
David -
Did you just call Alex and me Luddites?
Seriously?
Kind sir, I would hope we can carry on constructive conversations without resorting to name-calling.
And as to your mythical math - it matters not whether the drives are in an array or in a "brick" - you will get 10 failures (or more) per year either way.
That the ISE hides these failures within is commendable (as I acknowledged earlier). But that comes at rather significant cost - the ISE customer buys 1500 drives to get the same raw capacity (the ISEs use 15 HDDs to provide 10 HDDs of "reliable" storage)...and since the failure rate of the ISE is not ZERO, these 100 ISEs must also be RAIDed, netting the same usable capacity as 1000 drives. ISE performance also degrades as drives fail within the ISEs, adding to the cost, while the single-drive model maintains constant performance over the life of the array.
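Barry's cost argument in rough numbers (a sketch; the 15-raw-per-10-usable ratio is his characterization, which Rob disputes below):

```python
# Barry's cost math: raw vs. usable spindles for ISE-based capacity.
usable_drives_needed = 1000
per_brick_raw, per_brick_usable = 15, 10   # Barry's characterization

bricks = usable_drives_needed // per_brick_usable      # 100 ISEs
raw_drives = bricks * per_brick_raw                    # 1,500 drives bought
overhead = raw_drives / usable_drives_needed - 1       # 50% extra spindles
print(bricks, raw_drives, f"{overhead:.0%}")           # 100 1500 50%

# And if the non-zero brick AFR forces RAID *across* ISEs, usable
# capacity drops further, pushing $/usable-GB up again.
```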
Drive-based arrays achieve equal or better availability despite the drive failure rate, and since customers don't have to pay for replacements for at least part of their ownership period (depending on their tech refresh and lease rotation cycles), I maintain that the Xiotech ISE is merely an alternative approach - and one that falls far short of the "Holy Grail" you seem intent on bestowing upon it.
As to your assertions that EMC and NetApp are both building similar products, well, I have no clue where you got that impression. To my knowledge, nothing has been publicly stated about such a development for either Symm or CLARiiON. Would you mind citing your source(s) for us?
Posted By:the storage anarchist| Thu Jul 08, 2010 11:00
-
Well, at least as a Luddite I'm in fairly good company. I don't speak for what NetApp may or may not do in the future, but right now I can assure you that this is not a technology that has legs. There are several points, some of which Barry has noted, that ensure we won't be doing anything with ISE soon. From my perspective:
1. NetApp's WAFL goes to extraordinary lengths in placing data on drives. Performance, both write and subsequent read, is improved dramatically by good data placement, and even more so when disks are nearing capacity. ISE has a decreasing performance profile as disks fail in the unit; not good, especially as used capacity will normally be reached as end of life is approached.
2. A featureless brick (API notwithstanding) has no attraction when it can't add value beyond a bare drive. They don't provide compression (although they could), encryption (ditto), deduplication (but at that AFR I'd run a mile), clustering and so on. Replication is up to the application and at the application level, something I would contend is just plain wrong. These are features where the array designer and manufacturer leads, and the disk drive manufacturer (and ISE) may follow, if at all.
3. ISE has the dubious benefit of being no more reliable than RAID-6. It appears to be significantly less so. An improvement of 10x on the AFR of a *single* disk drive to 0.1% doesn't seem that remarkable when RAID-6 is 1000x better than RAID-5, which is itself an improvement of orders of magnitude over a single unprotected drive. Sophisticated and usable (n-k) parity schemes are already available, EMC's Atmos and Cleversafe being two that I know of. And that's a place where ISE can't follow.
The only bright spot with ISE is the reduced number of visits to the data center to replace drives. Given that it requires a purchase of 50% more disks and space than you need (and possibly more if they're traditionally RAIDed), and requires they be powered and cooled for 5 years, I wish you luck making the numbers work out.
Posted By:Alex| Thu Jul 08, 2010 12:28
-
I know it's a rare value we're providing to users when we can get Alex and Barry to agree :-)
This is a tangent Alex...but isn't your claim that WAFL improves performance through good data placement, especially when disks are nearing capacity, a function of the garbage-collection penalty inherent in any log-structured file system? Maybe we should start another thread for that discussion.
Actually - we already have one:
http://wikibon.org/wiki/v/WAFL_Performance
Yeah David - easy on the British name-calling dude - still smarting over the July 4th holiday?
Once again Alex I think you've identified a key point of the issue, which is the API business model. Not sure everyone agrees (i.e. buyers) that replication in the application is just plain wrong. It depends on a number of factors I won't go into here.
I really think we should have this discussion again with Lary and Sicola and come to some conclusions we can agree on. The discussion we had before was VERY narrow and this conversation widens the scope.
Are you guys willing to do that?
Posted By:David Vellante| Thu Jul 08, 2010 12:58
-
So the issue that's come forth, as I see it, is this: Is this a problem for CERN and Google but not for vendors that have taken alternative approaches?
Based on the data Barry supplied, Symmetrix, for example, does not suffer from the delta between field AFRs and published specs. There may be others as well and we'd encourage the publishing of AFR definitions and data.
Looking forward to furthering the discussion.
Posted By:David Vellante| Thu Jul 08, 2010 02:50
-
Excellent conversation! I see the topic has generated some interest - meaning, Steve and Richie hit some hot buttons :-)
First, hats off to Barry and the other posters for chiming in. That's what makes Wikibon a good venue - highly respected folks like Barry posit their views. Thank you again.
Now, some specific responses, in rough (chronologic) order.
*DAEs are not analogous to the ISE, in truth. About the only thing they share is the fact they enclose HDDs. After that, the differences are striking.
*The ISE does indeed allow intermix within itself - e.g. one datapac of 10x300 GB/15K drives and one datapac of 10x600GB/10K RP (reduced power) drives in the same ISE.
* Symm does indeed build RAID sets as Barry described - and the Emprise 7000 likewise stripes across multiple datapacs and multiple ISEs across multiple channels.
* In fact, we use fabric connections to ISEs, not arbitrated loop as do many other vendors.
* Symmetrix does a nice job with SMART data, but that only takes one so far. To be optimal, one must write their own disk firmware - which we do. That way we get access to much more data than SMART can project. Active telemetry - drives actually conversing with intelligent controllers - goes far beyond SMART. Besides, guys like Richie and Steve know just a bit about disk drives as well :-)
* Barry is correct - their standard warranty is 3 years. Xiotech's standard warranty is 5 years. As one of the cable TV networks is fond of saying, "you decide."
* I agree with Barry that vendors do not purposely design systematic failure as a motive for profit. But the fact remains that human intervention is required in legacy designs in drive failure cases - over half of which are false positive, according to Seagate's own data - and the human intervention means opportunity to charge endusers for same.
* The CERN paper did indeed highlight silent data corruption. However, the technique Barry correctly describes is not enough. CRC'ing the data and storing that CRC is good - but that's only one of the three checks ANSI T10 DIF performs - which the ISE has built-in. T10 DIF also verifies both the LBA virtual to physical address translation and the LUN identification. In other words, you may have good data, but did you write it to the correct location? Munging the address is just as bad as munging the data.
* I also agree that very high-end SLC drives have better operating characteristics than low-end SLC or many MLC drives today. In my experience, very few datacenters can afford high-end SLC, and furthermore, placing them on arbitrated loops is (in my 33-year industry opinion) sub-optimal design. There are reasons why SLC drives are limited by the vendor as to how many can reside on the same loop or the same BOD - and if I designed an array with SLC drives on loops, I'd do the same. Personally, I'd design an array with SLC drives on fabric, not loop.
* We do not embed 15 drives in an ISE - we embed 20 or 40, depending on HDD form factor (3.5" or 2.5"). 15 drives in 3U is not optimal density. We do 40 drives in 3U.
* An entire device failing within an ISE is indeed still a failure, but the fact is that entire drives rarely fail inside an ISE. We perform head-level mapping and sparing, and a frequent use case is that just one head is 'out-of-whack', so why fail an entire drive? We don't, and therefore significantly reduce the risk of an entire drive failing. Besides, even if it did, we invoke recertification processes (as Steve and Richie outlined) to bring the drive back to health, just as one would on the bench, if at all possible.
* The 5-year warranty says it all...I challenge other vendors to put up a 5-year warranty on all hardware (not just their enclosures) for $0, for all customers - not just the ones they give heavy discounts to.
* One ISE is indeed a single point, as Barry correctly says. This is why we have our Emprise arrays use multiple ISE, just as we pioneered the technique of drive bay redundancy many years ago. Plus, many hypervisors/OS and even applications (think: Oracle ASM) manage their own volumes, and can thus manage any individual enclosure risk, be it JBOD or ISE.
* Barry makes great points on the B/C, D/R and the need to backup.
* The drive AFR of a good quality array is on the order of 0.7. Barry mentions 1.0 as a standard figure, and he is correct. However, current datapacs are running at .017 for not an AFR but a _5 year FR_. If you want to put that in AFR terms, that's 0.0034, or 34 drives per 10,000 per year. Compare to 0.7, or 7 drives per 1,000 per year. Our figure represents nearly a 250x increase in reliability (0.7/0.0034). (See the conversion sketch at the end of this post for the unit arithmetic.)
* Vendors don't have to RAID across ISE - in fact, best practice from one well-known vendor (behind their virtualization engine) is to use RAID on the LUNs served up by the backend array, and use RAID-0 at the virtualization engine 'front-end'. In other words, don't RAID on RAID, let the backend array protect the data. Which, as it turns out, ISE does exceedingly well at :-)
* We do need a good definition of system-level AFR. Since the ISE is a full system unto itself, though - unlike any BOD - our AFR stands.
* One must absolutely include planned downtime (a great oxymoron) in AFR calculations. Planned or not, if you lose access to data, that's an outage, the system is unavailable.
Finally, again, hats off to Wikibon for airing the podcast and article. This is very good discussion. Sure, there are other ways to 'solve' the problems - if you throw enough resources ($$, silicon, connectivity, FTEs and time) at them - but ISE is clearly innovative and solves many problems very cost-effectively. I liken it to fuel injection, invented by Hilborn in 1947 and commercially developed by Bosch in 1952. It took quite a while to 'catch on' in production vehicles. ISE is the same way - it's different but inherently the right play for HDDs. Yet many insist on constructing arrays with carburetors (BODs and loops). SAS helps here, but SAS BODs are inherently no more reliable than FC BODs. Sure, you can build a very nice, fast car with multiple carbs, but FI is clearly an optimal mechanism.
Our next debate will be over management mechanisms - we use REST. I urge others to do so as well, for the benefit of the endusers. Or, perhaps on per-spindle random I/O mixed-read/write performance on HDD, which is derivable from our published SPC-1 figures. But that's another discussion for another day :-)
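The zeroes in those failure-rate figures are easy to tangle (Rob posts a correction just below); a small conversion sketch that keeps the units explicit:

```python
# The thread's failure-rate figures mix percent and fraction conventions,
# which is how the zeroes get tangled. Generic, unit-explicit conversions:
def five_year_fr_to_afr(fr_5yr: float) -> float:
    """Cumulative 5-year failure fraction -> annualized rate, assuming
    independent, identically distributed yearly failures."""
    return 1.0 - (1.0 - fr_5yr) ** (1.0 / 5.0)

def per_n_drives(afr: float, n: int) -> float:
    """Express an AFR (as a fraction) as failures per n drives per year."""
    return afr * n

afr = five_year_fr_to_afr(0.017)     # reading "0.017" as a fraction (1.7%)
print(f"AFR ~ {afr:.4%}")            # ~0.3424% per year
print(per_n_drives(afr, 10_000))     # ~34 drives per 10,000 per year

# Read "0.017" as a *percent* instead, and the same math gives ~0.0034%
# per year, i.e. ~34 drives per million - hence the need to state units.
```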
Posted By:Rob Peglar| Thu Jul 08, 2010 02:52
-
Sorry, left out a '0' in my figure above. 0.0034 % is 34 drives per 100,000, not 34 per 10,000. I am typing too many zeroes :-)
Posted By:Rob Peglar| Thu Jul 08, 2010 03:25
-
My goodness, is there no limit to the length of a reply??? I'm going to try and address these from a NetApp perspective, one by one, where there's something to say. Barry will no doubt be along shortly; and David, I suspect that us agreeing is simply a factor of us both believing we're right on this one.
Rob's made a number of assertions where I've indicated he needs to complete the thought. They're marked And...? or Because...?
*The ISE does indeed allow intermix within itself - e.g. one datapac of 10x300 GB/15K drives and one datapac of 10x600GB/10K RP (reduced power) drives in the same ISE.
Just like an array.
* Symm does indeed build RAID sets as Barry described - and the Emprise 7000 likewise stripes across multiple datapacs and multiple ISEs across multiple channels.
The Emprise is an array, not ISE; that's a bit of a diversionary tactic, and striping is not data protecting RAID.
* In fact, we use fabric connections to ISEs, not arbitrated loop as do many other vendors.
And...?
* Symmetrix does a nice job with SMART data, but that only takes one so far. To be optimal, one must write their own disk firmware - which we do. That way we get access to much more data than SMART can project. Active telemetry - drives actually conversing with intelligent controllers - goes far beyond SMART. Besides, guys like Richie and Steve know just a bit about disk drives as well :-)
NetApp does not use SMART data; the firmware on the drives is ours, and goes far beyond SMART.
* Barry is correct - their standard warranty is 3 years. Xiotech's standard warranty is 5 years. As one of the cable TV networks is fond of saying, "you decide."
Fair point. The warranties are commensurate with the replacement & upgrade cycles we see in the field. But warranties don't make disks or ISEs inherently more reliable; they're the insurance policy, and this is a marketing point, not a technology point.
* I agree with Barry that vendors do not purposely design systematic failure as a motive for profit. But the fact remains that human intervention is required in legacy designs in drive failure cases - over half of which are false positive, according to Seagate's own data - and the human intervention means opportunity to charge endusers for same.
Dealt with later. NetApp, EMC and Xiotech don't design for failure as a given, so I don't quite get the point of your agreement with Barry. False positives we'll get back to.
* The CERN paper did indeed highlight silent data corruption. However, the technique Barry correctly describes is not enough. CRC'ing the data and storing that CRC is good - but that's only one of the three checks ANSI T10 DIF performs - which the ISE has built-in. T10 DIF also verifies both the LBA virtual to physical address translation and the LUN identification. In other words, you may have good data, but did you write it to the correct location? Munging the address is just as bad as munging the data.
This is key to NetApp protection of data. Not only do we CRC, but we also detect lost and misplaced writes and correct them.
* I also agree that very high-end SLC drives have better operating characteristics than low-end SLC or many MLC drives today. In my experience, very few datacenters can afford high-end SLC, and furthermore, placing them on arbitrated loops is (in my 33-year industry opinion) sub-optimal design. There are reasons why SLC drives are limited by the vendor as to how many can reside on the same loop or the same BOD - and if I designed an array with SLC drives on loops, I'd do the same. Personally, I'd design an array with SLC drives on fabric, not loop.
Because...? The backend isn't the limit; it's the customer's wallet. Why grossly over-engineer for a 3 year lifecycle? We don't currently supply our own SSD, and I'm unaware of what technology we'll use. We're big believers in flash as cache though; we've sold 1PB so far.
* We do not embed 15 drives in an ISE - we embed 20 or 40, depending on HDD form factor (3.5" or 2.5"). 15 drives in 3U is not optimal density. We do 40 drives in 3U.
Noted.
* An entire device failing within an ISE is indeed still a failure, but the fact is that entire drives rarely fail inside an ISE. We perform head-level mapping and sparing, and a frequent use case is that just one head is 'out-of-whack', so why fail an entire drive? We don't, and therefore significantly reduce the risk of an entire drive failing. Besides, even if it did, we invoke recertification processes (as Steve and Richie outlined) to bring the drive back to health, just as one would on the bench, if at all possible.
NetApp ditto; the recertification takes place on failed drives, and if they're OK, back into service they go. Do it again, though, and we RMA them.
* The 5-year warranty says it all...I challenge other vendors to put up a 5-year warranty on all hardware (not just their enclosures) for $0, for all customers - not just the ones they give heavy discounts to.
The 5-year warranty doesn't make the ISE or disks any more reliable. Let's get the technology out of the way, then you get bragging rights.
* One ISE is indeed a single point, as Barry correctly says. This is why we have our Emprise arrays use multiple ISE, just as we pioneered the technique of drive bay redundancy many years ago. Plus, many hypervisors/OS and even applications (think: Oracle ASM) manage their own volumes, and can thus manage any individual enclosure risk, be it JBOD or ISE.
If you really know as much as you claim about disks, you'll recognise the effort it takes to make these most intransigent of devices behave. OS and apps (including ASM with its 2/3/4 mirror protection) do just enough and no more, and depend to a large degree on the disk behaving itself, or the mirrors, or the RAID card. Managing your own disks is about as sensible as building your own nuclear power plant.
* Barry makes great points on the B/C, D/R and the need to backup.
Agreed.
* The drive AFR of a good quality array is on the order of 0.7. Barry mentions 1.0 as a standard figure, and he is correct. However, current datapacs are running at .017 for not an AFR but a _5 year FR_. If you want to put that in AFR terms, that's 0.0034, or 34 drives per 10,000 per year. Compare to 0.7, or 7 drives per 1,000 per year. Our figure represents nearly a 250x increase in reliability (0.7/0.0034)
The AFR of a good quality *disk* might be 0.7, but a good quality *array* can deliver 5 9s of uptime; we're heading for 6 9s. RAID-6 is orders of magnitude better than 250x. It's 1000x better than RAID-5, which is 100x better than a single disk. (A rough sketch at the end of this post illustrates the scaling.)
* Vendors don't have to RAID across ISE - in fact, best practice from one well-known vendor (behind their virtualization engine) is to use RAID on the LUNs served up by the backend array, and use RAID-0 at the virtualization engine 'front-end'. In other words, don't RAID on RAID, let the backend array protect the data. Which, as it turns out, ISE does exceedingly well at :-)
Only if your AFR is minuscule, and it isn't. With literally *exabytes* of data out there, we lose a vanishingly small fraction of that due to failure.
* We do need a good definition of system-level AFR. Since the ISE is a full system unto itself, though - unlike any BOD - our AFR stands.
Not clear on what stands here.
* One must absolutely include planned downtime (a great oxymoron) in AFR calculations. Planned or not, if you lose access to data, that's an outage, the system is unavailable.
No you don't! Since when does the F in AFR stand for anything other than failure? Pick another name or another metric.
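A rough MTTDL-style sketch of the scaling described above (assuming independent failures and a fixed rebuild time; the parameters are illustrative, not NetApp's):

```python
# Rough MTTDL model: RAID-5 loses data on a 2nd failure during a rebuild,
# RAID-6 on a 3rd failure during two overlapping rebuilds.
mtbf_h = 500_000      # per-drive MTBF in hours (~1.75% AFR)
rebuild_h = 24.0      # rebuild time onto a spare, in hours
n = 8                 # drives per RAID group

mttdl_raid5 = mtbf_h**2 / (n * (n - 1) * rebuild_h)
mttdl_raid6 = mtbf_h**3 / (n * (n - 1) * (n - 2) * rebuild_h**2)

hours_per_year = 24 * 365
print(f"RAID-5: {mttdl_raid5 / hours_per_year:,.0f} years to data loss")
print(f"RAID-6: {mttdl_raid6 / hours_per_year:,.0f} years to data loss")
print(f"improvement: {mttdl_raid6 / mttdl_raid5:,.0f}x")   # ~3,472x here
```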
Now I'm exhausted. Over to you Barry.
Posted By:Alex| Thu Jul 08, 2010 04:55
-
Whew! Is anybody except us actually following this?
In (some sort of) order:
* Symm also uses custom drive firmware, for many of the same purposes and reasons described here - and then some. *DRAW*
* Symm indeed does more than check the CRC - it uses the full T10DIF capabilities to validate both the data and the LBA. And we provide T10DIF for SATA drives as well, even though they cannot be reformatted to store the DIF bytes alongside the data blocks...we may be unique in this dimension, but I'll still call it a *DRAW*
* I despise the repeated attempts to assert that replacing failed drives equates to charging customers for NTF drives. If the customer never sees a bill for the visit, the point is moot. *DRAW*
* Rob can argue theory and design of SLC SSD deployments, but EMC has likely shipped more array-based SSDs to date than all competitors combined. Further, the adoption rate contradicts Rob's assertion that no one can afford them. *+1 VMAX*
* 15, 20, 40 drives in an ISE. Matters not - it is always more raw capacity than usable. Even the # of drives in 3U is irrelevant - any one of us could package more drives in a 19" standard rack than a raised floor can support. *REPLAY*
* At EMC, we do not reuse drives that have been declared suspect and that the array could not recover on its own (yes, indeed, we do attempt drive resuscitation for many error conditions, just like you guys). *DRAW*
* As Alex says, a 5 year warranty says nothing about the quality or uniqueness of the solution. And indeed, Xiotech is not the only supplier to offer 5 year warranties on the entire array. *DRAW*
* I said the drives we use have an AFR of around 1%. And I stand with Alex - we do not have a common definition of SYSTEM AFR. Internally at EMC, we treat "Failure" as *ANY* loss of access to data directly attributable to the array (including planned downtime required for service - which should always be ZERO for Symmetrix).
By the way, you (and others) tend to confuse *Availability* with *Reliability*.
VMAX is already delivering 6 9's of availability, and in Symmetrix parlance that indeed means no loss of access to ANY data in the array for more than (what is it - less than 60 seconds per year?). AFR and service costs be damned - the customer measures by the availability of his/her data, not Mean Time Between Parts Replacement or AFR.
Adding up the points (yes, I stopped scoring), one can only conclude that EMC, NetApp and Xiotech each have invested lots in protecting the integrity and availability of their customer's data. BUT, it is also clear that whatever innovations and differences there may be on each platform, it is insufficient to declare any one significantly "better" in every dimension.
And the one point we ALL agree upon is that NONE of this gets ANYONE out of doing backups. Components WILL Fail, You WILL Lose Data, and you Better Have A Plan (or at least, 3 envelopes for your replacement :-).
Posted By:the storage anarchist| Thu Jul 08, 2010 06:52