As a young account manager, I was sitting late at night in the data center with the acting CIO of a retail company, reviewing an order-fulfillment project from hell. The project had driven the CIO to a nervous breakdown, and he was on sick leave. The deadlines were tight and the software was new and unstable. Nobody had made an obvious mistake, but things had gone badly wrong and the system was being recovered for the third time. We were on the very last recovery tape. If it did not work, we could not recover the system. We reviewed our options if the backup failed: how to re-enter the data, how many people it would take, how long it would take, and how to present this to the CEO. I remember the fear in the pit of my stomach that the company would probably go under. The last backup worked, the CIO recovered and has been a lifelong friend. I have always used that fear-factor to guide me in assessing project risk.
I was reminded of this incident a few years ago when I was sitting with another acting CIO, this time from an insurance company, discussing the results of a failed project that had caused his predecessor to be fired. The project was to replace a bespoke system with packaged software and reduce costs by $20 million. They had cut over, but the new system could not provide the functionality, and they had to employ a large number of people to keep the system from failing. The project failure had cost the company nearly $0.5 billion in lost revenue and business opportunity. The acting CIO mused on how the failure of the IT project had very nearly caused the company to go under, and on what could have been done to reduce project risk. I felt that fear-factor again in the pit of my stomach.
In both of these cases, everybody had good intentions. Both projects were reviewed, and the payback seemed worthwhile. There were no obvious big mistakes, just a series of “unlucky” events. Human beings are very poor at judging the impact of unlikely risks that have big consequences – Nassim Nicholas Taleb’s “black swans”. The recent economic collapse was a direct result of failing to take into account the high consequences of a series of low-probability risks. The fact that the bankers could take these bets while being insured against disaster with taxpayers' money only compounded the problem.
Richie Lary and Steve Sicola of Xiotech Corporation spoke persuasively to the June 30, 2010 Peer Incite about the growing risk of disk failure. They argued from a very informed vantage point that the probability of disk failure was up to 13 times higher than that reported by drive and array manufacturers. Independent research confirms that view. The killer problem is not a single failure, but a double or triple drive failure. The probabilities are up to 150 times higher for a double drive failure, and up to 2,000 times higher for a triple drive failure, than those suggested by the manufacturers. The technology trends point to increasing recovery times in the event of a disk failure, and there is a rapidly increasing number of disks in service, with ever denser and less reliable packaging. Good intentions abound, but the pressures of meeting vendor shipment targets and staying within IT budgets are significantly greater than the ability to understand and judge the true risks of data loss. And, as with the bankers, there will be no severe consequences for the individuals if failure occurs.
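To see why a modest error in the single-drive failure rate turns into a very large error for multiple failures, here is a minimal Python sketch. It is not the speakers' calculation, which is not given here. It assumes that drive failures are independent and that the same 13x underestimate applies to every drive; the quoted per-drive probability and the function names are hypothetical. Under those assumptions the underestimate compounds as a power of the single-drive factor, so 13x becomes 13² = 169x for a double failure and 13³ = 2,197x for a triple failure, the same order of magnitude as the figures quoted above.

```python
def prob_all_fail(p_single: float, drives: int) -> float:
    """Probability that `drives` specific drives all fail in the same window.

    Assumes failures are independent (a simplifying assumption, not a claim
    from the source).
    """
    return p_single ** drives


def underestimate_ratio(quoted_p: float, multiplier: float, drives: int) -> float:
    """How many times higher the multi-drive failure probability is when the
    real per-drive probability is `multiplier` times the quoted one."""
    actual_p = quoted_p * multiplier
    return prob_all_fail(actual_p, drives) / prob_all_fail(quoted_p, drives)


QUOTED_P = 0.001    # hypothetical quoted per-drive failure probability per window
MULTIPLIER = 13.0   # "up to 13 times higher", taken from the text above

for drives in (1, 2, 3):
    ratio = underestimate_ratio(QUOTED_P, MULTIPLIER, drives)
    print(f"{drives} drive(s) failing together: underestimated by about {ratio:,.0f}x")
```

In practice, drives in the same array often share age, batch and environment, so treating their failures as independent is itself a simplification.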
The fear-factor feeling is back in the pit of my stomach.
Action Item: Wikibon predicts that there will be a billion-dollar catastrophic business failure caused by data loss in the next five years. CIOs, auditors and storage vendor CEOs must take aggressive action to ensure that it does not happen to their organization on their watch.