Some argue that primary storage is not the best place for de-duplication. The reasoning: de-dup pays off where there is a lot of, well, duplication. Primary storage tends to hold transactional data, while secondary storage holds more duplicate data.
While this is true, there is more duplicate data on primary storage than users realize. Specifically, plenty of inert data sits on primary storage – data that has not been referenced in more than six months. Users are almost always surprised by how much we find – around 40% of capacity on average.
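As a rough illustration of how inert data can be measured, the sketch below walks a directory tree and tallies bytes in files whose last-access time is older than a six-month cutoff. This is a hypothetical example, not any vendor's assessment tool; real assessments use metadata crawls, and access times can be unreliable on filesystems mounted with `noatime`.

```python
import os
import time

SIX_MONTHS = 180 * 24 * 3600  # cutoff in seconds (~6 months)

def inert_bytes(root, now=None):
    """Return (stale_bytes, total_bytes) for files under root.

    A file counts as 'inert' when its last-access time is older
    than the six-month cutoff.
    """
    now = now or time.time()
    total = stale = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip unreadable or vanished files
            total += st.st_size
            if now - st.st_atime > SIX_MONTHS:
                stale += st.st_size
    return stale, total
```

Dividing `stale` by `total` gives the kind of percentage figure quoted above; on many installations that ratio lands near the 40% mark.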
The next question is what to do with this data – it needs to be cleaned up or moved in order to return that 40% to the free capacity pool.
One clean-up step is data de-duplication – and in some instances a significant amount of data can be de-duplicated. Why are there duplicates on primary storage at all? Poor (or absent) data management practices leave online storage littered with duplicate or wasteful data sets.
- One example: application engineers testing new applications or updates need to run tests on real data – but obviously can't run them against live production data. So they make a snap copy of the production data and test against that copy. For the next test, they make another copy, and so on. Do they remember to go back into the system and clean up their copies? Most often the answer is no – and this one practice (among many) robs a primary disk system of its precious capacity.
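To make the idea of finding such duplicates concrete, here is a minimal sketch that groups files by a SHA-256 digest of their contents. This is an assumed, whole-file illustration only; production de-duplication typically works at the block or sub-file level, which catches far more redundancy than whole-file matching.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Group files under root by content digest; return only groups
    with more than one path (i.e., exact whole-file duplicates)."""
    by_digest = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 64 KiB chunks so large files don't load into memory.
                for chunk in iter(lambda: f.read(1 << 16), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Each surviving group represents capacity that could be reclaimed by keeping one copy and replacing the rest with references – the essence of what de-dup does automatically.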
Data de-duplication can have a significant impact on primary storage in addition to secondary storage. But as with any storage technology, how it is implemented is the critical part of the equation.
Action Item: Users should recognize that considerable online storage capacity is wasted through poor storage management practices, including leaving multiple redundant copies of data on primary disk. Data de-duplication applied to primary storage may offer some hope; however, in this economic climate, users should start by assessing their specific installations, developing classification, data retention, and migration policies, and implementing better storage management practices before making large investments.
Footnotes: