Data Migration within Federated Storage

Understand what 'Non-disruptive' Means

When vendors talk about non-disruptive migration, you need to read between the lines and ask the right questions to determine whether they mean truly non-disruptive or only sometimes non-disruptive.

Wikibon recently defined federated storage as a collection of autonomous storage resources governed by a common management system. The best way to think about federated storage is as a collection of loosely connected storage resource nodes. The nodes can be storage arrays or appliances controlling multiple arrays. The management system provides the rules, in particular about how data is migrated throughout the network.

One of the major business problems that federated storage tackles is the migration of data between the nodes of a storage network. This can be to allow upgrades in storage technology or to realign the allocation of storage resources to application and business needs. This capability requires non-disruptive migration of data between the nodes.

Non-disruptive migration is of growing value and importance. Most disruptions to an application require extensive planning and leave only a narrow window during which migration of data can occur. Solutions have been available for mainframe applications, but open-systems solutions have been limited to file-based storage. In general, it is easier to lock a file and move it dynamically. However, for the business-critical, large-scale, update-intensive applications that use block-based storage, the solutions are not as mature.

There are server-based techniques to achieve this (VMware VMotion being a recent addition) which are useful for small systems. However, the elapsed time and resilience of these techniques do not allow large amounts of data to be transferred. Array-to-array solutions are the best technology foundation for large-scale migration of data.

Non-disruptive migrations within an array are now possible on most tier 1 and tier 1.5 arrays (e.g., 3PAR, XIV). This can help reduce the cost of storage and remove performance bottlenecks, but it does not tackle major realignments or technology upgrades. Products that support externally attached heterogeneous storage, such as Hitachi’s USP V, and appliances such as the IBM SVC allow migration between arrays within a node. If the node controller itself needs a technology upgrade, appliances in theory allow this to happen by upgrading one side of the appliance and then the other. But this is not for the faint of heart, as there is no fail-back during the process! I know of at least one large university that tried to perform such a tightrope walk and ended up taking out all its major applications for days during a migration. And don’t forget that the problem of moving data from one node to another still remains.

As far as I can determine, currently (as of 10/09) the only array-based solution for open-systems block-based storage that allows rapid, truly non-disruptive migration of data between federated storage nodes is Hitachi’s High Availability Manager (HAM). This function (which really needs a new name) allows two USP V arrays to be dynamically connected, data to be moved non-disruptively between the two arrays, the application to be cut over to the new array, and all connections to the original array to be severed. The function uses a metadata quorum disk to arbitrate between the two arrays in the case of any failure during the data transfer process. This is unique in the block-based storage industry, as my research suggests XIV, SVC, and other products fall short in this capability. EMC’s V-Max likewise represents another disruptive generational migration for EMC customers.
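To illustrate why a quorum disk matters during a two-array migration, here is a minimal conceptual sketch. The class and method names are hypothetical, and this is not Hitachi's implementation; it only shows the arbitration idea: if the inter-array link fails mid-migration, both arrays race to claim the shared quorum, and the loser stops serving IO so a split-brain cannot occur.

```python
# Conceptual sketch of quorum-disk arbitration between two arrays during a
# migration. All names are illustrative, not vendor implementation details.

class QuorumDisk:
    """Shared metadata disk; the first array to claim it wins arbitration."""
    def __init__(self):
        self.owner = None

    def try_claim(self, array_id):
        # On real hardware this would be an atomic reservation; simplified here.
        if self.owner is None:
            self.owner = array_id
            return True
        return self.owner == array_id

class Array:
    def __init__(self, array_id, quorum):
        self.array_id = array_id
        self.quorum = quorum
        self.serving_io = True

    def on_peer_link_failure(self):
        # Each array races to claim the quorum; the loser stops serving IO
        # so that only one side of the broken pair continues.
        if not self.quorum.try_claim(self.array_id):
            self.serving_io = False
        return self.serving_io

quorum = QuorumDisk()
source, target = Array("array-1", quorum), Array("array-2", quorum)
# Both arrays detect the link failure; only one survives arbitration.
results = [source.on_peer_link_failure(), target.on_peer_link_failure()]
```

The key design point is that the tiebreaker lives outside both arrays, so neither side has to guess whether its peer is dead or merely unreachable.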

In my view, this capability is a fundamental building block for the adoption of federated storage networks. Without it, federated storage is marketing hype. All storage vendors will need to put this function on their roadmaps. Users should be asking for details of how and when storage vendors will deliver such capabilities, and asking the right questions to get to the truth, including:

  1. What do you mean by non-disruptive?
  2. Can I migrate data non-disruptively within a storage array?
  3. Can I migrate data non-disruptively within storage nodes?
  4. Can I migrate data non-disruptively across storage nodes?
  5. When I need to do a storage technology refresh, do my applications take any downtime?
  6. During a so-called non-disruptive migration or upgrade, if something goes wrong, how do I recover?


  • David – thanks for clarifying some of these issues. I've had numerous discussions with vendors that have claimed emphatically that their products do non-disruptive migration only to find out in fact they only do so in some narrow use cases; or the permutations of host-based software, migration tools and pathing software required are overly complex, confusing and have limited installations. As well, this issue of 'perpetual generational migration' – i.e. being able to refresh a technology w/o downtime will imho become increasingly important and demanded by cloud hosting providers and large shops.

  • Can you define what you mean by “tier 1.5” storage? In my mind it's between a Clariion and DMX or IBM DS5000-series and IBM 8000-series. I don't understand why XIV would be placed in that category.

  • I would disagree that the SVC is disruptive. The SVC is completely non-disruptive from 4.3 onward, via multiple methods, if we look at just the SVC itself. The problem is that IBM marketing is IBM marketing, and they often do a very poor job of explaining the SVC’s capabilities and limitations, let alone keeping current.

    Hardware intermix (e.g. 2145-8F2 and 2145-8G4) is permitted, dependent on software support. So long as all your 2145-8F2’s and 2145-8G4’s are running the same 4.3.x, or newly introduced nodes are at a lower version, it’s the same as adding nodes. Taking 2145-4F2’s to 2145-8G4’s is as simple as adding the new hardware to the cluster, then removing the old hardware. This is covered in “Implementing the IBM System Storage SAN Volume Controller V4.3” (SG246423), Appendix D. This actually removes the need to migrate between clusters in many cases, since typically a cluster migration is done for hardware refresh.
    Node migration remains fully non-disruptive; it consists of changing the VDisk’s preferred IO Group. Migration between arrays and spanned arrays also remains fully non-disruptive with fail-back, as VDisk extents are copied during migration instead of being moved. This eliminates the need for any quorum during migration, because it’s a copy operation. Software upgrade is fully non-disruptive with improved fail-back from 4.2 onward – if nothing else, the SVC team learns from their mistakes quickly.
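    The copy-instead-of-move point above is worth making concrete. Below is a small hedged sketch (hypothetical function names, not the SVC's actual extent manager) of why a copy-based migration gets fail-back for free: the source extents are never modified, so aborting at any point simply leaves the original volume in service.

```python
# Illustrative sketch of copy-based extent migration, as described in the
# comment: extents are copied rather than moved, so the source remains intact
# and fail-back is trivial. Names and structure are hypothetical.

def migrate_extents(source, target):
    """Copy every extent to the target; switch over only after a verified copy."""
    for i, extent in enumerate(source):
        target[i] = extent           # copy; the source is never touched
    if target != source:             # verification failed: fail back
        return source                # source was never modified, so it is safe
    return target                    # verified copy becomes the active set

vdisk = [b"ext0", b"ext1", b"ext2"]   # extents of the volume being migrated
new_mdisk = [None] * len(vdisk)       # destination extents on the new array
active = migrate_extents(vdisk, new_mdisk)
```

Contrast this with a move-based scheme, where a mid-migration failure leaves data split across both sides and an external arbiter (such as a quorum disk) is needed to decide which copy is authoritative.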

    But let’s say you need to do a cluster migration due to capacity or performance issues. This is what I’d call “90% non-disruptive.” From the SVC side, it is an entirely non-disruptive operation. The disruptive side actually comes from the host. On the SVC, you simply create a Metro mirror of the data to be migrated between clusters. But this presents an exactly identical disk to the host twice, which obviously doesn’t work. To complete the migration, you need to stop IO, switch the mirror direction, and remount from the new cluster. Typically, it translates to 5-15 minutes of downtime at the host. However, from the perspective of the SVC, it’s entirely non-disruptive. But again, if you’re only doing a hardware upgrade, it’s unnecessary.
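    The "90% non-disruptive" cutover described above can be sketched as an ordered runbook. This is a conceptual outline only; the step names are illustrative and the real SVC CLI commands differ. It shows where the short host outage falls in the sequence: everything is non-disruptive except the brief stop/reverse/remount window.

```python
# Sketch of the cluster-migration cutover described in the comment: mirror the
# data between clusters, then take a short host outage to reverse the mirror
# and remount. Step wording is illustrative, not actual CLI syntax.

def cluster_cutover(host, old_cluster, new_cluster):
    steps = []
    # Non-disruptive: replicate the volumes between the two clusters.
    steps.append("create Metro Mirror: %s -> %s" % (old_cluster, new_cluster))
    # The same disk cannot be presented to the host twice, so the host must
    # briefly stop IO here -- this is the short disruptive window.
    steps.append("host %s: stop IO and unmount" % host)
    steps.append("reverse mirror so %s becomes primary" % new_cluster)
    steps.append("host %s: remount from %s" % (host, new_cluster))
    return steps

steps = cluster_cutover("db01", "cluster-A", "cluster-B")
```

Laid out this way, it is clear the disruption is a property of host-side device presentation, not of the storage layer itself.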

    Certainly the SVC is not without its limitations and problems at times – like David, I know of an institution that experienced a multi-day outage due to difficulties during an upgrade. However, their outage occurred not due to design flaws, but rather an edge case bug. If we include those situations, there isn’t a vendor out there I haven’t heard of causing major outages or data loss. We like to cover absolutely all our bases – what about hardware failure? What if the power fails? What if, what if, what if? But as the complexity of storage has grown by leaps and bounds, it’s simply impossible to cover absolutely every possible failure scenario.

    The ultimate limitation of the SVC is the storage behind it and the host in front of it, because it’s not storage itself. IBM is notoriously bad at explaining this. If you need to do a firmware upgrade on a DS4k behind an SVC, you must offline the disks at the SVC or shut down the SVC, because the DS4k is disruptive. If you have an AMS2500 behind it, the opposite becomes true – firmware is non-disruptive. A failure in your storage can and will result in problems at the SVC.

    Standard Disclaimer: I’m not an IBM employee, and IBM still isn’t giving me free stuff.

  • Great post Phil. I think you've done a terrific job of underscoring the complexity of this issue. I would agree that SVC is among the best, and constantly improving. Thanks for weighing in.

  • Pingback: Christophe Bertrand » Blog Archive » Migration – what they don’t tell you

  • sumeetkm

    I think there are two separate points being discussed here:

    1. Non-disruptive migration of data (within an array, across arrays, within the federation…)
    2. Non-disruptive upgrade of the migration technology itself.

    I believe both these points are important, but just mixing them together under a “data migration” topic may not be the best way to describe the issues.
