David Vellante with Tina Rose
In late 2007, HP's Customer Focused Testing (CFT) Group initiated a project to understand the best way to configure an Oracle 11g database on the HP EVA array for replication. HP wanted to share best practices with customers for the major elements of the deployment, namely servers, storage, interconnect, and the database itself. As well, HP wanted to understand how replication methods (e.g. synchronous, asynchronous), bandwidth, latency, and distance affect replication behavior.
The basic premise of this Peer Incite is that by taking advantage of HP's effort and leveraging its recommendations, customers can make better technology choices for their specific environments, optimize performance, speed implementation, cut costs, and reduce implementation risks.
The Project
HP replicated data between two HP Enterprise Virtual Array (EVA) 8000s connected by Fibre Channel over Internet Protocol (FCIP). The primary site (Site A) ran an Oracle 11g Real Application Clusters (RAC) database using Automatic Storage Management (ASM). This was replicated to an Oracle 11g single-instance database using ASM at the backup site (Site B).
Figure 1 depicts the solution implemented by HP. By following Oracle best practices and tuning the database to its most efficient settings, HP improved the base performance of the OLTP workload selected for the project by 16%. Additionally, for recovery purposes, HP configured two EVA disk groups and two ASM disk groups, with the main online files in the first group and the backup files in the second. HP used a two-controller configuration with 12 disk enclosures holding 168 146GB drives spinning at 15K rpm, with each LUN configured as RAID 1. The backup disk group comprised only 32 physical devices because backup data is accessed far less frequently; in theory it could be configured using RAID 5 and lower spin-speed drives.
Best Practice
One critical finding of this project was the recommendation to understand your specific environment for Oracle replication. Specifically, customers should evaluate five key attributes that capture recovery goals, business objectives, and technical constraints:
- Recovery point objective (RPO) – the amount of tolerable data loss;
- Recovery time objective (RTO) – the maximum time to recover from a primary site failure;
- Bandwidth of the intersite link and other traffic contention for the connection;
- Latency – the round trip delay on the replication link;
- Workload – in particular the write intensity of the application and workload and its peaks/valleys.
Understanding these business and technical attributes will lead to the correct choice of replication technology: synchronous, asynchronous, or a variation of these (e.g. enhanced asynchronous).
To ensure successful recovery, customers are advised to separate database files across two disk groups on the array, paired with two ASM disk groups. Place the online files in the main group and the backup files in the secondary group, and consider less expensive disk devices and protection schemes for the backup group if warranted. Note that if the flashback area is configured in Oracle 11g, Oracle will place a mirrored copy of the online redo logs onto your backup disk group; this mirrored copy can be removed to ensure best performance.
The choice of replication technology can have performance impacts that customers should understand. Specifically, the amount of data pushed through the link, the link bandwidth, and the associated latencies can dramatically and detrimentally affect performance in a synchronous environment. Asynchronous replication will maintain performance as latencies increase but has the drawback of creating greater exposure to data loss as write data queues up in the write history log.
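The difference between the two modes can be sketched with a simple model. This is an illustration only, not HP's test methodology: it assumes a synchronous write is acknowledged only after one full round trip to the remote array, that an asynchronous write is acknowledged locally, and that writes in a single stream are issued one after another. The function names and sample figures are hypothetical.

```python
def write_time_ms(local_write_ms, round_trip_ms, synchronous):
    """Time for one write to be acknowledged to the application.

    Assumption: synchronous replication adds one full round trip on the
    intersite link to every write; asynchronous replication adds none.
    """
    if synchronous:
        return local_write_ms + round_trip_ms
    return local_write_ms


def max_serial_writes_per_s(write_ms):
    """Upper bound on writes per second for one strictly serial stream."""
    return 1000.0 / write_ms


# Hypothetical figures: 1 ms local write, 9 ms round trip on the link.
sync_ms = write_time_ms(1.0, 9.0, synchronous=True)
async_ms = write_time_ms(1.0, 9.0, synchronous=False)
print(sync_ms, max_serial_writes_per_s(sync_ms))
print(async_ms, max_serial_writes_per_s(async_ms))
```

Under these assumptions the synchronous stream slows as link latency grows while the asynchronous stream does not, which is the trade-off described above; real arrays overlap many writes, so actual throughput will differ.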
For Oracle 10g or 11g replication environments, customers should review test data from HP and from any other vendor under consideration before acquiring technology, in order to determine the configuration that best meets business requirements (Replication Best Practices for Oracle 11g with ASM & EVA8x00).
Advice for administrators
Basic database tuning allowed HP to improve OLTP workload performance by approximately 15%. Choosing the appropriate bandwidth for your workload is also fundamental. As an example, HP saw a 17% improvement in application performance when upgrading the link from OC6 to OC9. HP's findings suggest synchronous replication should be specified for latencies of 20 ms or less (ideally below 10 ms), with sufficient bandwidth so as not to degrade application performance. The rule of thumb of 1 ms of latency added for every 100 kilometers, over a base minimum of, say, 4 ms, is a reasonable starting point, but users should be warned that mileage will vary depending on the number of switches in the network route, line noise, and a variety of other factors.
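The rule of thumb and the latency thresholds above can be turned into a quick first-pass check. The sketch below is only that: the function names are hypothetical, the 4 ms base and 1 ms per 100 km come from the rule of thumb, and the 10 ms and 20 ms cut-offs come from HP's findings as quoted above. Measured latency on the actual link should always take precedence.

```python
def estimated_latency_ms(distance_km, base_ms=4.0, ms_per_100_km=1.0):
    """Rule-of-thumb link latency: a base minimum plus 1 ms per 100 km."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    return base_ms + ms_per_100_km * distance_km / 100.0


def synchronous_fit(latency_ms, ideal_ms=10.0, max_ms=20.0):
    """Classify a latency against the thresholds quoted in the text."""
    if latency_ms < ideal_ms:
        return "ideal"
    if latency_ms <= max_ms:
        return "acceptable"
    return "too high"


# Hypothetical site separations in kilometers.
for km in (100, 500, 1200, 2500):
    latency = estimated_latency_ms(km)
    print(km, latency, synchronous_fit(latency))
```

With these defaults, a 500 km link is estimated at 9 ms and a 1,600 km link at 20 ms, which is the edge of the range suggested for synchronous replication.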
In addition, the following specific guidelines for storage administrators warrant consideration:
- Create at least two disk groups with multiple LUNs for the database.
- Data consistency is critical. Create one or two data replication groups (or data consistency groups), depending on the number of applications being replicated.
- Balance multiple data replication groups across controllers.
- Avoid filling the write history log by appropriately sizing the log based on the RPO requirements of the business.
- Ensure the link between sites has adequate bandwidth for the workload being replicated.
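The last two guidelines can be roughed out numerically before any hardware is ordered. The sketch below is a simplified model, not HP's sizing method: it assumes the write history log grows at the rate by which application writes exceed what the link can carry, treats RPO as a tolerable amount of unreplicated data in MB, and uses an arbitrary 30% bandwidth headroom for other traffic. All names and figures are hypothetical.

```python
def backlog_mb(write_mb_per_s, link_mb_per_s, duration_s):
    """Data queued in the write history log after a sustained burst.

    Assumption: the log grows only while writes outpace the link.
    """
    return max(0.0, write_mb_per_s - link_mb_per_s) * duration_s


def within_rpo(queued_mb, rpo_mb):
    """True if the queued (unreplicated) data stays within the RPO."""
    return queued_mb <= rpo_mb


def link_is_adequate(peak_write_mb_per_s, link_mb_per_s, headroom=0.3):
    """True if peak writes fit in the link after reserving headroom."""
    return peak_write_mb_per_s <= link_mb_per_s * (1 - headroom)


# Hypothetical workload: 50 MB/s write peak for 10 minutes on a 30 MB/s link.
queued = backlog_mb(50, 30, 600)
print(queued, within_rpo(queued, rpo_mb=5000), link_is_adequate(50, 30))
```

In this example the burst queues 12,000 MB, so the log would need at least that much space, and the exposure exceeds a 5,000 MB RPO; either a faster link or a different replication mode would be needed.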
Action Item: HP's CFT initiatives, and others like them, represent some of the best customer freebies in the business. Based on real-world, customer-initiated implementations, these best practice guidelines can save substantial time and money and help users avoid critical mistakes. Storage executives managing projects should ask technical staff three questions: 1) are such best practices available and have you read them, 2) are they being followed, and 3) where and why do you differ?