Hi Brendan,
The index rebuild process reads data, then writes an index entry for each row element being indexed. It gets more complex when dealing with more complex index structures like composite indexes, foreign key relationships, etc., but the general activity is read a block, then write a block. While there is no room for large-block I/O here, there is room for pre-fetch, since the data can be, and usually is, read sequentially. So this activity typically yields high read cache hit rates, as the XIV can usually keep ahead of the I/O requests via pre-fetch on the back end.
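To make the pattern concrete, here is a toy Python sketch (not XIV internals; the block structure and names are invented for illustration) of a rebuild loop that pre-fetches the next block while the current one is being indexed:

```python
from concurrent.futures import ThreadPoolExecutor

def read_block(storage, i):
    # Read one data block (here just a slice of an in-memory list).
    return storage[i]

def rebuild_index(storage, nblocks):
    # Read a block, write its index entries, with the next read already
    # in flight -- the same overlap XIV pre-fetch provides a sequential
    # scan on the back end.
    index = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(read_block, storage, 0)
        for i in range(nblocks):
            block = nxt.result()
            if i + 1 < nblocks:
                nxt = pool.submit(read_block, storage, i + 1)  # pre-fetch
            for row in block:
                index.setdefault(row, []).append(i)  # write index entry
    return index

print(rebuild_index([["a", "b"], ["b", "c"]], 2))
```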
There are some techniques we use to determine where the bottleneck is.
First, look at the statistics on the XIV, filtered to just this server's activity, and review the read cache hit ratio. This is done by clicking on IOPs, then on read, then on hit, and then on miss. The ratio compares the two: add the hit and miss IOPs, then divide the hit IOPs by that total. Once the ratio is established, some simple math gives what we would call the anticipated response time.
A typical single-block read miss usually takes 15-18ms, sometimes higher depending on how busy the entire XIV frame is. A typical single-block read hit usually takes 200-400ns.
So, if you are enjoying an 80% read cache hit rate, your anticipated average read response time should be roughly 3.8ms. If you look at your XIV statistics and see significantly higher than 3.8ms, that is an indication that the latency is being introduced elsewhere, outside the XIV frame. It is that simple.
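Here is a minimal Python sketch of that arithmetic. The service times are assumptions based on the typical figures above (hit time expressed in milliseconds; it is small enough that the miss rate dominates either way):

```python
def read_hit_ratio(hit_iops, miss_iops):
    # Read cache hit ratio: hits divided by total reads.
    return hit_iops / (hit_iops + miss_iops)

def anticipated_read_ms(hit_ratio, miss_ms=18.0, hit_ms=0.3):
    # Weighted average of hit and miss service times; the miss time
    # dominates, so the hit ratio drives the result.
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

ratio = read_hit_ratio(hit_iops=8000, miss_iops=2000)  # 0.80
print(f"hit ratio: {ratio:.0%}")
print(f"anticipated read response: {anticipated_read_ms(ratio):.1f} ms")
# ~3.8 ms at an 80% hit rate; ~11 ms at 40% with an 18 ms miss time,
# or ~9 ms with a 15 ms miss time, in line with the figures here.
```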
If your average read cache hit rate is 40%, then the anticipated average read response time should be 9-10ms. Still not bad, but it could be better. Cache hit characteristics can be hurt by host-based striping (at the logical volume level). Since we are already striping and mirroring all data on XIV anyway, there is no need to add an additional striping layer at the host: it fragments the I/O profile and ends up looking like random I/O to XIV. XIV has a very sophisticated caching algorithm that tries to keep ahead of all workloads, but it is possible to defeat it by randomizing the I/O workload.
The next biggest culprits are zoning and HBAs. A single 4Gb HBA port can push no more than about 400MB/s, so if you are still using slower devices, this could also be your limiting factor.
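The sizing math is simple; a quick sketch (the 1.2GB/s target is an invented example, not a recommendation):

```python
HBA_PORT_MBPS = 400  # practical ceiling of one 4 Gb Fibre Channel port

def ports_needed(target_mbps, port_mbps=HBA_PORT_MBPS):
    # Minimum number of HBA ports to sustain a target throughput.
    return -(-target_mbps // port_mbps)  # ceiling division

print(ports_needed(1200))  # a 1.2 GB/s scan needs at least 3 such ports
```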
The next thing I would look at is the data layout. Typically the best layout for an Oracle database, or any database for that matter, is to start with 6-10 large LUNs for data, and separate LUNs for logs and log groups. You need to separate these because the I/O profile of log I/O is dramatically different from data and index I/O: log entries are typically written as short-stroke 512-byte blocks. If you are using Oracle ASM, this is not as much of an issue, but if you are using file systems, any file systems, the separation can make a dramatic difference.
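Purely as an illustration (the LUN count, sizes, and names below are hypothetical, not a prescription), a layout along those lines might be sketched like this:

```python
def plan_layout(db_size_gb, data_luns=8, log_lun_gb=64):
    # Equal large data LUNs (8 falls within the 6-10 guideline above),
    # with redo logs kept on their own LUNs.
    lun_gb = -(-db_size_gb // data_luns)  # ceiling division
    plan = [f"DATA_{i:02d}: {lun_gb} GB" for i in range(1, data_luns + 1)]
    plan += [f"REDO_A: {log_lun_gb} GB", f"REDO_B: {log_lun_gb} GB"]
    return plan

for lun in plan_layout(4000):
    print(lun)
```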
The next thing to look at is queue depths. There is a great paper written by an IBMer that goes into great detail on queue depths, which can be found here:
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745
This document is specific to AIX, but in general you need to be mindful of queue depths no matter the O/S. Larger queues allow the database to make judicious use of async I/O, a "fire-and-forget" style of I/O that can come from a database in large batches of many small I/Os. Each of the device queues (either HBA- or LUN-based) temporarily holds the I/Os until they are sent down to the storage. If the queues are too small, the O/S must tell the database or application to stop sending until the queue is sufficiently drained. This start/stop behavior is extremely costly. You never want a situation where the database has to stop and wait for anything. Ever.
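A rough way to reason about how deep the queues need to be is Little's Law (I/Os in flight = I/O rate x service time); a minimal sketch, with the IOPS and service-time figures as illustrative assumptions:

```python
def outstanding_ios(target_iops, service_time_ms):
    # Little's Law: average I/Os in flight = rate * service time.
    return target_iops * (service_time_ms / 1000.0)

# A LUN expected to sustain 5,000 IOPS at a ~2 ms service time carries
# about 10 I/Os in flight, so its queue depth must be at least that,
# with headroom for bursts -- otherwise the O/S throttles the database.
print(outstanding_ios(5000, 2.0))  # -> 10.0
```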
A word about file systems. Database deployment on file systems is simple, and allows complete visibility to system admins and DBAs alike. But there are physical limitations on how much throughput is supported through a single mount point. All O/S's are different, but each has a limit; for example, Linux ext3 tops out around 300MB per second per file system. This is one of the most common performance issues we run into: somehow word got out that the best configuration for a single database on XIV is a single LUN and a single file system. We have worked with several customers this year with this problem. By adding more LUNs and more file systems and spreading the load, we were able to get their average db file sequential read from 15ms down to 2ms, a huge improvement.
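The spreading arithmetic is the same ceiling logic as with HBA ports; a sketch using the ext3 figure above (the 1.5GB/s workload is an invented example):

```python
FS_CEILING_MBPS = 300  # approximate ext3 per-mount-point limit noted above

def mounts_needed(target_mbps, ceiling=FS_CEILING_MBPS):
    # Minimum number of file systems so no single mount point saturates.
    return -(-target_mbps // ceiling)  # ceiling division

n = mounts_needed(1500)
print(n, "mounts at", 1500 / n, "MB/s each")  # 5 mounts at 300.0 MB/s each
```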
Finally, there is something you can do on the XIV side to improve performance. If you have a partially populated XIV, or PPOP as we call it, adding more modules adds not only more spindles but also more overall cache for the workload, thus improving the cache hit rate of any workload on the system.
One final note. A Gen3 XIV also has the ability to support SSDs as a secondary cache layer. Up to 6TB of additional read cache can be added to the frame to further increase read cache hit rates. For busy databases, we highly suggest this option as a way to ensure that random I/O profiles are highly optimized on XIV.
If this doesn’t address your issue, please contact your IBM Software Support representative for additional assistance. Depending on your location, you can find the appropriate contacts by country here: http://www.ibm.com/planetwide/region.html
You can also visit the IBM XIV technical support page here: http://www-947.ibm.com/support/entry/portal/Troubleshooting/Hardware/System_Storage/Disk_systems/Enterprise_Storage_Servers/XIV_Storage_System_%282810,_2812%29
We hope that helps. Thank you.
Posted By:AllenM| Fri Nov 09, 2012 09:31
Nice write-up on database tuning, Allen! I think you meant to say a read hit is 200-300 microseconds, not nanoseconds ;-).
Brendan,
It is not clear to me whether this is an XIV Gen 2 or Gen 3. From your post, "We've had an XIV system for a couple of years," so it could be either. Your post also mentioned 86% waiting on sequential read. If your system is in fact a Gen 2, then best case you might get 2GB/second out of it, compared to a Gen 3, which might approach 8-9GB/second (based on 100% sequential read benchmark data). The gigabit interconnects between the modules on a Gen 2 system are a limiting factor in its bandwidth capabilities. Getting back to your original question, you may be asking a bit much of your system if it is a Gen 2.
Earnie
Posted By:Earnest Henderson| Fri Nov 09, 2012 10:21