Overview
In a recent Wikibon study of performance optimization in Oracle Database environments, Virtualization of Oracle Evolves to Best Practice for Production Systems, we determined that storage admins, DBAs, and staff responsible for negotiating with Oracle need to coordinate strategy to optimize Oracle licensing costs. Wikibon found that by reducing IO response time and variance through better storage and servers, Oracle users can improve application performance, reduce DBA headaches, and optimize Oracle license spend.
While cost and performance are key considerations, we also explored best practices for storage management and execution in mission critical database environments and their implications for application service delivery and troubleshooting. Our respondents characterized the basic dialogue between the DBA and Storage Admin sides of the house as going something like the following:
- DBA to Storage Admin: “Look, I have responsibility for delivering application performance of our critical business applications. Just present the raw storage to me and I will manage it.”
- Storage Admin to DBA: “Here’s your allocation of storage. Don’t touch or tune anything. I need to understand what’s going on behind the curtain. Call me if you have a problem.”
Underneath this talking past one another is the distinction between storage provisioning and allocation vs. monitoring and troubleshooting application performance. Provisioning and troubleshooting are different processes that require different approaches. The art is devising governance and systems that accommodate both processes effectively. Moreover, unified solutions like Exadata can solve some problems but can exacerbate these issues in many situations.
Provisioning
Provisioning for databases and apps at the outset involves:
- Application DBAs who determine the business requirements and SLAs of the application
- Infrastructure DBAs who are responsible for determining the most appropriate database and application configurations, including performance requirements, file layouts, capacity requirements, file structures, etc.
- Storage Admins who determine how the storage will meet these requirements: appropriate tiers, backup policies and methods, redundancy, etc.
One IT professional described the process as follows; it is a fairly common approach:
- “We’ve got an automated online request system for DBAs for handling storage provisioning for a new database. For production apps, they will have their storage and database ready to go within a week or two of the request. If it’s for smaller scale test and dev we provide automated provisioning. Then the storage and infrastructure DBAs work together to make sure that they have enough storage and that they’re setting up servers appropriately. The application DBA handles the request and managing the performance and setting up the tables and all the information as it relates to the operational layer.”
Troubleshooting
But when application performance fails to meet SLAs, it’s the Application DBAs who are responsible for finding and fixing the problem in a timely manner. At this point they engage all the relevant groups to explore the root cause: rogue queries, application code, database performance, network, upgrade-related problems, storage, etc. The approach many enterprises follow is reflected in the following observation from one of our respondents:
- “The reality is the DBA has to have visibility to solve the business problem when something bad happens. The application owner has to have all the end-to-end metrics that they need to see IOP. They need to be able to go into storage and into the web services and other domains and look at these pieces that may affect their database performance.”
Storage admins sometimes react to the specter of DBA engagement as follows:
- “You really don’t want a DBA who eats, drinks and sleeps databases to be allocating storage off a frame in an emergency. It does not work well because they don’t understand the potential implications on the rest of the frame, workload allocations, etc. For instance, we had a problem with a SAS job that typically took 17 hrs. to run. The DBA reallocated the storage so the job was split between slow and fast disk and SSD in a manner that it took 52 hrs. to run.”
So What to Do?
When there’s a mission critical application problem, DBAs need control, or at least the visibility, to get on top of the problem and drive the solution. Best practice is to give the DBA visibility when there’s a problem so that they can act to restore application performance as quickly as possible. Respondents in our study offered differing perspectives along the transparency vs. control dimension:
- “So, we tell our storage admins “Look, they’re not going to touch the switches, they’re not going to touch anything on the firmware side, but they need to be able to allocate and move workloads around and storage around and pools of storage as needed by these tools.” It’s been a little challenging, and people feel like they’re stepping on each other’s toes. In the end, it’s not really a technology issue at all – it’s a people issue.”
- “Regarding DBAs having enough control over their storage assets, there is a natural tension between the storage admin and infrastructure DBA. We worked this through by deciding if we’ve got business performance issues you’d better make sure that the people who are responsible for that have full access across the infrastructure layers to do their job. We restrict that to certain individuals who need to be able to view what’s going on. They don’t necessarily have rights to manage their own storage or to do the same thing that the infrastructure DBAs are doing as far as the initial table spaces and the subs.”
- “DBAs are a lot happier now that we’ve given them access across all the layers. Before they had all this finger pointing going on. Giving one person the capability to span across different layers kind of resolved all that. Once you’ve developed the scripts and are able to pull that across, it’s kind of easy.”
- “I think giving them the tools to do their job is key to making them happy. Once you’ve done that you’ve gone a long ways to making them happy because then it’s not as much of a struggle. You’re able to do your job and accomplish what you need to do, especially when you’re in a crunch situation. Our view is “What are the things we need to do to improve the way you accomplish your job and what tools do we need to give you?””
- “They need to be able to go into storage and into the web services and other domains and look at these pieces that may affect their database performance. From a DBA standpoint they basically just have command-level access where they can’t change anything, but they can at least view what’s going on.”
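The split between visibility and control that respondents describe can be pictured as a small role-to-permission map. The following is a minimal sketch only: the role names, action names, and `POLICY` table are hypothetical labels chosen for illustration, not the interface of any storage product or of any respondent's tooling. It models the view-only variant from the last comment above; a shop following the first respondent's approach would also grant the DBA role the allocate action.

```python
# Hypothetical action names, chosen for illustration only.
VIEW = "view"          # read metrics and configuration across layers
ALLOCATE = "allocate"  # move workloads and pools of storage
HARDWARE = "hardware"  # touch switches or firmware

# Assumed policy table: storage admins keep full control, while DBAs
# who are troubleshooting get read-only visibility.
POLICY = {
    "storage_admin": frozenset({VIEW, ALLOCATE, HARDWARE}),
    "troubleshooting_dba": frozenset({VIEW}),
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role has been granted the action.

    Unknown roles are granted nothing.
    """
    return action in POLICY.get(role, frozenset())
```

With this table, `is_allowed("troubleshooting_dba", VIEW)` is true while `is_allowed("troubleshooting_dba", ALLOCATE)` is false, which mirrors the "can't change anything, but can at least view" arrangement.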
Considering Skills
Roles and responsibilities are one thing, but skills are also an important consideration when deciding where control over the troubleshooting process should lie.
- “The database team is more application and SLA-centric. They are very experienced and very expensive people. Moreover, database licensing is very expensive, so in terms of staff skills, that’s where we spend our money.”
- “The database staff are typically your most seasoned professionals. While we have good tools and capable staff to do the performance capacity speed-and-feed-type storage management, the depth of knowledge of the latter team is not as mature. They often have to rely on the array vendor to find hot disks, etc. which takes time. They need to have (or more effectively use) automated tools.”
Action Item: DBAs responsible for mission critical business applications are in a high-risk position. They rightfully need control and visibility commensurate with their responsibilities, and they tend to have the experience and skills to effect timely and accurate solutions to application performance problems. Storage and Application teams need to work together to establish access to storage reporting tools and appropriate levels of control (optimally as a component of a single pane of glass management console) so that application owners can rapidly solve business critical application problems and restore appropriate levels of performance.