I’ve had the pleasure of installing VMware Site Recovery Manager (SRM) all over the USA at enterprise organizations and I’ve seen a consistent pattern emerge on SRM implementation projects.
Technical DR Success (TDRS) is good enough for most organizations. Let me define what I mean by TDRS in the context of SRM.
Putting Together the Technical Pieces
SRM is a great workflow product that has to successfully glue together many technologies such as:
- Two vCenter servers (one at the Primary Site, one at the Recovery Site)
- Two SRM servers (one at each site)
- Two storage arrays (one at each site)
- Four different databases if only using array-based replication, six databases if adding vSphere replication
- If using vSphere replication, add 4 virtual appliances
That is a lot of different pieces to work with, and once everything is connected and functional, we start testing. For array-based replication, we work with non-production replicated volumes so that we can test a full SRM failover during the project. That means stopping replication between the sites, reversing replication, and recovering back to the primary site as well. Once we have done failovers and failbacks multiple times, we have achieved a functional Technical DR Success (TDRS).
Production Ready
Once we have TDRS, the non-production test volumes we were using become production DR volumes. The admins then move, or Storage vMotion, production virtual machines to the replicated volumes and configure SRM protection.
General Business Continuity Tiers are Good Enough
The replicated volumes are most often identified to the application owners simply as Tier 1 to Tier x application ready, each with a different Service Level Agreement (SLA), because that seems to be the most granular level of Business Continuity definition many organizations have. This is understandable. Even in the best-case scenarios, the majority of companies are coming from DR with physical machines that all have to be restored from tape, and the DR plan puts applications into simple Tiers as generalized application groupings. Often no further level of granularity exists, since part of the physical DR plan is to have the application owners involved in the server restoration process so they can troubleshoot as they go along.
In fact, I’ve had many conversations with people who rely on physical DR, and their technical teams consistently conclude that it either won’t work at all or would take a huge amount of effort and time to work with even a little success. Therefore, not much effort is expended on granular startup sequences beyond general groupings of servers by best-guess application type.
Great Features and the Law of Diminishing Returns
One of the core features (and selling points) of SRM is that it can shut down and start up the servers in any sequence you want. You can have multi-tiered applications start up in the proper order and run scripts during shutdown and startup, with no user interaction after the failover is initiated. You can have multiple recovery plans and grouped recovery plans. It’s really cool stuff. This level of failover automation obviously takes a fair amount of planning and testing, but it certainly is technically possible with SRM.
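To make the idea of a start up sequence concrete, here is a minimal sketch in Python. It is a toy model, not SRM’s API or data format: the `startup_order` function, the `priority` and `depends_on` fields, and the VM names are all hypothetical. It starts lower-numbered priority groups first and, within a group, honors dependencies between VMs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def startup_order(vms):
    """Return VM names in the order they should be powered on.

    `vms` maps a VM name to {"priority": int, "depends_on": [names]}.
    This is an illustrative model only, not how SRM stores recovery plans.
    """
    order = []
    # Lower priority numbers start first.
    for priority in sorted({vm["priority"] for vm in vms.values()}):
        group = {name for name, vm in vms.items() if vm["priority"] == priority}
        sorter = TopologicalSorter()
        for name in sorted(group):
            # Only dependencies inside the same group need sorting here;
            # dependencies on earlier groups are already satisfied.
            deps = [d for d in vms[name]["depends_on"] if d in group]
            sorter.add(name, *deps)
        order.extend(sorter.static_order())
    return order


# Hypothetical three-tier application plus a reporting server.
EXAMPLE = {
    "db": {"priority": 1, "depends_on": []},
    "app": {"priority": 2, "depends_on": ["db"]},
    "web": {"priority": 2, "depends_on": ["app"]},
    "report": {"priority": 3, "depends_on": []},
}

if __name__ == "__main__":
    print("Start up:", startup_order(EXAMPLE))
    # Shutdown is simply the reverse of start up in this model.
    print("Shut down:", list(reversed(startup_order(EXAMPLE))))
```

Even in this small example, somebody has to know that the web tier depends on the app tier. Collecting that knowledge for every application is where the planning effort goes.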
But I don’t see this level of granularity happening consistently. I think there are a number of contributing factors that I will discuss below.
Relative Recovery Time Objective (RTO)
The prevailing thought about a disaster with virtualized DR is “There is already a downtime, this is just much shorter.” The fact that the business still has its data intact after the disaster trumps everything else.
Most companies are happy with identifying applications into Tiers 1, 2, 3 and stopping there. Start up sequences are great if they know them, but not critical. Just get them protected in their Tier level and move on to the next project.
Those of us who know the technical capabilities and elegant nuances of a product sometimes forget that just because a product can do something, it doesn’t mean the business will see enough incremental value to allocate the resource time to automate the failovers. After all, you can always restart a service once all the servers are recovered.
Considering most came from the physical world with tape restoration as the primary recovery method, restarting a service or rebooting a server to get the application running again is not that big of a deal.
The Bigger Groups Slow Down Productivity
The typical team that I have worked with to do the actual technical installation is 1-3 people from the client. Of course, this is normally after I’ve had a series of planning meetings with other people from the organization. Many times there is one primary point person involved in the installation, because the technical components we need (servers, storage, networking) are routine requests for the departments that provide them. The team is small and efficient, and we accomplish a whole lot in just a short engagement.
Consider that each application running on each of the servers is managed by at least one application team, often multiple functional teams. Many applications have tiers and each tier of the application is supported by different groups. It is easy to introduce 3-5 more people to support one application.
I did a quick search and this study details what we’ve all seen to be true. The larger the group, the slower it is to implement a solution.
“This research investigated the impact of small and large work groups on developmental processes and group productivity. There were 329 work groups operating in for-profit and nonprofit organizations across the United States in this study. Groups containing 3 to 8 members were significantly more productive and more developmentally advanced than groups with 9 members or more. Groups containing 3 to 6 members were significantly more productive and more developmentally advanced than groups with 7 to 10 members or 11 members or more. The groups with 7 to 10 members or 11 members were not different from each other. Finally, groups containing 3 to 4 members were significantly more productive and more developmentally advanced on a number of measures than groups with 5 to 6 members. Work-group size is a crucial factor in increasing or decreasing both group development and productivity.”
Nobody Owns the Application
This may surprise some people, but I’ve done more than one project where the developer or application owner no longer worked for the company and nobody understood what the application did or whether it was important. These servers get put into Tiers if they are believed to be important.
DR is an Insurance Policy
SRM is a great migration tool that is most often used to move data centers for projects, but the main thing about SRM, or any DR tool, is that you hope never to have to use it. The members of the client’s deployment team are almost always production operational support or project resources. They are normally very busy people. Grouping applications and streamlining their startup sequences is a project that gets kicked down the road in favor of more important problems that are hurting them today.
A Matter of Perspective
Virtualization itself took years to gain enough market share to become the standard, to the point where we now expect higher uptimes and better management of our servers. In the same way, I think it will take a few years before enough organizations have a virtualized DR solution that they start expecting applications to be available regardless of whether the primary site is available. Right now, businesses are ecstatic to know that they have the capability to survive a disaster at their primary site.
The technical details of application groupings, start up sequences and automation are of much lesser value when you compare them against the business survival accomplishment of virtualized DR. These things are mostly only appreciated by us virtualized DR aficionados.
I’m okay with that.