I keep getting asked about the differences between Zerto and SRM, so I did a blog post on the Zerto site discussing 5 differences. I think it’s a fair assessment of the two products now that I’ve had some time to work with Zerto in production. http://www.zerto.com/blog/general/5-differences-between-zerto-and-srm
I read an article where Steve Wozniak talked about his concerns with cloud computing. He raises really good points about the fact that you are entrusting your data entirely to someone else.
Additionally, it is not uncommon these days for the data itself to be the most valuable asset a company possesses, so the idea of not controlling that data behind your own firewall isn't very attractive to the business ownership or the IT department.
It's unclear what percentage of companies are willing to let someone else house their most valuable assets, but I suspect that Wozniak speaks for the majority of business owners and managers right now.
Just Win, Baby
American Football team owner Al Davis was famous for his motto “Just win, baby”.
What cloud services need are consistent wins. As a cloud service provider, one of the biggest, most consistent wins can be with Disaster Recovery (DR). Disaster recovery solutions are a perfect use case for cloud computing.
Here are a few thoughts on why I think DR to the Cloud may be the best first solution for companies wishing to leverage cloud computing.
- It addresses the Wozniak concern: Your company maintains control of the data at your privately-owned primary site. You are just replicating a copy to the cloud to use in case of emergency.
- You control your own Service Level Agreements: Regular production uptime at the primary site is still under your company's control, on your own servers, maintained by your team to meet your Service Level Agreements (SLAs).
- Cloud-based DR can save you money: DR to the Cloud is like an insurance policy and hopefully you never have to fail your production over to the disaster recovery site. Since most cloud-based solution models are paid by the usage of the equipment and assets, the relative cost for maintaining a cloud-based disaster recovery site should offer significant savings over privately hosted DR sites.
- No storage arrays needed: Historically, a real constraint for virtualized cloud-based DR has been the cost-prohibitive requirement of array-based replication. Now, with the basic host-based replication aimed at small businesses in VMware SRM's vSphere Replication, or the much more capable enterprise-grade host-based replication from Zerto, array-based replication is no longer required. In fact, shared storage isn't even required with host-based replication.
- The DR site can be anywhere: With host-based replication and a cloud provider with multiple locations, you can have your DR set up in a one-to-many configuration, choosing which region to recover each server into.
- It allows for the learning curve: Cloud computing introduces a great deal of new IT logistics and processes that have to be blended into a hybrid solution. If the business ultimately wants a fully cloud-hosted service, DR to the Cloud gives business management time to learn how to deal with external vendors hosting their IT assets without impacting the production environment.
- Testing recovery reliability: Businesses can measure reliability while developing their DR run books by performing failovers and failbacks on test or lower-tier servers.
Apparently, DR to the Cloud service providers see things similarly to the way I do, since Zerto recently announced that over 30 cloud providers have signed up for their new host-based replication Zerto DR to the Cloud solution.
I’ve Seen This Movie Before
I've seen this happen before with virtualization itself, and cloud computing can take a similar adoption path.
For years, virtualization was only performed on the "low-hanging fruit" servers. Even if a server failed because of configuration or other issues, those weren't high-profile, mission-critical server failures that could be blamed on virtualization.
Just as importantly, it gave time for the physical server administrators to learn and to catch up to virtualization’s capabilities. That created an environment where virtualization produced constant wins for the IT organization.
Even though DR is far from low-hanging fruit, the idea is the same: you don't impact production services while building out your DR.
DR to the cloud is the perfect way to build confidence in cloud-based solutions for businesses and over time can help companies discover other areas where cloud solutions fit in their company.
Photo credits: http://wikipedia.com
I’ve had the pleasure of installing VMware Site Recovery Manager (SRM) all over the USA at enterprise organizations and I’ve seen a consistent pattern emerge on SRM implementation projects.
Technical DR Success (TDRS) is good enough for most organizations. Let me define what I mean by Technical DR Success (TDRS) in an SRM context.
Putting Together the Technical Pieces
SRM is a great workflow product that has to successfully glue together many technologies such as:
- Two vCenter servers (one at the Primary Site, one at the Recovery Site)
- Two SRM servers (one at each site)
- Two storage arrays (one at each site)
- Four different databases if only using array-based replication, six databases if adding vSphere replication
- If using vSphere replication, add 4 virtual appliances
That is a lot of different pieces to work with, and once it is all connected and functional, we start testing. For array-based replication, we work with non-production replication volumes on which we can test full SRM failover during the project. That means stopping replication between the sites, reversing replication, and recovering back to the primary site as well. Once we have done failovers and failbacks multiple times, we have achieved a functional Technical DR Success (TDRS).
Once we have TDRS, the non-production replication volumes we were testing with become production DR volumes, and the admins then move or Storage vMotion production virtual machines onto the replicated volumes and configure SRM protection.
General Business Continuity Tiers are Good Enough
These replicated volumes are most often simply identified to the application owners as Tier 1 through Tier x with different Service Level Agreements (SLAs), because that seems to be the most granular level of Business Continuity definition many organizations have. This is understandable: even in the best-case scenarios, most companies are coming from physical-machine DR where everything has to be restored from tape, and the DR plan puts applications into simple tiers as generalized application groupings. Often any further level of granularity doesn't exist, since part of the physical DR plan is to have the application owners involved in the server restoration process, troubleshooting as they go.
In fact, I've had many conversations with people who rely on physical DR, and the technical teams consistently conclude that it either won't work at all or would take a huge amount of effort and time to achieve even limited success. Therefore, not much effort is expended coming up with granular startup sequences beyond general groupings of servers by best-guess application type.
Great Features and the Law of Diminishing Returns
One of the core features (and selling points) of SRM is that it can shut down and start up servers in any sequence you want. You can have multi-tiered applications start up in the proper order and run scripts during shutdown and startup, with no user interaction after the failover is initiated. You can have multiple recovery plans and grouped recovery plans. It's really cool stuff. Accomplishing this level of failover automation obviously takes a fair amount of planning and testing, but it is certainly technically possible with SRM.
But I don't see this level of granularity happening consistently, and I think there are a number of contributing factors, which I discuss below.
Relative Recovery Time Objective (RTO)
The prevailing thought about a disaster with virtualized DR is: "There is already downtime; this is just much shorter." The fact that the business still has its data intact after the disaster trumps everything else.
Most companies are happy to identify applications as Tier 1, 2, or 3 and stop there. Startup sequences are nice to have if they know them, but not critical. Just get the applications protected at their tier level and move on to the next project.
For those of us who know the technical capabilities and elegant nuances of a product, we sometimes forget that just because a product can do something doesn't mean the business will see enough incremental value to allocate the resource time to automate the failovers. After all, you can always restart a service once all the servers are recovered.
Considering most came from the physical world with tape restoration as the primary recovery method, restarting a service or rebooting a server to get the application running again is not that big of a deal.
Bigger Groups Slow Down Productivity
The typical team size that I have worked with to do the actual technical installation is 1-3 people from the client. Of course, this is normally after I've had a series of planning meetings with other people from the organization. Many times just one primary point person is involved in the installation, because the technical components necessary (servers, storage, networking) are not unusual requests for the various departments to fulfill. The team is small and efficient, and we accomplish a whole lot in a short engagement.
Consider that each application running on each of those servers is managed by at least one application team, often multiple functional teams. Many applications have tiers, and each tier of the application is supported by a different group. It is easy to introduce 3-5 more people to support one application.
I did a quick search, and this study details what we've all seen to be true: the larger the group, the slower it is to implement a solution. "This research investigated the impact of small and large work groups on developmental processes and group productivity. There were 329 work groups operating in for-profit and nonprofit organizations across the United States in this study. Groups containing 3 to 8 members were significantly more productive and more developmentally advanced than groups with 9 members or more. Groups containing 3 to 6 members were significantly more productive and more developmentally advanced than groups with 7 to 10 members or 11 members or more. The groups with 7 to 10 members or 11 members were not different from each other. Finally, groups containing 3 to 4 members were significantly more productive and more developmentally advanced on a number of measures than groups with 5 to 6 members. Work-group size is a crucial factor in increasing or decreasing both group development and productivity."
Nobody Owns the Application
This may surprise some people, but I've done more than one project where the developer or application owner no longer worked for the company and nobody understood what the application did or whether it was important. These servers get put into a tier if they are believed to be important.
DR is an Insurance Policy
SRM is a great migration tool that is often used to move data centers for projects, but the primary thing about SRM or any DR tool is that you hope to never have to use it. The client's deployment team is almost always made up of production operational support or project resources. They are normally very busy people. Grouping applications and streamlining their startup sequences is a project that gets kicked down the road in favor of more important problems that are hurting them today.
A Matter of Perspective
Just like virtualization itself took years to gain enough market size to become the standard, where we now expect higher uptimes and better management of our servers, I think it will take a few years before enough organizations have a virtualized DR solution that they start expecting applications to be available regardless of whether the primary site is available. Right now, businesses are ecstatic just to know that they have the capability to survive a disaster at their primary site.
The technical details of application groupings, startup sequences, and automation are of much lesser value when you compare them against the business-survival accomplishment of virtualized DR. These things are mostly appreciated only by us virtualized DR aficionados.
I’m okay with that.
I recently got the privilege to work with Luke Huckaba @thephuck (the first H is silent, btw) building out his company’s SRM 5 deployment.
Their primary vCenter had an actual certificate generated by their security team and installed on the vCenter server. They were doing it the right way, instead of the easy (and lazy) way most of us do it: using the self-signed certificate and clicking Ignore every time we log in to vCenter, until we finally check the install-and-ignore checkbox.
What Luke and I learned was that there wasn't a readily available, centralized step-by-step process for installing the certificates so that vCenter and SRM would work properly. So we documented the steps, and Luke recorded the installation and posted it on his blog here.
I'm going on a tangent here, but: VMware, please provide an earlier warning during the installation process that if the certificates don't match on both vCenters and both SRM servers, you cannot pair the sites.
Right now, the only way to learn that your certificates won't play together is to complete the installation at both the primary and recovery sites, try to pair the sites, and then get an error message that the sites cannot be paired.
Further, if you are installing vSphere Replication and the CA used to sign the .p12 certificate uses MD5, the SRM server is OK with it, but the VRMS is not. The VRMS needs SHA1 to work.
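You can check which signature algorithm a certificate uses before you ever get to the pairing step. Here is a minimal sketch using standard OpenSSL commands; the file names, common name, and password are all hypothetical examples, not values from our deployment:

```shell
# Hypothetical example: create a SHA1-signed self-signed cert, bundle it
# into a .p12, and verify the signature algorithm before handing it to VRMS.
openssl req -x509 -sha1 -newkey rsa:2048 -nodes \
  -keyout srm.key -out srm.crt -days 365 -subj "/CN=srm.example.com"

# Package the cert and key into a PKCS#12 bundle for the SRM/VRMS install.
openssl pkcs12 -export -in srm.crt -inkey srm.key \
  -out srm.p12 -passout pass:testpassword

# Inspect the signature algorithm inside the .p12. VRMS needs SHA1:
# if this reports "md5WithRSAEncryption", have the cert re-issued.
openssl pkcs12 -in srm.p12 -clcerts -nokeys -passin pass:testpassword | \
  openssl x509 -noout -text | grep "Signature Algorithm"
```

In a real deployment the certificate would come from your security team's CA rather than `openssl req -x509`, but the final inspection command works the same way on any .p12 they hand you.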