As offerings from public cloud providers mature, Disaster Recovery (DR) to the cloud is becoming very attractive to many organizations.  By implementing a cloud solution, an organization can reduce the amount of infrastructure it needs and the resources required to maintain and manage the environment.

As usage-based pricing models become more cost effective, real DR starts to become feasible for many companies that could not previously afford to run a duplicate environment at a remote site.  After all, there is no sense in paying for servers that sit in the cloud waiting to be used.

However, Cloud DR is not quite as feasible as the marketing machines would like you to believe. Yes, it is possible, but a significant amount of planning, testing and operational administration is required to make this work. There are a lot of areas that need to be addressed before a company starts this journey.  Topics that must be considered are:

Security

  • Is the data secure during transmission to the cloud and once on the cloud
  • How is the data protected (username/password, two factor authentication, encryption)
  • How are customers segmented in the cloud
  • Does the cloud provider conduct regular security audits and adhere to regulatory compliance
  • Do I need multiple VLANs and firewalls/DMZs between applications
  • Will my security certificates be affected

Network Bandwidth/Latency

  • How long will it take to back up my data to the cloud
  • How much bandwidth (B/W) do I need to keep the data in sync
  • How often can I replicate the data and still meet Recovery Point Objective (RPO) Service Level Agreements (SLAs)
  • In the event of a disaster, how long will it take to restore my data and how much B/W will I need
  • Once the environment is up, how much bandwidth will be required to support the user population
  • Will network latency to the cloud affect user performance
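Several of the bandwidth questions above come down to simple arithmetic. The following is a minimal sketch, not a sizing tool: the function name, the 80% efficiency figure and the example numbers are assumptions chosen for illustration, and real throughput depends on the provider, the protocol and whatever else shares the link.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move data_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable after protocol overhead and competing traffic.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits = data_gb * 8 * 1000**3                  # decimal gigabytes to bits
    usable_bps = link_mbps * 1000**2 * efficiency  # megabits/s to usable bits/s
    return bits / usable_bps / 3600


if __name__ == "__main__":
    # Hypothetical example: a 2 TB initial backup over a 100 Mbps link.
    print(f"{transfer_hours(2000, 100):.1f} hours")
```

The same function can be run in reverse during planning: plug in the amount of data that must be restored and see whether the result fits inside the recovery window you have promised.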

IP re-addressing/DNS

  • Do I need to readdress my servers in the cloud
  • Are the VLANs I need going to be available
  • Will routing between VLANs and security policies be available
  • What changes are required to private and public DNS

Operational recovery

  • Are my recovery procedures documented
  • Do I have an escalation process
  • Do I have contact information for my internet, DNS and certificate providers
  • In what order do I recover my servers
  • Do I have a list of the main application stakeholders so they can test their applications
  • How do my users gain access
  • How do I communicate recovery status to my users, customers, vendors, employees
  • Are access restrictions required during the outage to support operations under limited constraints (bandwidth, latency, system size)

Reliability of the cloud provider and its ability to provide the necessary resources during a DR scenario are critical to the DR plan.  In the case of a major outage such as the East Coast blackout of 2003, 9/11 or Sandy, many organizations may be placing similar demands on your provider.  The provider's ability to meet demand during peak loads such as these is paramount to the success of an actual DR scenario.

Designing a DR plan

There are two main criteria that need to be specified in any DR plan: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO).  The RPO determines how much data my application can afford to lose in the event of a disaster.  This will affect how often my data gets backed up or replicated.  Although it might be desirable to specify an RPO of 10 seconds, the laws of physics and finance may overrule that.  The amount of available bandwidth and the network distance between the production environment and the backup destination will determine the maximum amount of data that can be replicated and how often it can be replicated.  With today’s high speed networks, the amount of bandwidth is limited only by the size of one’s wallet, but latency is still a matter of simple physics.  No matter how fast networks get, data will never travel faster than the speed of light.
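Both limits can be put into rough numbers. The sketch below is illustrative only: the function names are made up for this article, and the 0.67 factor is an assumed approximation of how fast a signal travels in optical fiber relative to a vacuum. Real paths add routing and equipment delay on top of this floor.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum


def min_round_trip_ms(distance_km: float, medium_factor: float = 0.67) -> float:
    """Lower bound on round-trip time over distance_km of network path.

    medium_factor is an assumed ratio of signal speed in the medium to
    the speed of light in a vacuum (roughly two thirds for optical fiber).
    """
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * medium_factor)
    return 2 * one_way_s * 1000


def required_mbps(changed_gb_per_hour: float) -> float:
    """Sustained bandwidth needed just to keep up with the data change rate."""
    return changed_gb_per_hour * 8 * 1000 / 3600


if __name__ == "__main__":
    # Hypothetical site pair 1,000 km apart with 45 GB of changed data per hour.
    print(f"latency floor: {min_round_trip_ms(1000):.1f} ms round trip")
    print(f"bandwidth floor: {required_mbps(45):.0f} Mbps")
```

If the link cannot sustain the change rate, replication falls further behind with every interval and the stated RPO cannot be met, no matter how often replication is scheduled.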

The Recovery Time Objective depends primarily on the size of the environment and how well the recovery procedures have been documented.  Recovery is critical, so it must be handled in a well-orchestrated fashion.  What are my most business critical applications?  What systems do they need for support?  (Those will need to come up first, but before they do, network access and security need to be restored.)  Do I know who the application owners are?  Do I have an escalation procedure?  Once the systems are restored, do their IP addresses, DNS records, certificates and so on need to be changed?

Based on the applications, it is possible to have a different set of RPO and RTO for each application.  Protection groups can facilitate organization of systems into different recovery groups.  Scripts may be used to help automate the recovery process within each protection group and to orchestrate the recovery of the entire environment.
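One way to script the recovery order is to record what each system depends on and let a topological sort work out the sequence. This is only a sketch under assumptions: the system names are hypothetical, and a real orchestration tool would also handle timeouts, retries and per-group RTO targets.

```python
from graphlib import CycleError, TopologicalSorter


def recovery_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return systems in an order where every dependency comes up first.

    dependencies maps each system to the set of systems it needs.
    Raises ValueError if the dependencies form a loop.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency in recovery plan: {exc.args[1]}") from exc


if __name__ == "__main__":
    # Hypothetical protection group for one business critical application.
    erp_group = {
        "network": set(),
        "firewall": {"network"},
        "dns": {"network"},
        "database": {"firewall", "dns"},
        "erp-app": {"database"},
    }
    print(recovery_order(erp_group))
```

Keeping one dependency map per protection group makes it easy to recover the most critical group first and to spot a circular dependency during planning rather than during a disaster.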

Documenting critical resources and recovery methods is pivotal during the DR planning process.  Focus should be placed only on the most business critical applications and the resources needed to support them.  All non-essential systems should be left off the plan.  This will reduce cost but, most importantly, it will reduce the time needed for a full recovery.  Once a DR plan has been implemented, testing and maintenance are the most critical aspects of a successful DR procedure.  A DR plan is never complete; the best anyone can hope for is that it is up to date.  A DR plan is a work in progress that must be maintained as systems and applications are added and retired.  DR considerations should be a critical component of each and every change control process in the production environment.

For purposes of this article we will be reviewing four of the most common alternatives:

  • Managed Production and DR to the cloud
  • Host based replication to the cloud
  • Cloud backup and restore
  • Cloud backup/Cloud restore

The DR Hosting Scenario table describes some of the advantages and disadvantages of each of these scenarios and provides some implementation considerations.

Managed Production and DR to the cloud

This option involves hosting a production application completely in the cloud.  In this scenario the cloud provider may be a managed colo facility or a managed services provider (MSP).  This is a pure cloud based model for the application in question and pricing varies by MSP.  Traditional cloud or colo providers may charge a set fee per managed device, with the price dependent on the Service Level Agreements (SLAs).  The provider is completely responsible for providing DR and access to the application.  Other factors that contribute to the price are the amount of data stored and the amount of bandwidth required to access the application remotely.  Adherence to the SLA is of utmost importance.  During negotiations you should ask what methods the provider uses to measure adherence on an ongoing basis and what compensation will be made if the SLAs are not met.

Due to the current cost structure of cloud services, this option is most feasible for a small set of mission critical applications and not for an entire IT data center.  A recent analysis we performed for a small organization showed a significantly lower TCO for implementing a complete Private Cloud solution with DR hosted in a colocation facility.  In that analysis, the cost of the colocation facility, consulting services, software and hardware was amortized over a 3 year lease with a $0 buyout, and the monthly lease was slightly less than the monthly cost of the cloud storage alone.  During a DR scenario the savings would grow even further, since the recovery efforts would incur no incremental costs.

Software as a Service (SaaS) applications such as Email, Customer Relationship Management (CRM) and Business Management are examples of hosted applications that adhere to this model.  These types of services are being adopted more readily by many organizations.  Examples include Office 365, Salesforce.com and NetSuite.

Host based replication to the cloud

Host based replication to the cloud is possibly the costliest alternative, but it provides the shortest RPO and RTO.  In this scenario, the production environment is recreated in the cloud.  Virtual network switches, routers, firewalls, load balancers and other networking devices would be deployed exactly as in production, with the same set of security policies.  Each host that needs protection needs to have a similarly configured virtual twin in the cloud.  Host based replication is used to replicate the data between the servers and keep it in sync.

In this model the cloud providers usually charge a set fee for resources consumed, such as storage, network bandwidth within the cloud, network bandwidth in and out of the cloud, RAM and CPU.  The advantage of this model is that recovery is fairly simple, since the data is completely in sync and the servers are already running.  In this scenario global load balancers may be used to perform a health check on the production applications and redirect traffic to the DR environment when needed.
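The health check and redirect logic can be sketched in a few lines. This is not how any particular global load balancer works; the class name, the threshold of three failed checks and the site labels are assumptions for illustration. Requiring several consecutive failures avoids redirecting traffic because of a single dropped probe.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks health check results and decides which site receives traffic."""

    failure_threshold: int = 3      # assumed number of failed checks before failover
    consecutive_failures: int = 0
    active_site: str = "production"

    def record(self, healthy: bool) -> str:
        """Record one health check result and return the site to route to."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active_site = "dr"
        return self.active_site


if __name__ == "__main__":
    monitor = FailoverMonitor()
    for result in (False, False, True, False, False, False):
        print(monitor.record(result))
```

Note that this sketch never fails back on its own. Returning traffic to production is treated as a deliberate, manual decision, since the data written at the DR site has to be replicated back first.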

Since the entire environment needs to be up and running continuously in the cloud, these charges tend to add up fairly quickly.  For larger environments it may be more cost effective to lease the required hardware and host it in a colo.

Cloud Backup

Cloud backup is another cloud facet that is gaining in popularity.  The idea is that instead of backing up your data to tape and then having to store the tapes offsite, you synchronize the backups to the cloud.  The backup process needs to eliminate redundant data (de-dupe), then compress and encrypt what remains.  This is referred to as dehydration.  By synchronizing the data to the cloud and storing it there in a dehydrated form, the amount of data being transmitted is greatly reduced, which shortens the backup/recovery process and reduces the cloud storage costs.  Encrypting the data protects it while it is being transmitted and while it is at rest on shared cloud storage.
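A minimal sketch of the de-dupe and compress steps is shown below. It assumes fixed-size chunks and an in-memory store, both simplifications made for illustration; commercial backup products use their own chunking and storage formats. The encryption step is deliberately left out, because it should come from a vetted cryptography library rather than a hand-written example.

```python
import hashlib
import zlib


def dehydrate(data: bytes, chunk_size: int = 4096) -> tuple[dict[str, bytes], list[str]]:
    """Split data into chunks, keep each unique chunk once, and compress it.

    Returns (store, recipe): store maps a chunk digest to its compressed
    bytes, and recipe lists the digests in their original order.
    """
    store: dict[str, bytes] = {}
    recipe: list[str] = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:          # de-dupe: a repeated chunk is stored once
            store[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return store, recipe


def rehydrate(store: dict[str, bytes], recipe: list[str]) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(zlib.decompress(store[digest]) for digest in recipe)


if __name__ == "__main__":
    original = b"A" * 4096 * 3 + b"B" * 4096   # three identical chunks plus one other
    store, recipe = dehydrate(original)
    stored_bytes = sum(len(c) for c in store.values())
    print(f"{len(original)} bytes reduced to {stored_bytes} bytes in {len(store)} chunks")
    assert rehydrate(store, recipe) == original
```

Only the chunks missing from the cloud copy of the store need to be transmitted, which is where the bandwidth savings come from.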

Backing up the data to the cloud is fairly simple, and many of the traditional backup vendors have the ability to back up data to a wide variety of cloud hosting companies.  Once in the cloud, the data must be synchronized and updated on a recurring basis.  The interval between backups determines the achievable RPO and is limited by the amount of B/W and latency available between the production site and the cloud provider.  The initial sync can be quite time consuming, depending on the amount of data and B/W available, but incremental syncs can be completed in a much shorter time frame.  Some cloud providers can accept the initial data load from a disk that is created at the production facility, thus eliminating the need to perform the initial backup over the network.
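The reason an incremental sync is so much shorter than the initial one is that only the differences are sent. As a rough sketch, assuming each backup run produces a manifest that maps a file path to a content digest (the manifest format and function name here are hypothetical), the work for the next run can be planned like this:

```python
def plan_incremental_sync(
    previous: dict[str, str], current: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare two backup manifests (path -> content digest).

    Returns (upload, delete): paths that are new or changed since the
    previous run, and paths that no longer exist at the production site.
    """
    upload = sorted(path for path, digest in current.items() if previous.get(path) != digest)
    delete = sorted(path for path in previous if path not in current)
    return upload, delete


if __name__ == "__main__":
    # Hypothetical manifests from two consecutive backup runs.
    last_night = {"db/orders.dat": "a1", "docs/plan.doc": "b2", "tmp/old.log": "c3"}
    tonight = {"db/orders.dat": "a9", "docs/plan.doc": "b2", "docs/new.doc": "d4"}
    print(plan_incremental_sync(last_night, tonight))
```

The initial sync is simply the case where the previous manifest is empty, so every file lands in the upload list.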

Restoring operations in this scenario is much more complicated.  The data needs to be restored to alternate equipment, preferably at an alternate site.  Some traditional DR providers offer programs where they make equipment available on an as-needed basis.  In this scenario, the network infrastructure would need to be restored first, then the target systems with a base OS, and finally the data would be restored from the cloud over the network.  This process can take weeks to complete, depending on the complexity of the infrastructure and the amount of data involved.  This time counts toward the RTO.

Cloud Backup/Cloud Restore

This scenario is similar to the previous one, but instead of restoring the data over the network, the data is restored within the cloud to a set of virtual machines.  In this scenario the target hosts need to be pre-provisioned and ready for deployment.  In some cases it may be advantageous to perform a full restore of the servers during the staging process and incremental restores prior to DR tests.  At that point the hosts are turned off until needed.  This model is more attractive than the previous one because it does not require a DR equipment contract and you only pay for the hosts while they are running.
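To see why paying only for running hosts matters, compare the same environment kept running all month against one that is powered on only for a DR test. Every rate and quantity below is a made-up placeholder, not a quote from any provider; substitute your own provider's pricing.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def monthly_cost(
    hosts: int,
    hourly_rate: float,
    hours_running: float,
    storage_gb: float,
    storage_rate_gb_month: float,
) -> float:
    """Compute charges for the hours the hosts run, plus storage for the month."""
    compute = hosts * hourly_rate * hours_running
    storage = storage_gb * storage_rate_gb_month
    return compute + storage


if __name__ == "__main__":
    # Hypothetical figures: 10 hosts at $0.50 per hour, 1,000 GB at $0.02 per GB-month.
    always_on = monthly_cost(10, 0.50, HOURS_PER_MONTH, 1000, 0.02)
    standby = monthly_cost(10, 0.50, 8, 1000, 0.02)   # powered on for one 8 hour DR test
    print(f"always on: ${always_on:,.2f}  standby: ${standby:,.2f}")
```

The storage charge is the same in both cases, so the difference comes entirely from the hours the hosts spend running.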

The downside to this model is that the hosts are running in a shared environment, so security is a major concern.  Networking needs to be implemented in a purely virtual fashion (virtual switches, routers, firewalls, IDS/IPS, load balancers, etc.).  Since billing is based on a usage model, the virtual network does not need to be turned on until a disaster is declared, at which time it is the first thing that is brought back online.  At this point the standby hosts are turned on and the backed up data is restored.  This process is again very time consuming and very involved, with many opportunities for human error.  To reduce human error, facilitate the DR process and reduce the RTO, automation and orchestration should be implemented as much as possible.
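As a sketch of what that automation might look like, the runbook below executes the recovery steps in the order described above and stops at the first failure, so that later steps never run against a half-built environment. The step names and the placeholder actions are hypothetical; in practice each action would call the cloud provider's or backup vendor's own tooling.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> list[str]:
    """Run recovery steps in order and return the names of completed steps.

    Each step is (name, action), where action returns True on success.
    Raises RuntimeError at the first failing step.
    """
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"step failed: {name}; completed so far: {completed}")
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions that always succeed, standing in for real tooling.
    runbook: list[Step] = [
        ("power on virtual network", lambda: True),
        ("power on standby hosts", lambda: True),
        ("restore backed up data", lambda: True),
        ("notify application owners to test", lambda: True),
    ]
    print(run_runbook(runbook))
```

Because the runbook is code, it can be exercised during every DR test, which keeps the procedure honest as systems are added and retired.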

Other Alternatives...

As you can imagine, the landscape for this technology is rapidly evolving.  Major IT vendors are starting to launch their own public cloud offerings, and each will have unique features and pricing structures.  Stay tuned for updates to this article as the leading virtualization provider starts to enhance its public cloud offering.

Same Old Story

Although some new buzzwords and terminology have been thrown into the game, at the end of the day it’s the same old story.  Surviving a disaster is predicated on planning and rehearsal.