Executive Summary

In the wake of Hurricane Sandy, a longtime customer wanted to get serious about disaster recovery – but lingering infrastructural issues were preventing a streamlined solution. VMware Infrastructure Navigator and SRM solved the DR problem, but IIS’ new, app-centric perspective and prioritized dependency plan improved performance and prepared the customer for the future.

Services

  • Disaster Recovery
  • Infrastructure
  • Networking Services

A Newfound Commitment

The true value of any partnership shines through in a worst-case scenario. The client, an international HR consulting group with over 1,000 employees, previously could not justify the expense of a new disaster recovery (DR) solution. But Hurricane Sandy changed everything: Central Manhattan lost power, and having become even more reliant on its environment over the past year, the client was committed to maintaining its operations during and after a DR scenario.

In 2016, unplanned outages cost companies an average of $9,000 per minute.

Too Many Solutions

The problem was not that the client lacked a DR plan altogether; in fact, it had several DR strategies in play over three datacenters nationwide. The result, however, was an immensely complex solution that created as many problems as it solved.

When administrators needed to move a VM to the DR site, they would provision a new VM, install Windows and the necessary applications, restore the data … and simply hope that the application came back online. Plenty of smaller companies do business this way, of course, but the method is prone to human error, and as the number of VMs increases, it does not scale to meet requirements. Having completed over 25 previous projects together, IIS and the customer partnered once again to design and implement a single, cohesive DR solution that protected the virtual machines (VMs).

For this client, the failover process was just the beginning – inter-site failover and replication were not clearly defined, either. No site held a complete copy of another site’s data, so none could fully support a failover.

Before getting into the infrastructural details, there were the business goals to consider; while IIS was brought in to minimize application outage windows in case of site failure, a true solution also needed to marry the priorities of systems administrators and upper management. IT often takes a very VM-centric view of applications, but the resulting solution had to protect the applications for the business, rather than the individual VMs.

After surveying the datacenters and the loose DR plan, IIS sat down with client leadership to learn more about the existing service level agreements (SLAs) and recovery time objectives (RTOs) in place. The customer had no clearly defined RTOs; essentially, systems administrators were trying to get back online as fast as possible, depending on the backup of each VM within the application. But because each application’s VMs had different recovery point objectives (RPOs), IT could not guarantee consistency.
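The consistency problem can be shown with a short sketch. It is a simplified model, not the customer's tooling: the VM names and RPO values are hypothetical, and it assumes an application can only be guaranteed to the loosest RPO among its VMs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    rpo_minutes: int  # maximum tolerated data loss for this VM


def application_rpo(vms):
    """Loosest RPO among the VMs; the application can promise no better."""
    return max(vm.rpo_minutes for vm in vms)


def rpos_aligned(vms):
    """True only when every VM in the application shares one RPO."""
    return len({vm.rpo_minutes for vm in vms}) == 1


# Hypothetical application: names and values are illustrative only.
payroll = [VM("payroll-web", 1440), VM("payroll-app", 240), VM("payroll-db", 15)]

print(f"Application-level RPO: {application_rpo(payroll)} min")
print(f"RPOs aligned across VMs: {rpos_aligned(payroll)}")
```

When the RPOs are not aligned, the restored VMs can reflect different points in time, which is why recovery per VM could not guarantee a consistent application.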

Keep It Simple

Design

Right away, IIS technicians knew they would have to craft a custom solution to protect all three datacenters. Because the customer was using VMware vSphere and LeftHand iSCSI storage at all three sites, DR automation products like Zerto and VMware Site Recovery Manager (SRM) immediately became viable options for their ability to define and test a plan in isolation. Another option was to replicate the data into a public cloud platform, like vCloud Air, but with three independent sites to accommodate, the cost was not justifiable.

IIS also made some fundamental infrastructure adjustments that helped change the VM-centric view held by the customer’s admins. VMware Infrastructure Navigator (VIN) untangles a customer’s giant infrastructural web into an easily understandable application dependency map – simply put, the customer finally finds out what is really on its servers.

The final application failover design sent production applications in Phoenix and New York to Chicago, and Chicago applications to New York. Because Phoenix did not have the infrastructure to accommodate failover from another site, the customer required a solution that put multisite failover burden on one site, while minimally increasing the physical footprint of any datacenter.
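The failover topology described above can be written down as a simple mapping. This is an illustrative sketch only: the site names come from the design, while the function and variable names are hypothetical.

```python
# Failover targets as described in the design: source site -> recovery site.
FAILOVER_TARGET = {
    "Phoenix": "Chicago",
    "New York": "Chicago",
    "Chicago": "New York",
}


def sources_by_target(mapping):
    """Invert the mapping to show which sites each recovery site must absorb."""
    inverted = {}
    for source, target in mapping.items():
        inverted.setdefault(target, []).append(source)
    return inverted


load = sources_by_target(FAILOVER_TARGET)
for site, sources in sorted(load.items()):
    print(f"{site} receives failover from: {', '.join(sorted(sources))}")

# Phoenix is never a recovery site in this design.
print("Phoenix" in load)
```

Inverting the map makes the burden on each site visible: Chicago absorbs two sites, New York absorbs one, and Phoenix absorbs none.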

For this customer, VMware SRM was to be that solution. Its N+1 failover capability allowed a single site to be a multisite failover location, and any site could be expanded within a consistent design. Additionally, VMs were protected in either direction within each SRM pair – so Chicago could be protected by New York through added protection groups and recovery plans. But before IIS could install what it had prescribed, the client’s infrastructure required the integration work that would ensure the solution performed to its potential.

Implementation

At IIS, Implementation is as much a consultative process as Discovery or Design, so with this solution came a change in certain procedures and processes. Prior to implementation, the customer had provisioned VMs without regard for storage I/O or placement – so IIS recommended purpose-built datastores with individual RPOs and performance characteristics. This simple but holistic change offered much-needed performance increases, and maintained an environmental consistency that the customer had felt was an impossible mountain to climb.
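One way to picture purpose-built datastores is to group VMs by RPO so that each datastore replicates on a single schedule. The sketch below assumes a hypothetical inventory and a made-up datastore naming scheme, and it ignores the performance characteristics that the actual design also considered.

```python
from collections import defaultdict

# Hypothetical inventory: (vm_name, rpo_minutes). Values are illustrative only.
INVENTORY = [
    ("hr-db-01", 15),
    ("hr-app-01", 240),
    ("hr-db-02", 15),
    ("file-01", 1440),
]


def plan_datastores(inventory):
    """Group VMs by RPO so each datastore can replicate on one schedule."""
    plan = defaultdict(list)
    for vm_name, rpo_minutes in inventory:
        # Datastore name is a made-up convention for this example.
        plan[f"ds-rpo-{rpo_minutes}m"].append(vm_name)
    return dict(plan)


for datastore, vms in sorted(plan_datastores(INVENTORY).items()):
    print(datastore, vms)
```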

With the stage finally set, it was time to bring in VMware SRM. As in any implementation process, the first priority was not to disrupt day-to-day business; technicians worked late into the night migrating VMs to the allocated datastores on the replicated Logical Unit Numbers (LUNs). Storage space had become a mild constraint, so IIS designed a migration path to clean up old LUNs as new ones were added, redistributing storage into replicated datastores for each RPO.

An automated DR solution is only as good as its predefined procedures

After installation, IIS conducted a 40-point validation process on each site, using test virtual machines. The checks covered the following areas:

  • Product
  • Site Connectivity
  • Storage Replication
  • Failover
  • Failback
  • Test failovers
  • Networking Changes
  • Dependency Mappings
  • Priorities

The implementation was on schedule and within budget – but what good is a DR product unless it is tested? Using SRM, IIS simulated the failover process with an isolated nonproduction network and a storage snapshot; without affecting production applications or replication, the simulation validated the boot order, IP address changes, dependency mappings, boot times, and custom prompts defined within the recovery plans.
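The link between dependency mappings and boot order can be sketched with a topological sort. This is not SRM's API or its internal algorithm; it is a minimal illustration using Python's standard library, with a hypothetical dependency map and VM names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each VM lists the VMs it needs running first.
DEPENDS_ON = {
    "hr-web-01": {"hr-app-01"},
    "hr-app-01": {"hr-db-01", "dc-01"},
    "hr-db-01": {"dc-01"},
    "dc-01": set(),
}


def boot_order(depends_on):
    """Return VM names ordered so every dependency boots before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


order = boot_order(DEPENDS_ON)
print(" -> ".join(order))
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is the kind of mistake a test failover is meant to surface before a real disaster does.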

Many customers stop testing and validation here – but in a real DR scenario, it is critical that all staff be fluent in failover processes and procedures. IIS scheduled several failover and failback tests to be conducted by the client’s staff; following the operational documentation, the staff was able to beat the SLAs on every test.

Beyond DR

What began as a disaster recovery plan became a turning point for the customer’s infrastructure. The client no longer considers its infrastructure in terms of its VMs, and with that has come a behavioral shift in the way staff provisions VMs. The standards implemented by IIS became ingrained into the customer’s deployment and maintenance procedures.

As the relationship has progressed, IIS has helped the customer leverage the integration of HPE 3PAR storage and deploy monitoring software to better manage the environment while optimizing performance and replication traffic. But the biggest evolution has come at the very foundations of the infrastructure. By preparing for the worst, the customer and IIS have set the foundation for the future, no matter what – disaster or otherwise – lies ahead.