Why is Less than 99.95% Desktop Availability Acceptable?

October 11, 2018

In this post, I’ll share some key insights derived from many great conversations with IT and business leaders who are on their journey to the cloud. I’m honored that so many great thinkers felt free to share their concerns and beliefs so we could openly discuss them. What I am finding is that once they understand how we achieve cloud PC performance that is often better than physical PCs, some of the top concerns that arise include reliability, availability and disaster recovery of the desktop infrastructure. Here’s how some of those conversations unfolded and why they’re important for your cloud journey.

Reliability

If I am talking to an IT person, they are likely looking at the problem from a physical PC perspective, because that’s what IT teams do; they think about break/fix SLAs for PCs, workstations and other endpoints such as smartphones and tablets. Typically, they select from terms such as a four-hour or eight-hour 24×7 service or next-day response time, and the level of service they choose dictates the fee they pay. Similarly, if I am talking to an existing VDI customer, they have the same kinds of problems when data center infrastructure fails. Everyone has been doing this for a very long time, and really there hasn’t been a better way – so they might be used to it, but is it optimal?

In the cloud era, why are you waiting around for four hours or eight hours or 24 hours for a response when something breaks? That wait isn’t even for the fix! With cloud PCs, you might have to wait 10 minutes for your IT contact to finish talking to her kids on the phone. Once you have her attention, it can take as little as a minute or two to spin up one, ten or a hundred new cloud PCs!

Here’s more food for thought: As devices modernize and more people want BYOD choices, how much end-user productivity value or delight is there in IT trying to own those physical assets for the masses? It’s a legacy way of thinking. We’ve already improved on reliability, so that’s off the table as an IT concern. So why not allow modern devices to provide first class, user-centric experiences, just like Microsoft recently illustrated with their Project xCloud gaming-from-the-cloud announcement. With this announcement, games can be reliably streamed to end-user devices of their choice under their control, including gaming consoles. Similarly, now you have a modern platform for desktop workloads: Microsoft Azure. If you can improve reliability and delight users, why continue to deal with resource-intensive, IT-owned physical end-point and data center assets that have slow and costly break/fix and managed services SLAs? I can’t think of a reason for this to be the default option, can you?

Availability

Once people start to understand the possibilities for improving the end-user experience, the conversation usually moves to the availability of the cloud service. The first thing I ask is: What is the availability of your existing PC infrastructure or VDI implementation? Nobody to date has been able to give me an accurate answer. However, they do share many war stories of the pain it causes them when things break. I then ask them: How many days of downtime are you willing to tolerate per year and what are you willing to commit to your CIO in regular operations? Most people will say we want no downtime; time is money, and “our business can’t be unavailable because of desktops that run our productivity applications.” Most of the companies I talk with are in the same boat; if desktops go down, business comes to a screeching halt.

Having been in those IT shoes for many years in a previous life, none of that is really surprising. It’s hard to measure the availability of on-prem VDI. However, here’s what is surprising. As companies evaluate moving desktop workloads to the cloud, they are paying little attention to cloud desktop SLAs and they don’t have a good understanding of what it means for their implementation. So, here’s a handy little calculator I like to show. Have a look. Now I have to ask: Why would you pick a lower SLA for your cloud desktops?

The Difference Between 99.95% and 99.5% Uptime

Most tell me that in either case, it’s way better than anything they are doing on-prem. However, they also tell me they are not willing to go to their CIO and ask them to agree in advance to almost two days of unplanned downtime per year!

Workspot is able to deliver an SLA for Desktop Cloud and Workstation Cloud of 99.95% (often greater) – “three and a half nines” – uptime. That’s just 4.38 hours per year of downtime! On the other hand, certain other vendors, who try to do what we do, will ask you to accept 99.5% – “two and a half nines” – so you’re being asked to accept almost 2 entire days of downtime per year. That seems like a lot to me. What’s going on with those cloud solutions? The high-availability of Workspot’s cloud PC service is achievable because Workspot is a modern, born in the cloud, cloud-native architecture vs. a legacy VDI architecture that is merely cloud-enabled. Cloud-enabled simply means that the legacy, on-prem architecture, with its own history of reliability problems, is now being hosted in the cloud. So, when something breaks, a bunch of things breaks, and it takes a while to sort it all out, and all that sorting is at your expense. Cloud-native means the solution is designed from the ground up on a modern stack that meets the needs of the future.

I then encourage them to go to the vendor’s services status site and look for two things. First, note the recent and current up-time metrics and next, examine the service history log for a few years.

System Metrics Should be Transparent & Easy to Understand

These metrics should be transparent and easy to understand as in the Workspot example above. It’s important to see how the service has been trending and not just show how things are right this second to get a complete and accurate understanding. Trawling through the history will give you a good sense of the complexity of a system and the number and types of problems they have had over the years. The length of the history also matters to get a sense of how mature the vendor is in their cloud service operations knowledge.

Disaster Recovery/Business Continuity

As we move past the reliability and availability discussions, disaster recovery is often the next topic of conversation. For PC users this evolves around having spare hardware capacity and moving people to a different site, and it all requires lots of planning. In a business disruption event, the changeover to the DR services is often slow and unreliable, and it requires that people can travel to the alternate location. In a weather event or other natural disaster, that is often not possible. For VDI customers, although better, a DR plan still involves building out additional, expensive data center capacity. Customers considering cloud services are rightly concerned when disaster strikes a cloud region as with the recent outage in Texas.

Workspot has been thinking about this problem for a long time. In response to the legacy DR solutions, and concerns about cloud region outages, we offer our Disaster Recovery Cloud solution to address on-demand PC use cases. Additionally, due to our cloud-native architecture, Workspot is able to offer multi-region cloud redundancy.

Be Educated and Ask the Tough Questions

Whether or not you’re feeling good about the reliability and availability of your IT systems and the DR planning you’ve done, I’m recommending that you take another look. As IT leaders accelerate the move away from on-premises data centers and move workloads to the cloud, one of the most compelling use cases for cloud computing is moving your PCs to the cloud. To start, you can achieve performance, cost and security benefits; but have you thought about the reliability and availability benefits we just discussed? Just as you had SLAs and Recovery Time Objectives (RTOs) for the physical PC world and VDI, you’ll have them for cloud PCs too, but what if the terms were more attractive, the service better and the cost lower? That sounds pretty good, right? I hope this post helps you ask the right questions so your organization can get exactly what it needs to stay protected, grow and prosper.

@harrylabana