Ignore These Two Paths to Cloud Failure at Your Own Risk

Congratulations! Your awesome startup is up and running: you’ve built on a ton of open source software and tools, adopted some great SaaS products, deployed to the cloud, and you’re scaling effortlessly with increasing demand. So what could go wrong?

Many things, as it turns out, but two stand out: availability and security. Failure to manage either of these issues can be business busting.

Modern software practices enable tremendous velocity to even the smallest teams through the leverage that open source and outsourcing offer. The downside of this leverage is that it pins the successful operation of your services to your dependencies: when they fail, you fail.

Availability

Let’s look at availability first. Your team has delivered a solution that runs in multiple availability zones in multiple regions, with replicated databases and so on; there’s just no way you’re going to go down! Right?

I’m sorry to tell you, but this is just not true.

The truth about any cloud provider or data center is that it will fail. Sometimes the failure is subtle, but often it can be spectacular, as in the February 2017 AWS S3 outage.

Here are just a few things you might experience during an outage when you use outside vendors to deliver your solution:

You can’t deliver key services
You experience data loss
You experience loss of system visibility
Your Continuous Integration and Delivery (CI/CD) pipeline is broken and you can’t deploy

Key Services: Maybe you’re using an outside service to deliver real time events to mobile devices, or you’re outsourcing transactions. If these services are down, it may not really matter that your own service is up.

Data Loss: What if you’re outsourcing your log management? You’ll lose visibility into what is going on with your systems, but more importantly this could lead to spectacular system failures when undelivered logs fill up your server’s disks.

Alternatively, you might be using a third party’s storage for analytics or other functions. Can you cache data until those systems become available? What happens if data doesn’t make it there?
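To make the caching question concrete, here is a minimal Python sketch of a bounded in-memory spool. Treat it as an illustration under assumptions rather than a finished design: `BoundedSpool`, `send` and `max_records` are names I made up, and `send` stands in for whatever call ships a record to your third party. The point is the cap. When the vendor is down, the backlog stops growing at a size you chose, and you get a count of what was dropped instead of a full disk.

```python
from collections import deque
from typing import Callable, Deque


class BoundedSpool:
    """Hold undelivered records in memory, dropping the oldest when full."""

    def __init__(self, send: Callable[[str], None], max_records: int) -> None:
        self._send = send  # hypothetical vendor client call
        self._pending: Deque[str] = deque(maxlen=max_records)
        self.dropped = 0   # how many records were discarded to stay bounded

    def submit(self, record: str) -> None:
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1  # the deque evicts the oldest record on append
        self._pending.append(record)
        self.flush()

    def flush(self) -> None:
        """Deliver the backlog in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return  # vendor still down; keep what is left for later
            self._pending.popleft()
```

Dropping the oldest records is only one possible policy. Whether you would rather lose old data, lose new data, or spill to a size-limited file depends on what the data is worth to you.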

Loss of Systems Visibility: What happens when your outsourced monitoring tool suffers an outage? Will you still be able to operate? Will you be flying blind during an outage at your own cloud provider?

CI/CD Pipeline: It’s not unusual to need to deploy some kind of patch to your production systems during a cloud outage. So even if you could otherwise weather the storm, being unable to deploy the patch that gets your service back online is a real problem.

One word of warning: many upstream repositories and services went down or were severely degraded during the February 2017 AWS S3 outage, including two important container registries, Docker Hub and Quay. With these services seriously affected, businesses found it difficult to push changes, launch new service instances with the latest updates, or even launch new instances at all!

Addressing third party availability:

I hope I’ve convinced you that you really should address this issue. The good news is that you’ve got options. Before you do anything though, take some time and have your team review and catalog your exposure.

As your team builds its report, it should include two items at a minimum: the effect of an outage on the business and the effect of an outage on the service. The effect may be negligible, or it may be profound; try to be specific about the effect. Note that a negligible business effect may also be paired with a serious effect on your systems when software was constructed with the assumption that dependent services are always available.
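If it helps to picture the report, here is one way its entries could be sketched in Python. Everything in it is an assumption made for illustration: the `Impact` levels, the `Dependency` fields and the `by_urgency` ordering are my own invention, not a standard. The ordering uses the worst of the two effects first, so a dependency with a negligible business effect but a severe service effect still rises toward the top.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Impact(IntEnum):
    NEGLIGIBLE = 0
    DEGRADED = 1
    SEVERE = 2


@dataclass(frozen=True)
class Dependency:
    name: str
    business_impact: Impact  # effect of an outage on the business
    service_impact: Impact   # effect of an outage on the service itself
    notes: str = ""


def by_urgency(catalog: List[Dependency]) -> List[Dependency]:
    """Order entries by their worst impact, then by the combined impact."""
    return sorted(
        catalog,
        key=lambda d: (
            max(d.business_impact, d.service_impact),
            d.business_impact + d.service_impact,
        ),
        reverse=True,
    )


# Hypothetical entries, loosely based on the examples in this article.
catalog = [
    Dependency("analytics storage", Impact.NEGLIGIBLE, Impact.NEGLIGIBLE),
    Dependency("log management", Impact.NEGLIGIBLE, Impact.SEVERE,
               notes="undelivered logs can fill server disks"),
    Dependency("mobile push events", Impact.SEVERE, Impact.DEGRADED),
]

for dep in by_urgency(catalog):
    print(dep.name, dep.business_impact.name, dep.service_impact.name)
```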

Armed with your report and the full knowledge of the threats these dependencies pose to your business, you can begin the process of addressing them. Some dependencies may not be worth addressing, or can be kicked quite a ways down the road, but others will clearly be more urgent.

Some items may be addressed through service level agreements (SLAs) with your vendors, but avoid simple uptime requirements. If your vendor’s only outage of the year is six hours on Black Friday, they may still satisfy their SLA, but you’d be out a lot of business. Other items may require small changes to your software to handle outages more gracefully, or may inspire you to provide your own in-house solution. Once you understand the actions you need to take, you can prioritize and build a road map that gets you where you need to be.
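The Black Friday arithmetic is worth doing once. The sketch below assumes an SLA of 99.9% uptime measured over a full year; that figure is my assumption for illustration, so substitute the target and measurement period from your own contract. The function names are made up as well.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(outage_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the service was up."""
    return 1 - outage_hours / period_hours


def allowed_downtime_hours(sla: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime a vendor can accumulate and still meet the SLA."""
    return (1 - sla) * period_hours


def meets_sla(outage_hours: float, sla: float,
              period_hours: float = HOURS_PER_YEAR) -> bool:
    return availability(outage_hours, period_hours) >= sla


# A single six-hour outage is about 99.93% availability over the year,
# which is inside the roughly 8.76 hours that a 99.9% annual SLA allows.
print(f"{availability(6):.4%}")
print(f"{allowed_downtime_hours(0.999):.2f} hours allowed")
print(meets_sla(6, 0.999))
```

An uptime percentage says nothing about when the downtime happens, which is why terms that cover your peak periods or cap the length of any single outage are worth negotiating.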

Now that you’ve got your road map in hand, let’s take a look at security.

Security

The truth about security is that the question is not whether you will be subject to attack and intrusion, but when, and how big the blast radius will be. Addressing security in general is well beyond the scope of this article, so I’m just going to focus on what is often considered the biggest attack surface for your service: your open source software supply chain.

The dark side of open source is that it is open: anyone can read the source code and craft an exploit, or sneak malware into a popular open source library, for example by adding a dependency on some innocuous-looking package that turns out to be malicious.

You’re also subject to the whim of copyright holders, who may decide to pull a project from public access. At best, all of your builds and deployments will fail. At worst, they will keep working, but with replacements published under the newly vacated names by hackers who have seized the moment to introduce their own malicious code into the open source stream.

It’s good news, then, that security threats in repositories with well-trained maintainers and an active community are usually identified and patched quickly. What makes it hard for you is that you’re likely using dozens, if not hundreds, of packages (either directly or through dependencies). Knowing when, or if, you need to upgrade or patch your software is difficult.

Addressing your Supply-chain:

You’ve got a few choices:

Use compensating factors
Insert security scanning into your pipeline
Control your repositories
Write your own software

Use compensating factors: Tried and true solutions, and some newer solutions, can be used to make access to and from your servers difficult. This includes activities like locking down your service ports, configuring firewalls with egress (outbound) rules to prevent access to all but a few endpoints, DMZ networks with strict routing rules and service meshes with zero trust networking.

Insert security scanning into your pipeline: Prioritizing your pipeline work to enable quick turnarounds on patches is essential if you hope to be able to react quickly to threats, especially on servers that are directly connected to the Internet. Use solutions like Snyk, Gemnasium and Black Duck Software (to name a few) to identify vulnerabilities and learn how to patch them. These tools help ensure that you aren’t flying blind and can quickly repair your builds and deployments.
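The idea behind a scanning step can be sketched in a few lines of Python. To be clear about the assumptions: this does not use the API of Snyk or any other product, the `Advisory` type and `find_vulnerable` function are hypothetical, the package names are invented, and the version comparison only understands plain dotted numbers such as 1.4.2. A real scanner pulls advisories from a maintained database and handles far messier version schemes.

```python
import sys
from typing import Dict, List, NamedTuple, Tuple


class Advisory(NamedTuple):
    package: str
    fixed_in: str  # first version that contains the fix
    summary: str


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.4.2' into (1, 4, 2); assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def find_vulnerable(pinned: Dict[str, str],
                    advisories: List[Advisory]) -> List[str]:
    """Report every pinned package that is older than a known fix."""
    findings = []
    for adv in advisories:
        installed = pinned.get(adv.package)
        if installed is None:
            continue  # we do not depend on this package
        if parse_version(installed) < parse_version(adv.fixed_in):
            findings.append(
                f"{adv.package} {installed}: upgrade to {adv.fixed_in} "
                f"({adv.summary})"
            )
    return findings


if __name__ == "__main__":
    # Invented data for illustration only.
    pinned = {"examplelib": "1.4.2", "otherlib": "2.0.0"}
    advisories = [Advisory("examplelib", "1.4.10", "hypothetical flaw")]
    problems = find_vulnerable(pinned, advisories)
    for line in problems:
        print(line)
    sys.exit(1 if problems else 0)  # non-zero exit fails the pipeline step
```

Comparing versions as tuples of integers matters here: as strings, "1.4.10" sorts before "1.4.2", and the check would wrongly pass.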

One issue worth mentioning with respect to patches is that they have the potential to introduce breaking changes to your software. It’s important to try to remain as up to date as possible with your dependencies to minimize breaking changes so you don’t end up passing on an important patch. Passing on patches has led to some of the web’s most catastrophic hacks (naming no names).

Control your repositories: Make sure you’re in full control of what software makes it into production. You can set up your own mirrors using open source solutions, or tools like JFrog’s Artifactory, to put you in control of when your open source dependencies get updated and what you will accept. How much of this you do is up to you. You may focus solely on production or on container images, or you may decide to run your entire pipeline on your own repositories. A side effect of managing your own repositories is that your pipeline can remain functional during a cloud outage.
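One small building block of that control is refusing any artifact whose contents you have not already approved. Here is a minimal Python sketch using SHA-256 digests from the standard library’s `hashlib`. The `approved` mapping and the `is_approved` function are hypothetical names for this example; in practice the list of approved digests would live in your mirror or in a lock file under version control.

```python
import hashlib
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def is_approved(name: str, data: bytes, approved: Dict[str, str]) -> bool:
    """Accept an artifact only if its digest matches the recorded one."""
    expected = approved.get(name)
    if expected is None:
        return False  # never reviewed, so never accepted
    return sha256_hex(data) == expected


# Record the digest when the artifact is reviewed...
reviewed = b"contents of the reviewed release"
approved = {"examplelib-1.4.10.tar.gz": sha256_hex(reviewed)}

# ...and check it again every time the artifact is fetched.
print(is_approved("examplelib-1.4.10.tar.gz", reviewed, approved))
print(is_approved("examplelib-1.4.10.tar.gz", b"tampered contents", approved))
```

A digest check like this catches an artifact that changes under a name you already trust, which is exactly what a hijacked package name looks like from your pipeline’s point of view.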

Write your own software: If there is a need to lock down your systems, you may consider replacing some of your open source software with software your team has written. Sometimes software suffers from bloat when too many dependencies are added to provide too little value. Go with the less-is-more approach in those cases and write your own. In other cases, you might consider forking an open source library and hardening it to your own needs, or simply replacing the library because the function is too mission critical.

Next Steps

Your customers expect a lot from you. They expect your service to be cheap, provide excellent value and behave like a utility where the lights are always on. You lose their trust when they can’t get what they need from you, or worse, if their personal information is compromised.

If you are currently deployed to the cloud, or are thinking about it, take some time to think seriously about availability and security. These issues seem extremely boring, but that is the point. Delivering your service should be boring. US-east goes down; no problem. A new zero-day exploit is found; no problem.

Build awareness and urgency around these issues, then build your own plan and roadmap with the full knowledge of what availability and security mean to your business.