Building Reliable Systems From Unreliable Parts

• 6 min read #reliability

Service-oriented architecture is a great architectural style, but it comes at a price. The more (micro)services you have, the more remote requests you'll have to send across them to have a working system. The more requests you make, the more failures you get. But don't panic yet! Even though you can't avoid failed requests, you can deal with them.

In this post we'll explore resiliency patterns that can improve Quality of Service and latency, and make customers happy (and managers, of course).

We'll cover some basic strategies such as Timeout, Retry, and Fallback, and also more advanced ones such as Hedging, Circuit Breaker, and Rate Limiting. We'll discuss how to combine these strategies and learn by example why configuring your resiliency strategies mindfully is just as important as having them.

Reliability. What's That?

A dictionary would say something like this: The quality of being trustworthy or of performing consistently well.

We're more interested in the second part of this definition: The quality of being trustworthy or of *performing consistently well*.

Our service must work well and do so consistently. In other words, our service must return as few 5xx errors as possible. That's what our customers will consider a failure, and a failure they can't do anything about. I'll call this "Quality of Service", or simply QoS. Later in this post I'll use QoS and Reliability interchangeably for the sake of simplicity.

Outage Math

If we had a single service on a single server it'd be easy: 99% QoS means that we return a 5xx error for only one request out of 100.

But what if, in order to complete a single request, we have to communicate with other services? None of them can have a QoS of 100%, of course. Let's say every single service has a QoS of 99%, or 0.99. How does it add up? It doesn't add up, it multiplies! If we have 5 downstream requests to make, the success probabilities compound: 0.99 * 0.99 * 0.99 * 0.99 * 0.99. How bad is it? It's 0.95 bad. We'll fail 5 out of 100 requests! And that's the best-case scenario! There are power outages, floods, network disruptions, you name it!

"But is it that bad? It's only 5 failures out of 100!", you may ask. Yes, it's that bad, I'll answer. 0.95 might be good enough for a system with little to no traffic, but the higher the RPS, the more significant the impact. Even at 100 RPS you'll have 5 users every second who suffer from a failure, and within a minute you'll have 300 angry customers. Do you want to upset your customers? I don't! At 1 million RPS those 5 failures per second grow to roughly 50,000.
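
To make the arithmetic concrete, here's a back-of-the-napkin calculator. It's a plain Python sketch, the numbers are illustrative, and it assumes failures are independent:

```python
# Composite QoS of a request that fans out to several dependencies,
# assuming independent failures: the success probabilities multiply.
def composite_qos(per_call_qos: float, calls_per_request: int) -> float:
    return per_call_qos ** calls_per_request

qos = composite_qos(0.99, 5)
print(f"Composite QoS: {qos:.3f}")  # ~0.951, i.e. roughly 5 failures per 100 requests

for rps in (100, 1_000_000):
    print(f"At {rps} RPS: ~{rps * (1 - qos):.0f} failed requests every second")
```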

Ways to Improve

michael_scott.gif

I bet almost every one of you has shouted at another service's team: WHY CAN'T THEY BUILD SOMETHING THAT WORKS???!!!1 I hope you didn't do it in front of that other team. But here's the thing: whether they can't or don't want to is irrelevant, because YOU CAN do something about it. You can improve the reliability you get out of another service without touching their code. Today I'll focus on remote calls: gRPC or HTTP, it doesn't matter, the principles are the same for all protocols and all programming languages.

Now I'll tell you how you can protect your service from other services' failures.

Timeout

That's one of the most basic and simple strategies. I believe it's implemented in every HTTP client. It's as simple as it sounds:

  1. Send request.
  2. Wait for the response.
  3. Don't receive the response within the desired time.
  4. Cancel the request.

"So I cancel the request, and how does that improve my service?" It doesn't, not directly. But it allows you to fail faster and fix faster, and you won't make the customer wait forever. A minimal sketch is below.
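
This sketch uses Python's standard library; the URL is hypothetical and the 2-second budget is a placeholder you'd tune to what your customer is actually willing to wait:

```python
import urllib.request
from urllib.error import URLError

# Hypothetical endpoint; in reality this is whatever remote service you depend on.
URL = "https://partner.example.com/api/items"

def fetch_with_timeout(url: str, timeout_s: float = 2.0) -> bytes:
    try:
        # urlopen() gives up and raises if the response doesn't arrive in time.
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except (URLError, TimeoutError) as err:
        # Fail fast instead of making the customer wait forever.
        raise RuntimeError(f"request to {url} failed or timed out: {err}") from err
```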

Retry

Also one of the most basic strategies.

  1. Send request.
  2. Receive response.
  3. Not satisfied with it? Try again!

It might be a simple one, but beware of overloading your partners. They might already be suffering from something else, and the last thing they need is everyone retrying all of their requests at once. Make sure you implement randomized and/or exponentially growing backoff intervals between attempts, as in the sketch below.
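
A minimal sketch of a retry loop with exponential backoff and full jitter. It's plain Python; `fetch_with_timeout` and `URL` are the hypothetical pieces from the Timeout sketch above, and the attempt count and base delay are arbitrary:

```python
import random
import time

def retry(call, attempts: int = 3, base_delay_s: float = 0.1):
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, give up and surface the error
            # Exponential backoff with full jitter, so a crowd of clients
            # doesn't hammer an already struggling partner in lockstep.
            time.sleep(random.uniform(0, base_delay_s * (2 ** attempt)))

# usage: items = retry(lambda: fetch_with_timeout(URL))
```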

Fallback

Basically it's the same as Retry, but we retry at a different address; call it a distributed retry. There might be a networking issue, a data center down, or who knows what. A sketch follows the list below.

  1. Send request.
  2. Receive response.
  3. Not satisfied with it? Send request somewhere else!
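
A minimal fallback sketch, assuming the hypothetical `fetch_with_timeout` from the Timeout sketch and two made-up replica addresses:

```python
# Hypothetical replicas of the same service, e.g. in different data centers.
ADDRESSES = [
    "https://dc1.partner.example.com/api/items",
    "https://dc2.partner.example.com/api/items",
]

def fetch_with_fallback(addresses):
    last_error = None
    for address in addresses:
        try:
            return fetch_with_timeout(address)
        except Exception as err:
            last_error = err  # this replica didn't satisfy us, try the next one
    raise RuntimeError("all fallback addresses failed") from last_error
```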

Circuit Breaker

Circuit Breaker is a more advanced strategy. It works exactly like an electrical circuit breaker: it stops sending requests to the remote service.

  1. Send request.
  2. Receive response.
  3. Collect statistics.
  4. Not satisfied with the statistics? Stop sending requests! Fail fast!

As with Timeout, there's no direct improvement for us, but it improves the overall system's health. If you see an open circuit for a data center, you can start sending requests to another one, or page the owning team. A simplified sketch is below.
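
This is a deliberately simplified circuit breaker sketch. Real implementations usually also track a half-open state and richer statistics, such as error rates over a sliding window; the thresholds here are arbitrary:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (requests flow)

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # The circuit is open: fail fast, don't bother the remote service.
                raise RuntimeError("circuit is open")
            # Cool-down elapsed, give the remote service another chance.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success resets the statistics
        return result

# usage: breaker = CircuitBreaker(); breaker.call(lambda: fetch_with_timeout(URL))
```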

Rate Limiting

Rate Limiting, or Throttling, is special: it's mostly used for protection from incoming requests, while all the other strategies I mention protect you when making outgoing requests. You don't want your service to be overloaded by accident or by a malicious actor. Even if you're not a first-line service, you'll need it. Since it usually doesn't serve as a quota, it can be applied separately on the machine level, i.e. without a distributed throttling solution.

  1. Receive request.
  2. You've had enough? Drop it!

Make sure you handle Rate Limiting in your client code too, as other teams might implement it on their side, and they won't be happy if you bug them about failures that aren't actually failures. A per-machine token bucket sketch is below.
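
A minimal per-machine token bucket sketch in plain Python, with arbitrary numbers. Because it's purely local, it needs no distributed throttling solution, exactly as described above:

```python
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s        # how many requests per second we sustain
        self.capacity = float(burst)  # how big a spike we tolerate
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the time elapsed, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # you've had enough, drop the request

limiter = TokenBucket(rate_per_s=100, burst=20)

def handle(request):
    if not limiter.allow():
        return 429  # Too Many Requests: not a real failure, clients should back off
    return 200      # normal request handling would go here
```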

Hedging

What's your first thought when you see the word "hedging"? I imagine a stockbroker or something like that. But here in software engineering it's different. Again, let's start with a dictionary definition of "to hedge": protect oneself against loss by making balancing or compensating transactions. With a small refactoring we get the following: protect your client app against the failure of a remote service by making a compensating request. What we're interested in here is the compensating request. Typically Hedging works like this:

  1. Make request #1.
  2. Wait for a while.
  3. If #1 isn't finished, make request #2 at another address.
  4. Use the result from whichever request comes first.

Repeat steps 1-3 as many times as you want: you're limited only by your imagination and the integer capacity of your language. Hedging is particularly useful in distributed systems where you can send the hedge request to a different data center, usually the second closest one. For the best QoS improvement, hedging is done against at least two data centers. But in the real world it makes sense to hedge even to a single secondary DC! With a primary-secondary hedging strategy, our service stays resilient to a partner's outage in either of the two data centers. A sketch is below.
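
Here's a minimal hedging sketch using Python threads. It assumes the hypothetical `fetch_with_timeout` and replica addresses from the earlier sketches; the 50 ms hedge delay is arbitrary and in practice is often tuned to a high percentile of the primary's latency:

```python
import concurrent.futures

# Hypothetical replicas, ordered by proximity: primary first, then the hedge target.
REPLICAS = [
    "https://dc1.partner.example.com/api/items",
    "https://dc2.partner.example.com/api/items",
]

def hedged_fetch(addresses, hedge_delay_s: float = 0.05):
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(addresses)) as pool:
        futures = [pool.submit(fetch_with_timeout, addresses[0])]
        # Wait a little; if the primary hasn't answered yet, fire the hedge request.
        done, _ = concurrent.futures.wait(futures, timeout=hedge_delay_s)
        if not done and len(addresses) > 1:
            futures.append(pool.submit(fetch_with_timeout, addresses[1]))
        # Use the result from whichever request finishes (successfully) first.
        for future in concurrent.futures.as_completed(futures):
            try:
                return future.result()
            except Exception:
                continue  # that attempt failed, wait for the other one
        raise RuntimeError("all hedged requests failed")
```

Note that this sketch lets the slower request run to completion in the background; production implementations usually cancel it once a winner is known.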

Combine Them All!

So, now with all this you can start combining the strategies to find what works best for your service and your partners' services. Do you want a timeout for each retry attempt, or only for the whole request execution? Why not both! Do you need a retry for every hedging attempt, or will you retry the whole hedged request? It's up to you! One possible composition is sketched below.
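
As one illustration (not the only way to do it), here's how the earlier sketches could be composed: every attempt carries its own timeout via `fetch_with_timeout`, the hedged request races two replicas, and the whole hedged request is retried with backoff:

```python
def resilient_fetch():
    # Retry the whole hedged request; each hedge attempt already has its own timeout.
    return retry(lambda: hedged_fetch(REPLICAS), attempts=3)
```

Whether the timeout, retry, and circuit breaker sit inside or outside the hedge is exactly the kind of knob you should tune deliberately for your own traffic.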

Stay resilient, my friends.