Kubernetes Liveness and Readiness Probes Revisited: How to Avoid Shooting Yourself in the Other Foot

I expand on these ideas in my presentation Kubernetes Probes: How to Avoid Shooting Yourself in the Foot.

Previously, I wrote an essay describing how Kubernetes liveness and readiness probes can unintentionally reduce service availability, or result in prolonged outages. In the conclusion of that essay, I highlighted Lorin's Conjecture:

Once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behaviour of a subsystem whose primary purpose was to improve reliability

Kubernetes liveness and readiness probes are tools designed to improve service reliability and availability. However, without considering the dynamics of the entire system, especially exceptional dynamics, you risk making the reliability and availability of a service worse, rather than better. Since publishing my original article, I have encountered even more situations in which liveness and readiness probes can inadvertently degrade service availability. I will expand on two of these cases in this article.

A Readiness Probe with No Liveness Probe

I will start by extending an example from my previous article. The server defined below, written in Scala using Akka HTTP, loads a large cache into memory at startup, before it can handle requests. Once the cache is loaded, the atomic variable loaded is set to true. Note that unlike my earlier example, I modified the program to log an error and keep running if it fails to load the cache. I will expand on why I did this later in this article.

object CacheServer extends App with CacheServerRoutes with CacheServerProbeRoutes with StrictLogging {
  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = ExecutionContext.Implicits.global

  val routes: Route = cacheRoutes ~ probeRoutes

  Http().bindAndHandle(routes, "0.0.0.0", 8888)

  val loaded = new AtomicBoolean(false)

  val cache = Cache()
  cache.load().onComplete {
    case Success(_) => loaded.set(true)
    case Failure(ex) => logger.error(s"Failed to load cache : $ex")
  }
}

Since the cache can take a few minutes to load, the Kubernetes deployment defines a readiness probe, so that requests are not routed to the pod until it is able to serve them.

spec:  
  containers:
  - name: cache-server
    image: cache-server/latest
    readinessProbe:
      httpGet:
        path: /readiness
        port: 8888
      periodSeconds: 60

The HTTP route that serves the readiness probe is defined as follows.

trait CacheServerProbeRoutes {
  def loaded: AtomicBoolean

  val probeRoutes: Route = path("readiness") {
    get {
      if (loaded.get) complete(StatusCodes.OK)
      else complete(StatusCodes.ServiceUnavailable)
    }
  }
}

The HTTP route to return the value in the cache for a given identifier is defined below. If the cache is not yet loaded, the server will return HTTP 503, Service Unavailable. If the cache is loaded, the server will look up the identifier in the cache and return the value, or HTTP 404, Not Found, if the identifier does not exist.

trait CacheServerRoutes {
  def loaded: AtomicBoolean
  def cache: Cache

  val cacheRoutes: Route = path("cache" / IntNumber) { id =>
    get {
      // Return 503 Service Unavailable until the cache has loaded
      if (!loaded.get) {
        complete(StatusCodes.ServiceUnavailable)
      } else {
        cache.get(id) match {
          case Some(body) =>
            complete(HttpEntity(ContentTypes.`application/json`, body))
          case None =>
            complete(StatusCodes.NotFound)
        }
      }
    }
  }
}

Consider what happens if the cache fails to load. The service only attempts to load the cache once, at startup. If it fails to load the cache, it logs an error message, but the service keeps running. The readiness-probe route will not return a successful HTTP status code if the cache is not loaded; therefore, the pod will never become ready. The pods backing the service might look something like the following. All five pods have a status of Running, but only two of the five pods are ready to serve requests, despite running for over 50 minutes.

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
cache-server-674c544685-5x64f   0/1       Running   0          52m
cache-server-674c544685-bk5mk   1/1       Running   0          54m
cache-server-674c544685-ggh4j   0/1       Running   0          53m
cache-server-674c544685-m7pcb   0/1       Running   0          52m
cache-server-674c544685-rtbhw   1/1       Running   0          52m

This presents a latent problem. The service may appear to be operating normally, with service-level health checks responding successfully, but fewer pods than desired are available to handle requests. As I mentioned in my previous article, we also need to consider more than just the initial deployment. Pods will be restarted as they are rebalanced in the cluster, or as Kubernetes nodes are restarted. The service could eventually become completely unavailable as pods are restarted, especially if a coincident event, like an object store being temporarily unavailable, prevents the cache from loading on all pods at the same time.
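
One way to catch this latent problem is to compare desired and available replicas at the deployment level, rather than relying only on service-level health checks. Assuming the deployment is named cache-server, and depending on the kubectl version, the output would look something like the following, with only two of the five desired pods available.

$ kubectl get deployment cache-server
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
cache-server   2/5     5            2           54m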

In contrast, consider what would happen if this deployment also had a liveness probe that exercises the cache route.

spec:  
  containers:
  - name: cache-server
    image: cache-server/latest
    readinessProbe:
      httpGet:
        path: /readiness
        port: 8888
      periodSeconds: 60
    livenessProbe:
      httpGet:
        path: /cache/42
        port: 8888
      initialDelaySeconds: 300
      periodSeconds: 60

If the cache failed to load, the liveness probe would eventually fail and the container would be restarted, giving it another chance to load the cache. Eventually, the cache should load successfully, meaning the service will return to normal operation on its own, without having to alert someone to intervene. The pods may restart a number of times until the transient failure preventing the cache from loading eventually resolves. The output from kubectl get pods might look like the following, where all of the pods are now ready, but some pods were restarted multiple times.

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
cache-server-7597c6d795-g4tzg   1/1       Running   0          10d
cache-server-7597c6d795-jhp4s   1/1       Running   4          32m
cache-server-7597c6d795-k9szq   1/1       Running   3          32m
cache-server-7597c6d795-nd498   1/1       Running   3          32m
cache-server-7597c6d795-q6mbv   1/1       Running   0          10d

Since the cache takes minutes to load, it is important that the liveness probe have an initial delay, configured with initialDelaySeconds, that is longer than the time it takes to load the cache; otherwise, there is a risk of the pod never starting, as I detailed in my previous article.

Similar to the example that I just presented, for a server that runs the risk of becoming unavailable due to deadlock, it is equally important to configure a liveness probe in addition to a readiness probe; otherwise, it can encounter the same issue. A rough sketch of what such a liveness probe might exercise follows.
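
The hypothetical route below, in the same Akka HTTP style as the earlier examples, runs a trivial task on the server's shared execution context. If that execution context is deadlocked or exhausted, the response never completes, the kubelet's probe timeout expires, and the container is eventually restarted. The route name and the check itself are illustrative; they are not part of the cache server above.

trait CacheServerLivenessRoutes {
  implicit def executionContext: ExecutionContext

  val livenessRoutes: Route = path("liveness") {
    get {
      // Execute a trivial task on the shared execution context. If the pool
      // is deadlocked or exhausted, this Future never completes, the response
      // hangs, and the probe fails once the kubelet's timeout expires.
      onComplete(Future("ok")) {
        case Success(_) => complete(StatusCodes.OK)
        case Failure(_) => complete(StatusCodes.InternalServerError)
      }
    }
  }
}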

Let it Crash!

One problem with the example that I just presented is that the server attempts to handle the error when it fails to load the cache, rather than just throwing an exception and exiting the process. An emphasis on handling errors comes from programming models where it is important for the program to recover on its own, or to handle exceptions in such a way that the work executed on one thread does not impact the work executed on another thread. Think multi-threaded application servers or device drivers written in C++.[1] Handling errors can also be important for cleaning up resources, like memory allocations or file handles, following a failure. This style of programming continues to have a lot of influence; perhaps somewhat understandably, it just feels wrong not to be a good citizen and clean up after ourselves.[2] However, alternative programming models for handling errors exist, like monadic error handling in functional-programming languages, or fine-grained supervision strategies in actor models, like Erlang, Akka, and Microsoft Orleans, that are designed for reliable, distributed computing.
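
To make the contrast concrete, the sketch below shows what monadic error handling might look like in Scala. The Config, readConfig, and buildCache names are hypothetical, not part of the cache server above; the point is that failures are ordinary values that compose, and the caller decides whether to recover, degrade, or let it crash.

object MonadicErrorHandling {
  final case class Config(cachePath: String)

  // Hypothetical helpers: failures are returned as values, not thrown.
  def readConfig(path: String): Either[String, Config] =
    Right(Config(path)) // or Left("cannot read configuration")

  def buildCache(config: Config): Either[String, Cache] =
    Right(Cache()) // or Left(s"cannot load cache from ${config.cachePath}")

  // An error in either step short-circuits the for-comprehension as a value,
  // leaving the caller to decide how to react.
  val cacheOrError: Either[String, Cache] =
    for {
      config <- readConfig("/etc/cache-server.conf")
      cache  <- buildCache(config)
    } yield cache

  // The caller is still free to let it crash.
  val cache: Cache = cacheOrError.fold(error => sys.error(error), identity)
}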

Joe Armstrong, the co-creator of the Erlang programming language, in his PhD dissertation entitled Making reliable distributed systems in the presence of software errors, questioned what an error is:

But what is an error? For programming purposes we can say that:

  • Exceptions occur when the run-time system does not know what to do.
  • Errors occur when the programmer doesn’t know what to do.

If an exception is generated by the run-time system, but the programmer had foreseen this and knows what to do to correct the condition that caused the exception, then this is not an error. For example, opening a file which does not exist might cause an exception, but the programmer might decide that this is not an error. They therefore write code which traps this exception and takes the necessary corrective action.

Errors occur when the programmer does not know what to do.

This informed Armstrong's view for what the programmer should do in the event of an error:

How does our philosophy of handling errors fit in with coding practices? What kind of code must the programmer write when they find an error? The philosophy is: let some other process fix the error, but what does this mean for their code? The answer is: let it crash. By this I mean that in the event of an error, then the program should just crash.

James Hamilton, in his paper On Designing and Deploying Internet-Scale Services, describes the importance of services recovering from failure without the need for administrative action:

If a hardware failure requires any immediate administrative action, the service simply won’t scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren’t frequently used, they won’t work when needed.

This is exactly the point of liveness and readiness probes: to deal with failure without requiring immediate administrative action.

Armstrong believed that processes should "do what they are supposed to do or fail as soon as possible" and that it was important that "failure, and the reason for failure, can be detected by remote processes". Returning to my example, if the program simply exited after failing to load the cache, by default, Kubernetes would detect that the container had crashed and restart it, with an exponential back-off delay. Eventually, the cache should load successfully, achieving the same result as configuring a liveness probe, like in the previous example.

object CacheServer extends App with CacheServerRoutes with CacheServerProbeRoutes {
  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = ExecutionContext.Implicits.global

  val routes: Route = cacheRoutes ~ probeRoutes

  Http().bindAndHandle(routes, "0.0.0.0", 8888)

  val loaded = new AtomicBoolean(false)

  val cache = Cache()
  cache.load().onComplete {
    case Success(_) => loaded.set(true)
    case Failure(ex) =>
      // Let it Crash! Log the failure, terminate the actor system, and exit
      // with a non-zero status so that Kubernetes restarts the container.
      system.log.error(s"Failed to load cache : $ex")
      system.terminate().onComplete { _ =>
        sys.exit(1)
      }
  }
}

Restarting a container by exiting it, or by leveraging the liveness probe, may improve service reliability and availability, but it remains important to monitor container restarts in order to ultimately identify and resolve the underlying issues.
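
For example, sorting pods by restart count with kubectl is a simple way to spot containers that are repeatedly crashing; the command below assumes the container of interest is the first one in the pod specification.

$ kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'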

As a programmer, when deciding how to handle errors, you need to consider all of the tools that are available to you, including the ones from the run-time environment, not just the ones native to your programming language or framework. Since Kubernetes will restart containers automatically and it will do so with the added benefit of an exponential back-off delay, the most reliable thing to do when you encounter an error may be to just let it crash!


  1. For example, the Google C++ Style Guide discourages the use of exceptions because it is simply impractical to safely incorporate code that throws exceptions into the huge amount of existing code that follows exception-free conventions, like the use of error codes and assertions. ↩︎

  2. Go does not have exceptions, but it does encourage the programmer to handle errors explicitly: assigning a returned error to a variable and never using it is a compile error. Of course, the programmer is still free to ignore these errors. ↩︎