Kubernetes Liveness and Readiness Probes: How to Avoid Shooting Yourself in the Foot

Nov 4, 2018

I expand on these ideas in my presentation Kubernetes Probes: How to Avoid Shooting Yourself in the Foot.

Kubernetes liveness and readiness probes can be used to make a service more robust and more resilient, by reducing operational issues and improving the quality of service. However, if these probes are not implemented carefully, they can severely degrade the overall operation of a service, to a point where you would be better off without them.

In this article, I will explore how to avoid making service reliability worse when implementing Kubernetes liveness and readiness probes. While the focus of this article is on Kubernetes, the concepts I will highlight are applicable to any application or infrastructural mechanism used for inferring the health of a service and taking automatic, remedial action.

Kubernetes Liveness and Readiness Probes

Kubernetes uses liveness probes to know when to restart a container. If a container is unresponsive—perhaps the application is deadlocked due to a multi-threading defect—restarting the container can make the application more available, despite the defect. It certainly beats paging someone in the middle of the night to restart a container.^[1]

Kubernetes uses readiness probes to decide when the container is available for accepting traffic. The readiness probe is used to control which pods are used as the backends for a service. A pod is considered ready when all of its containers are ready. If a pod is not ready, it is removed from service load balancers. For example, if a container loads a large cache at startup and takes minutes to start, you do not want to send requests to this container until it is ready, or the requests will fail—you want to route requests to other pods, which are capable of servicing requests.

At the time of this writing, Kubernetes supports three mechanisms for implementing liveness and readiness probes: 1) running a command inside a container, 2) making an HTTP request against a container, or 3) opening a TCP socket against a container.
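Each of the three mechanisms corresponds to a handler field on the probe. The fragment below is a minimal sketch of the shape of each handler; the container names, the image, the command, the path, and the ports are hypothetical and not taken from a real deployment.

```yaml
spec:
  containers:
  - name: exec-example            # hypothetical container
    image: example/app            # hypothetical image
    livenessProbe:
      exec:                       # 1) run a command inside the container;
        command:                  #    exit status 0 means success
        - cat
        - /tmp/healthy
  - name: http-example
    image: example/app
    readinessProbe:
      httpGet:                    # 2) HTTP GET; a status of 200-399 means success
        path: /readiness
        port: 8888
  - name: tcp-example
    image: example/app
    readinessProbe:
      tcpSocket:                  # 3) success if a TCP connection can be opened
        port: 8888
```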

A probe has a number of configuration parameters to control its behaviour, like how often to execute the probe (periodSeconds); how long to wait after starting the container to initiate the probe (initialDelaySeconds); the number of seconds after which the probe is considered failed (timeoutSeconds); and how many times the probe can fail before giving up (failureThreshold). For a liveness probe, giving up means the container will be restarted. For a readiness probe, giving up means not routing traffic to the pod, but the container is not restarted. Liveness and readiness probes can be used in conjunction.
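As a rough illustration, the parameters described above correspond to the following probe fields, shown here with their default values. The path and port are borrowed from the example later in this article; successThreshold is not discussed in this article and is listed only for completeness.

```yaml
readinessProbe:
  httpGet:
    path: /readiness
    port: 8888
  initialDelaySeconds: 0   # wait after container start before the first probe (default 0)
  periodSeconds: 10        # how often to run the probe (default 10)
  timeoutSeconds: 1        # seconds after which a probe attempt counts as failed (default 1)
  failureThreshold: 3      # consecutive failures before giving up (default 3)
  successThreshold: 1      # consecutive successes to be considered healthy again (default 1)
```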

Shooting Yourself in the Foot with Readiness Probes

The Kubernetes documentation, as well as many blog posts and examples, somewhat misleadingly emphasizes the use of the readiness probe when starting a container. This is the most common consideration: we want to avoid routing requests to the pod until it is ready to accept traffic. However, the readiness probe will continue to be called throughout the lifetime of the container, every periodSeconds, so that the container can make itself temporarily unavailable when one of its dependencies is unavailable, or while running a large batch job, performing maintenance, or something similar.

If you do not realize that the readiness probe will continue to be called after the container is started, you can design readiness probes that can result in serious problems at runtime. Even if you do understand this behaviour, you can still encounter serious problems if the readiness probe does not consider exceptional system dynamics. I will illustrate this through an example.

The following application, implemented in Scala using Akka HTTP, loads a large cache into memory, at startup, before it can handle requests. After the cache is loaded, the atomic variable loaded is set to true. If the cache fails to load, the container will exit and be restarted by Kubernetes, with an exponential-backoff delay.

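import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.ExecutionContext
import scala.util.{Failure, Success}
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import akka.stream.ActorMaterializer

object CacheServer extends App with CacheServerRoutes with CacheServerProbeRoutes {
  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = ExecutionContext.Implicits.global

  // Initialize the flag before binding, so the /readiness route never sees it unset.
  val loaded = new AtomicBoolean(false)

  val routes: Route = cacheRoutes ~ probeRoutes

  Http().bindAndHandle(routes, "0.0.0.0", 8888)

  val cache = Cache()
  cache.load().onComplete {
    case Success(_) => loaded.set(true)
    case Failure(ex) =>
      // Exit with a non-zero status so that Kubernetes restarts the container.
      Console.err.println(s"Failed to load cache: $ex")
      system.terminate().onComplete(_ => sys.exit(1))
  }
}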

The application uses the following /readiness HTTP route for the Kubernetes readiness probe. If the cache is loaded, the /readiness route will always return successfully.

trait CacheServerProbeRoutes {
  def loaded: AtomicBoolean

  val probeRoutes: Route = path("readiness") {
    get {
      if (loaded.get) complete(StatusCodes.OK)
      else complete(StatusCodes.ServiceUnavailable)
    }
  }
}

The HTTP readiness probe is configured as follows:

spec:
  containers:
  - name: cache-server
    image: cache-server/latest
    readinessProbe:
      httpGet:
        path: /readiness
        port: 8888
      initialDelaySeconds: 300
      periodSeconds: 30

This readiness-probe implementation is extremely reliable. Requests are not routed to the application before the cache is loaded. Once the cache is loaded, the /readiness route will perpetually return HTTP 200 and the pod will always be considered ready.

Contrast this implementation with the following application that makes HTTP requests to its dependent services as part of its readiness probe. A readiness probe like this can be useful for catching configuration issues at deployment time—like using the wrong certificate for mutual-TLS, or the wrong credentials for database authentication—ensuring that the service can communicate with all of its dependencies, before becoming ready.

trait ServerWithDependenciesProbeRoutes {
  implicit def ec: ExecutionContext

  def httpClient: HttpRequest => Future[HttpResponse]

  private def httpReadinessRequest(
    uri: Uri,
    f: HttpRequest => Future[HttpResponse] = httpClient): Future[HttpResponse] = {
    f(HttpRequest(method = HttpMethods.HEAD, uri = uri))
  }

  private def checkStatusCode(response: Try[HttpResponse]): Try[Unit] = {
    response match {
      case Success(x) if x.status == StatusCodes.OK => Success(())
      case Success(x) => Failure(HttpStatusCodeException(x.status))
      case Failure(ex) => Failure(HttpClientException(ex))
    }
  }

  private def readinessProbe() = {
    val authorizationCheck = httpReadinessRequest("https://authorization.service").transform(checkStatusCode)
    val inventoryCheck = httpReadinessRequest("https://inventory.service").transform(checkStatusCode)
    val telemetryCheck = httpReadinessRequest("https://telemetry.service").transform(checkStatusCode)

    val result = for {
      authorizationResult <- authorizationCheck
      inventoryResult <- inventoryCheck
      telemetryResult <- telemetryCheck
    } yield (authorizationResult, inventoryResult, telemetryResult)

    result
  }

  val probeRoutes: Route = path("readiness") {
    get {
      onComplete(readinessProbe()) {
        case Success(_) => complete(StatusCodes.OK)
        case Failure(_) => complete(StatusCodes.ServiceUnavailable)
      }
    }
  }
}

These concurrent HTTP requests normally return extremely quickly—on the order of milliseconds. The default timeout for the readiness probe is one second. Because these requests succeed the vast majority of the time, it is easy to naively accept the defaults.

But consider what happens if there is a small, temporary increase in latency to one dependent service, perhaps due to network congestion, a garbage-collection pause, or a temporary increase in load on the dependent service. If latency to the dependency rises even slightly above one second, the readiness probe will fail and, once failureThreshold consecutive probes have failed, Kubernetes will no longer route traffic to the pod. Since all of the pods share the same dependency, it is very likely that all pods backing the service will fail the readiness probe at the same time. This will result in all pods being removed from the service routing. With no ready pods backing the service, every request to the service fails. Depending on how the service is exposed, clients typically see refused connections or an HTTP error from the ingress layer, such as a 503, or a 404 from a default backend. We have created a single point of failure that renders the service completely unavailable, despite our best efforts to improve availability.^[2] In this scenario, we would deliver a much better end-user experience by letting the client requests succeed, albeit with slightly increased latency, rather than making the entire service unavailable for seconds or minutes at a time.
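To make the interaction between the probe timeout and failureThreshold concrete, here is a small simulation. It is a deliberately simplified model, not the kubelet's actual implementation: it assumes one latency sample per probe attempt and treats any sample above the timeout as a failed probe. The object and method names are hypothetical.

```scala
object ReadinessSimulation {
  /** Returns true if the pod would be marked unready, that is, if some run of
    * `failureThreshold` consecutive probe latencies exceeds the timeout. */
  def becomesUnready(latenciesMs: Seq[Int], timeoutMs: Int, failureThreshold: Int): Boolean = {
    // Track the current run of consecutive failures and the longest run seen so far.
    val (_, longestRun) = latenciesMs.foldLeft((0, 0)) { case ((run, best), latencyMs) =>
      val nextRun = if (latencyMs > timeoutMs) run + 1 else 0
      (nextRun, math.max(best, nextRun))
    }
    longestRun >= failureThreshold
  }
}
```

Under these assumptions, three consecutive probes at 1200, 1300, and 1100 ms take the pod out of rotation with the default one-second timeout, while the same latencies are harmless with a two-second timeout. Because every pod observes the same shared dependency, every pod flips at about the same time.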

If the readiness probe is verifying a dependency that is exclusive to the container—a private cache or database—then you can be more aggressive in failing the readiness probe, with the assumption that container dependencies are independent. However, if the readiness probe is verifying a shared dependency—like a common service used for authentication, authorization, metrics, logging, or metadata—you should be very conservative in failing the readiness probe.

My recommendations are:

If the container evaluates a shared dependency in the readiness probe, set the readiness-probe timeout longer than the maximum response time for that dependency.
The default failureThreshold count is three: this is the number of consecutive times the readiness probe needs to fail before the pod will no longer be considered ready. Depending on the frequency of the readiness probe, which is determined by the periodSeconds parameter, you may want to increase the failureThreshold count. The idea is to avoid failing the readiness probe prematurely, before temporary system dynamics have elapsed and response latencies have returned to normal.
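Applied to the readiness probe from the earlier example, these recommendations might look like the following sketch. The timeoutSeconds and failureThreshold values are assumptions chosen for illustration; suitable values depend on the measured response times of your own shared dependencies.

```yaml
readinessProbe:
  httpGet:
    path: /readiness
    port: 8888
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 5      # assumed: above the slowest response expected from the shared dependency
  failureThreshold: 6    # assumed: tolerate several consecutive slow probes before becoming unready
```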

Shooting Yourself in the Foot with Liveness Probes

Recall that a liveness-probe failure will result in the container being restarted. Unlike a readiness probe, it is not idiomatic to check dependencies in a liveness probe. A liveness probe should be used to check if the container itself has become unresponsive.

One problem with a liveness probe is that the probe may not actually verify the responsiveness of the service. For example, if the service hosts two web servers (one for the service routes and one for status routes, like readiness and liveness probes, or metrics collection), the service can be slow or unresponsive while the liveness-probe route returns just fine. To be effective, the liveness probe must exercise the service in a similar manner to the clients that depend on it.

Similar to the readiness probe, it is also important to consider dynamics changing over time. If the liveness-probe timeout is too short, a small increase in response time—perhaps caused by a temporary increase in load—could result in the container being restarted. The restart may result in even more load for other pods backing the service, causing a further cascade of liveness probe failures, making the overall availability of the service even worse. Configuring liveness-probe timeouts on the order of client timeouts, and using a forgiving failureThreshold count, can guard against these cascading failures.

A subtle problem with liveness probes comes from container startup latency changing over time. This can be a result of network topology changes, changes in resource allocation, or just increasing load as your service scales. If a container is restarted, due to a Kubernetes node failure or a liveness-probe failure, and the initialDelaySeconds parameter is not long enough, you risk never starting the application: it is killed and restarted, repeatedly, before it has finished starting. The initialDelaySeconds parameter should be longer than the maximum initialization time for the container. To avoid surprises from these dynamics changing over time, it is advantageous to have pods restart on a somewhat regular basis; it should not necessarily be a goal to have individual pods backing a service run for weeks or months at a time. It is important to regularly exercise and evaluate deployments, restarts, and failures as part of running a reliable service.
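The restart loop described above can be estimated with some simple arithmetic. The sketch below is an approximation under stated assumptions: the first probe runs initialDelaySeconds after the container starts, later probes run every periodSeconds, and the container is restarted once failureThreshold consecutive probes have failed. The real kubelet adds some jitter, so treat the result as a rough bound. The object and method names are hypothetical.

```scala
object StartupBudget {
  /** Approximate time, in seconds after container start, at which the last
    * probe allowed by failureThreshold begins. An application that is still
    * unresponsive at that point is restarted. */
  def budgetSeconds(initialDelaySeconds: Int, periodSeconds: Int, failureThreshold: Int): Int =
    initialDelaySeconds + (failureThreshold - 1) * periodSeconds

  /** True if a startup taking `startupSeconds` fits within the budget. */
  def survives(startupSeconds: Int, initialDelaySeconds: Int, periodSeconds: Int, failureThreshold: Int): Boolean =
    startupSeconds <= budgetSeconds(initialDelaySeconds, periodSeconds, failureThreshold)
}
```

For example, with initialDelaySeconds of 300, periodSeconds of 30, and a failureThreshold of 3, the budget is 300 + 2 × 30 = 360 seconds. If startup time has drifted to 400 seconds, the container is killed on every attempt and never becomes available.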

My recommendations are:

Avoid checking dependencies in liveness probes. Liveness probes should be inexpensive and have response times with minimal variance.
Set liveness-probe timeouts conservatively, so that system dynamics can temporarily or permanently change without resulting in excessive liveness-probe failures. Consider setting liveness-probe timeouts to the same order of magnitude as client timeouts.
Set the initialDelaySeconds parameter conservatively, so that containers can be reliably restarted, even as startup dynamics change over time.
Regularly restart containers to exercise startup dynamics and avoid unexpected behavioural changes during initialization.
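A liveness probe following these recommendations might be configured along the lines below. This is only a sketch: the /liveness route is hypothetical (assumed to be inexpensive, free of dependency checks, and served by the same server as the service routes), and every numeric value is an assumption to be replaced with figures measured for your own service.

```yaml
livenessProbe:
  httpGet:
    path: /liveness          # hypothetical route, served by the same server as the service routes
    port: 8888
  initialDelaySeconds: 600   # assumed: comfortably above the longest observed startup time
  periodSeconds: 30
  timeoutSeconds: 10         # assumed: same order of magnitude as client timeouts
  failureThreshold: 5        # assumed: several consecutive failures before a restart
```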

Conclusion

Kubernetes liveness and readiness probes can greatly improve the robustness and resilience of your service and provide a superior end-user experience. However, if you do not carefully consider how these probes are used, and especially if you do not consider extraordinary system dynamics, however rare, you risk making the availability of the service worse, rather than better.

You may think that an ounce of prevention is worth a pound of cure. Unfortunately, sometimes the cure can be worse than the disease. Kubernetes liveness and readiness probes are designed to improve reliability, but if they are not implemented carefully, Lorin's Conjecture applies:

Once a system reaches a certain level of reliability, most major incidents will involve:

A manual intervention that was intended to mitigate a minor incident, or

Unexpected behaviour of a subsystem whose primary purpose was to improve reliability

The advice in this article might help you avoid shooting yourself in the foot, but there are even more ways to shoot yourself in the foot.

[1] It can be important, however, to identify and fix the underlying issue, so monitoring and reporting on the frequency of container restarts becomes an important operational metric.
[2] No doubt this will happen in the middle of the night, on a weekend or a holiday, rather than during normal business hours.