Kubernetes Startup Probes: Getting Your Feet Under You
Kubernetes liveness and readiness probes can be used to improve the reliability of a service. If they are not used with care, however, they can do the opposite, and degrade the reliability of a service through subtle, unintended consequences.
I wrote a three-part series on how to avoid "shooting yourself in the foot" with Kubernetes liveness and readiness probes. A number of the issues I detailed were related to startup dynamics. Startup probes, introduced in Kubernetes 1.16, were designed to address many of these issues. If you will allow me to continue my self-indulgent podiatric joke: startup probes allow you to get your feet underneath you—at least long enough to then shoot yourself in the foot with the liveness and readiness probes, of course.
To review, the liveness probe will restart a container when it becomes unresponsive and the readiness probe is used to decide when a container is ready to start or stop accepting traffic. Many people assume the readiness probe is only called at startup, but it continues to be called even after the container is advertised as ready. If a container is temporarily busy, for example, it could become un-ready so that requests are routed to other pods. If a readiness probe evaluates a shared dependency among a set of pods, one risks making the entire service unavailable if it is configured too aggressively. However, there is no way to have an aggressive readiness probe at startup—to make containers available for requests as quickly as possible—with a less aggressive readiness probe during steady-state operation.
Many applications have startup dynamics that differ significantly from steady-state. Dynamics that are unique to application initialization include: populating a cache; re-materializing derived state from a journal in an event-sourced application; or establishing persistent connections to dependencies, like databases. This makes tuning liveness and readiness probes challenging. For example, if the initialDelaySeconds for a liveness probe is not conservative enough, a container may be killed before it has started. This is especially challenging if the startup dynamics change over time, perhaps as your system scales, or exhibits seasonality in workloads. If a container has not been restarted in a while and the startup time has increased, you risk not being able to restart pods until the configuration is modified to increase that delay.
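To make the failure mode concrete, here is a minimal sketch of a liveness probe that relies on initialDelaySeconds alone. The pod name, image, port, and /healthz route are hypothetical, and the numbers are illustrative rather than recommendations.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-warming-service        # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/cache-warming-service:1.0   # hypothetical image
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /healthz             # assumed health route
          port: 8080
        initialDelaySeconds: 30      # first liveness check roughly 30 s after the container starts
        periodSeconds: 10            # then one check every 10 s
        failureThreshold: 3          # restart after 3 consecutive failures
```

With these values, a container that is still initializing after roughly a minute fails three checks in a row and is restarted. If startup later grows beyond that window, every restart is cut short in the same way until someone edits the manifest.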
Startup probes were designed to address these issues. The startup probe is only called during startup and is used to determine when the container is ready to accept requests. If a startup probe is configured, the liveness and readiness checks are disabled until the startup probe succeeds. If a startup probe exceeds the configured failureThreshold without succeeding, the container is killed and restarted, subject to the pod's restartPolicy, a behaviour analogous to the liveness probe.
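The sketch below shows how the three probes might sit side by side on one container. The names, image, port, and the /healthz and /ready routes are assumptions for illustration; only the probe fields themselves come from the Kubernetes API.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-starter                 # hypothetical name
spec:
  restartPolicy: Always
  containers:
    - name: app
      image: example.com/slow-starter:1.0   # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz             # assumed route
          port: 8080
        periodSeconds: 10
        failureThreshold: 30         # allows up to about 30 x 10 s = 300 s to start
      livenessProbe:                 # held back until the startup probe succeeds
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:                # held back until the startup probe succeeds
        httpGet:
          path: /ready               # assumed route
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
```

The startup budget here is approximately failureThreshold multiplied by periodSeconds. Because that budget lives in the startup probe, the liveness probe no longer needs a large initialDelaySeconds to protect a slow start.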
All of the caveats I described with container startup apply with the startup probe:
- Set failureThreshold conservatively so that system dynamics can temporarily or permanently change, without resulting in startup probe failures that prevent a container from starting.
- If the route the startup probe calls directly checks dependencies or performs expensive operations, consider setting timeoutSeconds on the same magnitude to avoid accumulating resources or overloading dependencies. Even if the startup probe times out, the service may still be executing the request.
- Regularly restart containers to exercise startup dynamics and avoid unexpected behavioural changes over time. If a pod has run for months or years without restarting, it is something to be concerned about.
Startup probes also have some unique considerations:
- To make a slow-starting container available as soon as possible, use a startup probe with a very short timeout, but also a very high failure threshold, to avoid killing the container before it has started.
- The fact that the readiness and liveness probes are independent from the startup probes allows you to be very conservative with startup probe failures, or perform different checks, perhaps checks that are only relevant at startup, or much too expensive to perform on a regular basis via the readiness or liveness probe.
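The two considerations above could be combined along these lines. This is only a sketch: the /startup, /healthz, and /ready routes are hypothetical, as are the names and the specific numbers, and a real service would need values tuned to its own startup behaviour.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: event-sourced-service        # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/event-sourced-service:1.0   # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /startup             # hypothetical route for startup-only or expensive checks
          port: 8080
        periodSeconds: 2             # poll often so a started container is noticed quickly
        timeoutSeconds: 1            # short timeout per attempt
        failureThreshold: 150        # high threshold: about 150 x 2 s = 300 s before a restart
      livenessProbe:
        httpGet:
          path: /healthz             # hypothetical cheap steady-state check
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready               # hypothetical cheap steady-state check
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
```

Polling every two seconds means the container is handed over to the liveness and readiness probes within a couple of seconds of the first successful startup check, while the high failureThreshold keeps a slow start from being treated as a failure.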
Kubernetes startup probes are now widely available for the managed Kubernetes offerings from the leading cloud providers. Think of startup probes as a combination of liveness and readiness probes that only run at startup. Use startup probes to decouple liveness and readiness checks from application initialization and ultimately make services more reliable. Just be careful not to shoot yourself in the foot.
I like the strategy described by Rob Witoff of never letting a container run for longer than a month without patching, redeploying, and restarting.