TechEarl

Docker Restart Policies and Health Checks

Make containers come back automatically after crashes and reboots, and tell Compose how to wait until a service is actually ready (not just started). Restart policies, HEALTHCHECK, and depends_on: condition: service_healthy.

Ishan Karunaratne

Two related but distinct features: restart policies say what happens when a container exits (crashes, host reboots), and health checks say whether the process inside the container is actually working. Pair them with Compose's depends_on: condition: service_healthy and your stack starts in the right order with the right resilience.

Restart policies

Set with --restart on docker run, or restart: in Compose. Four values:

| Policy | Behavior |
| --- | --- |
| no (default) | Never restart. The container exits and stays exited. |
| on-failure[:max] | Restart only on a non-zero exit. Optional max-retries count. |
| always | Always restart, including after a host reboot. A container you stopped manually also comes back the next time the daemon starts. |
| unless-stopped | Always restart, except when you explicitly stopped it. Same as always minus the surprise. |

bash
docker run -d --restart unless-stopped --name web nginx:alpine

yaml
services:
  web:
    image: nginx:alpine
    restart: unless-stopped

unless-stopped is the right default for almost every long-running service. It restarts on crashes and host reboots (what you want), but stays stopped when you docker stop it explicitly (also what you want — you stopped it on purpose).

always differs in exactly one place: after a docker stop, the next time the daemon starts (e.g., after a host reboot), always starts the container even though you'd stopped it. Surprising. Use unless-stopped instead.

What about exit codes?

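on-failure only restarts on a non-zero exit. The conventions:

  • Exit 0: normal, intentional exit. on-failure doesn't restart.
  • Exit 1-127: application error, or by shell convention a command that could not be run (126) or found (127). on-failure restarts.
  • Exit 137: SIGKILL (128 + 9). Most often the kernel OOM killer, meaning the container ran out of memory. on-failure restarts (and the same OOM will probably happen again, so fix the memory limit).
  • Exit 143: SIGTERM (128 + 15). Usually from docker stop. The code is still non-zero, but Docker skips the restart policy for a container you stopped manually, so it stays down.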

on-failure:5 retries up to 5 times before giving up. Useful when an app crashes on bad input and you want it to try a few times but not loop forever.

For services that should simply stay running, I use unless-stopped. For workers that should give up after repeated failures, on-failure:N.
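The codes above 128 follow the usual shell convention of 128 plus the signal number, which is where 137 (SIGKILL, 9) and 143 (SIGTERM, 15) come from. A minimal sketch of that arithmetic, assuming a POSIX shell. describe_exit is a hypothetical helper, and it models only the "non-zero means failure" rule, not the retry cap or the fact that Docker ignores the policy for a container you stopped by hand.

```shell
#!/bin/sh
# Hypothetical helper: explain a container exit code and say whether
# on-failure would treat it as a failure (any non-zero code).
describe_exit() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: clean exit, on-failure does not restart"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 conventionally mean "killed by signal (code - 128)".
    echo "exit $code: killed by signal $((code - 128)), on-failure restarts"
  else
    echo "exit $code: application error, on-failure restarts"
  fi
}

describe_exit 0
describe_exit 1
describe_exit 137   # 137 - 128 = 9 (SIGKILL)
describe_exit 143   # 143 - 128 = 15 (SIGTERM)
```

To feed it a real value, docker inspect --format '{{.State.ExitCode}}' web prints the last exit code of the web container, and .State.OOMKilled in the same inspect output tells you whether a 137 really was the OOM killer.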

Health checks

A health check tells Docker "is this container actually doing its job?" — separate from "is the process alive?" A web server with the process running but returning 500 to every request is alive, but unhealthy.

In a Dockerfile:

dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

  • --interval=30s: run the check every 30 seconds.
  • --timeout=5s: count the check as failed if it takes longer than 5 seconds.
  • --start-period=10s: grace period after start during which failures don't count toward retries.
  • --retries=3: flip to "unhealthy" only after this many consecutive failures.
  • CMD ...: the command. Exit 0 = healthy, exit 1 = unhealthy. Other codes are reserved, hence the || exit 1.
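How these flags interact is easier to see as a small state machine. The following is a rough model written for illustration, assuming a POSIX shell. health_status and its arguments are hypothetical, and this is not Docker's implementation. It counts checks instead of seconds: GRACE_CHECKS stands in for the number of checks that land inside the start period.

```shell
#!/bin/sh
# Hypothetical model of the health state, not Docker's code.
# Usage: health_status RETRIES GRACE_CHECKS RESULT...
# Each RESULT is the exit code of one successive check (0 = pass).
health_status() {
  retries=$1; grace=$2; shift 2
  status=starting; streak=0; n=0
  for result in "$@"; do
    n=$((n + 1))
    if [ "$result" -eq 0 ]; then
      # Any pass marks the container healthy and resets the failure streak.
      status=healthy; streak=0
      grace=0   # a pass also ends the start period early
    elif [ "$n" -gt "$grace" ]; then
      # Failures outside the start period count toward retries.
      streak=$((streak + 1))
      if [ "$streak" -ge "$retries" ]; then
        status=unhealthy
      fi
    fi
  done
  echo "$status"
}

health_status 3 2 1 1 1 1     # prints "starting": two failures ignored, two counted
health_status 3 2 1 1 1 1 1   # prints "unhealthy": third counted failure
health_status 3 2 1 0 1 1     # prints "healthy": only two failures after the pass
```

The third call shows why a generous start period is cheap: the first passing check ends it, so a fast start is never held back by it.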

In Compose:

yaml
services:
  web:
    image: my-app
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/healthz"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Two test forms:

  • ["CMD", ...] — exec form, runs directly without a shell.
  • ["CMD-SHELL", "..."] — shell form, supports pipes and redirects.

docker ps then shows the status next to each container:

code
CONTAINER ID   IMAGE      ...   STATUS                       NAMES
abc123def      my-app     ...   Up 5 minutes (healthy)       web

Healthcheck patterns by service type

Web app:

yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/healthz"]

Assumes the app has a /healthz route that returns 200 when healthy. Standard pattern.

Postgres:

yaml
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]

pg_isready ships with the Postgres image and is purpose-built for this.

MySQL:

yaml
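healthcheck:
  # $$ lets the container's shell expand the variable; a single $ would be
  # interpolated by Compose on the host before the check ever runs.
  test: ["CMD-SHELL", "mysqladmin ping -h localhost -p$$MYSQL_ROOT_PASSWORD"]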

Redis:

yaml
healthcheck:
  test: ["CMD", "redis-cli", "ping"]

MongoDB:

yaml
healthcheck:
  test: ["CMD", "mongosh", "--quiet", "--eval", "db.runCommand({ ping: 1 })"]

Elasticsearch:

yaml
healthcheck:
  test: ["CMD-SHELL", "curl -fs http://localhost:9200/_cluster/health || exit 1"]

depends_on: condition: service_healthy

This is the payoff. Without it, Compose's depends_on only controls start order:

yaml
services:
  web:
    depends_on:
      - db   # web starts after db... but before db is ready

Postgres takes 5-30 seconds to initialize on first start. web boots faster than that, tries to connect, fails, and crashes. With the healthy condition:

yaml
services:
  web:
    depends_on:
      db:
        condition: service_healthy
  db:
    image: postgres:17
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

Now Compose waits until the db healthcheck reports healthy before starting web. The dependency chain works the way the name suggests.

Three conditions:

  • service_started — default, container is created/started.
  • service_healthy — healthcheck reports healthy.
  • service_completed_successfully — service exits with code 0 (for one-shot init containers).

Inspect health-check status

bash
docker inspect --format '{{json .State.Health}}' web | jq .   # replace web with your container name

That shows the full health-check history — status, last 5 results, exit codes, output. Useful when a container is unhealthy and you need to see why.

docker events --filter event=health_status streams health transitions in real time.
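The same status field can drive a wait loop in a script that runs outside Compose, for example a CI step. A minimal sketch, assuming a POSIX shell and a container that defines a healthcheck. wait_healthy and get_status are hypothetical helper names, while .State.Health.Status is the field docker inspect reports.

```shell
#!/bin/sh
# Print the current health status: "starting", "healthy" or "unhealthy".
get_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Hypothetical helper: poll until the container is healthy or give up.
# Usage: wait_healthy CONTAINER MAX_TRIES [SLEEP_SECONDS]
wait_healthy() {
  name=$1; tries=$2; pause=${3:-2}
  i=0; status=unknown
  while [ "$i" -lt "$tries" ]; do
    # A missing container or missing healthcheck makes inspect fail.
    status=$(get_status "$name" 2>/dev/null) || status=unknown
    if [ "$status" = healthy ]; then
      echo "$name is healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "$name not healthy after $tries checks (last status: $status)" >&2
  return 1
}
```

For example, wait_healthy db 30 2 checks the db container up to 30 times, two seconds apart, and returns non-zero if it never reports healthy.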

Common pitfalls

  • restart: always fighting docker stop. Use unless-stopped instead.
  • No healthcheck on the dependency, so depends_on works in name only. Without one, Compose can't wait for "healthy", only "started." Add the healthcheck.
  • Healthcheck that uses curl on an image without curl. Alpine images don't ship curl by default. Use wget --spider, or install curl, or write a healthcheck in the app's own runtime (e.g., a Node script that hits its own server).
  • Healthcheck interval too aggressive. Every 1s means the check itself is a load on the container. Default 30s is fine; for slow-starting services 60s or higher.
  • No start_period. A web app that takes 20 seconds to compile a TypeScript bundle on first start will fail every health check for the first 20 seconds. Without start_period, those failures count toward retries, and with a short interval the container is marked unhealthy before it has finished starting. Set start_period to roughly the slow-start duration.
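For the image-without-curl case, one option is a tiny script that uses whichever HTTP client the image happens to have. A sketch, assuming a POSIX shell. check_url is a hypothetical name, and the wget flags used here (-q, -T, --spider) are the ones I understand to work in both GNU wget and the BusyBox build that Alpine ships, so verify them against your image.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the URL answers with a success status.
check_url() {
  url=$1
  if command -v wget >/dev/null 2>&1; then
    # --spider requests the page without saving it; -T caps the wait.
    wget -q -T 5 --spider "$url"
  elif command -v curl >/dev/null 2>&1; then
    # -f turns HTTP error statuses into a non-zero exit code.
    curl -fsS --max-time 5 -o /dev/null "$url"
  else
    echo "no HTTP client found in this image" >&2
    return 1
  fi
}
```

Saved as a script with a final line such as check_url http://localhost:3000/healthz, it can serve as the healthcheck command, since its exit status is that of the HTTP client.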


Tags: Docker, Restart, Healthcheck, Compose, depends_on, DevOps


Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years building software, Linux systems, and DevOps infrastructure, and lately working AI into the stack. Currently Chief Technology Officer at a healthcare tech startup, which is where most of these field notes come from.
