TechEarl

Docker Restart Policies and Health Checks

Make containers come back automatically after crashes and reboots, and tell Compose how to wait until a service is actually ready (not just started). Restart policies, HEALTHCHECK, and depends_on: condition: service_healthy.

Ishan Karunaratne · 7 min read

Two related but distinct features: restart policies say what happens when a container exits (crashes, host reboots), and health checks say whether the process inside the container is actually working. Pair them with Compose's depends_on: condition: service_healthy and your stack starts in the right order with the right resilience.

Restart policies

Set with --restart on docker run, or restart: in Compose. Four values:

Policy                      Behavior
no (default)                Never restart. The container exits and stays exited.
on-failure[:max-retries]    Restart only on a non-zero exit code. The optional count caps the number of attempts.
always                      Always restart, including after a host reboot. A container you stopped manually also comes back the next time the daemon starts.
unless-stopped              Always restart, except when you explicitly stopped it. Same as always minus the surprise.

bash
docker run -d --restart unless-stopped --name web nginx:alpine
yaml
services:
  web:
    image: nginx:alpine
    restart: unless-stopped

unless-stopped is the right default for almost every long-running service. It restarts on crashes and host reboots (what you want), but stays stopped when you docker stop it explicitly (also what you want — you stopped it on purpose).

always differs in exactly one place: after a docker stop, the next time the daemon starts (e.g., after a host reboot), always starts the container even though you'd stopped it. Surprising. Use unless-stopped instead.
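
If a container is already running with always, the policy can be changed in place instead of recreating the container. A minimal sketch, assuming a container named web as in the example above; migrate_restart_policy is a hypothetical helper name, not a Docker command:

```bash
#!/usr/bin/env bash
# Hypothetical helper: switch a container from "always" to "unless-stopped".

migrate_restart_policy() {
  local name="$1" current
  # .HostConfig.RestartPolicy.Name holds the policy currently set on the container.
  current="$(docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$name")" || return 1
  if [ "$current" = "always" ]; then
    # docker update changes the restart policy without recreating the container.
    docker update --restart unless-stopped "$name" >/dev/null || return 1
    echo "$name: always -> unless-stopped"
  else
    echo "$name: keeping $current"
  fi
}

# Usage: migrate_restart_policy web
```

In Compose the equivalent is editing restart: in the file and running docker compose up -d again.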

What about exit codes?

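on-failure restarts on any non-zero exit code, unless you stopped the container yourself. The conventions:

  • Exit 0: normal, intentional exit. on-failure doesn't restart.
  • Exit 1-125: application error by convention. on-failure restarts. (Shells reserve 126 and 127 for "cannot execute" and "command not found".)
  • Exit 137: 128 + 9, meaning the process was killed with SIGKILL. Most often that is the OOM killer after the container ran out of memory. on-failure restarts, and the same OOM will probably happen again, so fix the memory limit.
  • Exit 143: 128 + 15, meaning the process was terminated by SIGTERM, usually from docker stop. Nothing restarts after a docker stop, because Docker ignores the restart policy for containers you stopped manually. A SIGTERM from anywhere else is a non-zero exit like any other, and on-failure restarts.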
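
Exit codes above 128 follow the shell convention of 128 plus the signal number, so they can be decoded mechanically. A small sketch of that arithmetic; describe_exit is a hypothetical helper name used only for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate a container exit code into a short description.

describe_exit() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    # 128 + N means the process died from signal N; kill -l maps N to its name.
    echo "killed by SIG$(kill -l "$((code - 128))")"
  else
    echo "error exit"
  fi
}

# describe_exit 137   -> killed by SIGKILL
# describe_exit 143   -> killed by SIGTERM
```

The exit code alone cannot tell an OOM kill from any other SIGKILL. docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' web prints the code together with a true/false flag for the OOM case (web is the container name from the earlier example).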

on-failure:5 retries up to 5 times before giving up. Useful when an app crashes on bad input and you want it to try a few times but not loop forever.

For services that should simply stay running, use unless-stopped. For workers that should give up after repeated failures, use on-failure:N.

Health checks

A health check tells Docker "is this container actually doing its job?" — separate from "is the process alive?" A web server with the process running but returning 500 to every request is alive, but unhealthy.

In a Dockerfile:

dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

  • --interval=30s: check every 30 seconds.
  • --timeout=5s: a check that runs longer than 5 seconds counts as a failure.
  • --start-period=10s: grace period after start during which failures don't count toward retries.
  • --retries=3: flip to "unhealthy" only after this many consecutive failures.
  • CMD ...: the command. Exit 0 = healthy, 1 = unhealthy. Exit code 2 is reserved, which is why the example maps every curl failure to exit 1.
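
These numbers combine into a rough detection time. Since each check starts one interval after the previous one finishes, and a hanging check can use up its full timeout, a service that dies just after a passing check is flagged after about retries × (interval + timeout) seconds. A back-of-the-envelope sketch, with a hypothetical function name and all values in whole seconds:

```bash
#!/usr/bin/env bash
# Rough upper bound, not an exact model of Docker's scheduler.

worst_case_detection_seconds() {
  local interval="$1" timeout="$2" retries="$3"
  echo $(( retries * (interval + timeout) ))
}

# Settings from the Dockerfile example: interval 30s, timeout 5s, retries 3.
# worst_case_detection_seconds 30 5 3   -> 105
```

If nearly two minutes is too slow for a given service, shortening the interval usually helps more than lowering retries, because retries is what filters out one-off blips.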

In Compose:

yaml
services:
  web:
    image: my-app
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/healthz"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Two test forms:

  • ["CMD", ...] — exec form, runs directly without a shell.
  • ["CMD-SHELL", "..."] — shell form, supports pipes and redirects.

docker ps then shows the status next to each container:

code
CONTAINER ID   IMAGE      ...   STATUS                       NAMES
abc123def      my-app     ...   Up 5 minutes (healthy)       web
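
The same status is available to scripts through docker inspect, which helps when a deploy step outside Compose has to block until a container is ready. A minimal sketch; wait_healthy is a hypothetical helper, and the one-second polling step and 60-second default are arbitrary choices:

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a container's health status until it settles.
# Returns 0 when healthy, 1 when unhealthy, 2 on timeout.

wait_healthy() {
  local name="$1" timeout="${2:-60}" waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    # .State.Health.Status is "starting", "healthy" or "unhealthy".
    status="$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)"
    case "$status" in
      healthy)   return 0 ;;
      unhealthy) echo "$name is unhealthy" >&2; return 1 ;;
    esac
    sleep 1
    waited=$((waited + 1))
  done
  echo "timed out waiting for $name" >&2
  return 2
}

# Usage: wait_healthy web 120 && echo "web is ready"
```

The template only resolves for containers that have a health check configured. Without one the inspect call fails, the status stays empty, and the helper simply runs into its timeout.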

Healthcheck patterns by service type

Web app:

yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/healthz"]

Assumes the app has a /healthz route that returns 200 when healthy. Standard pattern.

Postgres:

yaml
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]

pg_isready ships with the Postgres image and is purpose-built for this.

MySQL:

yaml
healthcheck:
  # $$ escapes Compose interpolation, so the container's shell expands the variable at check time
  test: ["CMD-SHELL", "mysqladmin ping -h localhost -p\"$$MYSQL_ROOT_PASSWORD\""]

Redis:

yaml
healthcheck:
  test: ["CMD", "redis-cli", "ping"]

MongoDB:

yaml
healthcheck:
  test: ["CMD", "mongosh", "--quiet", "--eval", "db.runCommand({ ping: 1 })"]

Elasticsearch:

yaml
healthcheck:
  test: ["CMD-SHELL", "curl -fs http://localhost:9200/_cluster/health || exit 1"]

depends_on: condition: service_healthy

This is the payoff. Without it, Compose's depends_on only controls start order:

yaml
services:
  web:
    depends_on:
      - db   # web starts after db... but before db is ready

Postgres can take anywhere from a few seconds to half a minute to initialize on first start. If web boots faster than that, it tries to connect, fails, and crashes. With the healthy condition:

yaml
services:
  web:
    depends_on:
      db:
        condition: service_healthy
  db:
    image: postgres:17
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

Now Compose waits until the db healthcheck reports healthy before starting web. The dependency chain works the way the name suggests.

Three conditions:

  • service_started — default, container is created/started.
  • service_healthy — healthcheck reports healthy.
  • service_completed_successfully — service exits with code 0 (for one-shot init containers).
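
Scripts that drive Compose from outside (CI jobs, deploy hooks) can use the same health information through the --wait flag of docker compose up, which blocks until services are running or healthy. A minimal sketch; start_stack is a hypothetical wrapper name:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: bring the stack up and fail loudly if it never becomes ready.

start_stack() {
  # --wait returns once services are running/healthy and exits non-zero otherwise.
  if docker compose up -d --wait; then
    echo "stack is ready"
  else
    echo "stack failed to become healthy" >&2
    # Show per-service state to make the failing service easy to spot.
    docker compose ps
    return 1
  fi
}

# Usage: start_stack && ./run-migrations.sh   (run-migrations.sh is a placeholder)
```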

Inspect health-check status

bash
# Replace web with the name of your container.
docker inspect --format '{{json .State.Health}}' web | jq

That shows the full health-check history — status, last 5 results, exit codes, output. Useful when a container is unhealthy and you need to see why.

docker events --filter event=health_status streams health transitions in real time.
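
When several containers are involved, docker ps can filter on health, and a Go template over the health log avoids the jq dependency. A rough sketch; report_unhealthy is a hypothetical helper, and it assumes container names without spaces:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the recorded check results for every unhealthy container.

report_unhealthy() {
  local name found=0
  # --filter health=unhealthy limits the list to containers failing their checks.
  for name in $(docker ps --filter health=unhealthy --format '{{.Names}}'); do
    found=1
    echo "== $name =="
    # Each log entry records the check's exit code and captured output.
    docker inspect --format \
      '{{range .State.Health.Log}}exit {{.ExitCode}}: {{.Output}}{{end}}' "$name"
  done
  [ "$found" -eq 0 ] && echo "no unhealthy containers"
  return 0
}

# Usage: report_unhealthy
```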

Common pitfalls

  • restart: always fighting docker stop. Use unless-stopped instead.
  • depends_on with no healthcheck on the dependency. Without a healthcheck, Compose has nothing to wait on beyond "started", and condition: service_healthy cannot be satisfied. Add the healthcheck.
  • Healthcheck that uses curl on an image without curl. Alpine images don't ship curl by default. Use wget --spider, or install curl, or write a healthcheck in the app's own runtime (e.g., a Node script that hits its own server).
  • Healthcheck interval too aggressive. Every 1s means the check itself is a load on the container. Default 30s is fine; for slow-starting services 60s or higher.
  • No start_period. A web app that takes 20 seconds to compile a TypeScript bundle on first start fails every health check during those 20 seconds. Without start_period (the default is 0s), those failures count toward retries, and with a short interval the container is marked unhealthy before it has finished starting. Set start_period to roughly the slow-start duration.
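
For the missing-curl pitfall, one option is a small check script that uses whichever HTTP client the image happens to have. A minimal sketch, assuming the script is copied into the image and that the app serves the /healthz route used throughout this article; http_ok is a hypothetical function name:

```bash
#!/bin/sh
# Hypothetical health-check helper: succeed only if the URL answers without an HTTP error.

http_ok() {
  local url="$1"
  if command -v curl >/dev/null 2>&1; then
    # -f turns HTTP errors into a non-zero exit; the response body is discarded.
    curl -fsS -o /dev/null "$url"
  elif command -v wget >/dev/null 2>&1; then
    # --spider checks the URL without downloading it (also available in BusyBox wget).
    wget -q --spider "$url"
  else
    echo "neither curl nor wget found" >&2
    return 1
  fi
}

# Usage inside the check script: http_ok http://localhost:3000/healthz || exit 1
```

Keeping the final || exit 1 in the caller folds every failure into exit code 1, the value Docker treats as unhealthy.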


Ishan Karunaratne

Software Systems Architect · Senior Software Engineer · Engineering Leadership

Software systems architect and senior software engineer with more than two decades designing, building, and running production software, Linux systems, and DevOps infrastructure, and lately working AI into the stack. Now a CTO, though what I write here is drawn from the full arc of that work, across architecture, engineering, and operations, not any single job.
