health/docs/adr/ADR-002-critical-vs-warning-levels.md
Rene Nochebuena e1b6b7ddd7 feat(health): initial stable release v0.9.0
HTTP health check handler with parallel goroutine-per-check execution, 5 s request-derived timeout, and two-level criticality (LevelCritical → 503, LevelDegraded → 200).

What's included:
- `Checkable` interface (HealthCheck / Name / Priority) and `Level` type with LevelCritical and LevelDegraded constants
- `NewHandler(logger, checks...)` returning http.Handler; runs all checks concurrently via buffered channel, returns JSON with per-component status and latency
- `ComponentStatus` and `Response` types for the JSON response body

Tested-via: todo-api POC integration
Reviewed-against: docs/adr/
2026-03-18 14:06:17 -06:00

ADR-002: Critical vs Warning Levels

Status: Accepted

Date: 2026-03-18

Context

Not all infrastructure components are equally essential. A relational database that stores primary application state is existentially required; if it is down, the service cannot function and callers should stop sending traffic. A read-through cache or a non-essential third-party integration may be important for performance or full feature availability, but the service can still handle requests without them.

A health endpoint that returns 503 whenever any non-critical dependency is unavailable will cause load balancers and orchestrators to pull healthy service instances out of rotation unnecessarily, amplifying an outage.

Conversely, a health endpoint that always returns 200 regardless of component state provides no useful signal to the infrastructure.

Decision

Two levels are defined as a typed integer Level:

  • LevelCritical (0): The component is essential. If it reports an error, its per-component status is DOWN, the overall status is DOWN, and the HTTP response is 503 Service Unavailable. LevelCritical is the zero value of the Level type, so it is the default when a struct is constructed without explicitly setting the field.
  • LevelDegraded (1): The component is non-essential. If it reports an error, its per-component status is DEGRADED and the overall status is DEGRADED, but the HTTP response is 200 OK.
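
The two levels could be declared roughly as follows. This is a minimal sketch: the underlying type `int` and the use of `iota` are assumptions, and `registration` is a hypothetical struct that exists only to show the zero-value default.

```go
package main

// Level expresses how essential a component is to the service.
// The underlying type int is an assumption; the ADR only says "typed integer".
type Level int

const (
	// LevelCritical is the zero value: a failure marks the service DOWN (HTTP 503).
	LevelCritical Level = iota
	// LevelDegraded: a failure marks the service DEGRADED (HTTP 200).
	LevelDegraded
)

// registration is a hypothetical struct, not part of the package,
// used here to show that an unset Level field defaults to LevelCritical.
type registration struct {
	Name     string
	Priority Level
}
```

Because `LevelCritical` is the zero value, `registration{Name: "postgres"}` is treated as critical. A component that forgets to set its level therefore fails safe: it is treated as essential rather than silently ignored.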

Aggregation rules:

  1. Start with overall status UP and HTTP 200.
  2. Any DOWN component flips overall to DOWN and HTTP to 503. This state cannot be overridden by a DEGRADED result.
  3. Any DEGRADED component, if the overall is still UP, flips it to DEGRADED (200 is preserved).
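
The aggregation rules above can be sketched as a single pass over the check results. The `result` type, the `aggregate` function, and the `errUnavailable` sentinel are hypothetical names chosen for illustration; the actual handler may structure this differently.

```go
package main

import (
	"errors"
	"net/http"
)

type Level int

const (
	LevelCritical Level = iota
	LevelDegraded
)

// errUnavailable is a hypothetical sentinel error used in examples.
var errUnavailable = errors.New("component unavailable")

// result is a hypothetical record of one finished check.
type result struct {
	level Level
	err   error
}

// aggregate applies the three rules: start at UP/200, any failed critical
// check forces DOWN/503, and a failed degraded check moves UP to DEGRADED
// without changing the HTTP status or overriding DOWN.
func aggregate(results []result) (string, int) {
	overall, code := "UP", http.StatusOK
	for _, r := range results {
		if r.err == nil {
			continue // healthy component, nothing to change
		}
		switch r.level {
		case LevelDegraded:
			if overall == "UP" {
				overall = "DEGRADED" // HTTP 200 is preserved
			}
		default: // LevelCritical
			overall, code = "DOWN", http.StatusServiceUnavailable
		}
	}
	return overall, code
}
```

The outcome does not depend on the order in which checks finish, which matters when they run concurrently: once the overall status is DOWN, a later DEGRADED result leaves it unchanged.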

The per-component status strings (UP, DEGRADED, DOWN) are included in the JSON response regardless of level, allowing monitoring dashboards to distinguish the state of individual components.
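
To illustrate what such a response body might look like, the sketch below encodes an overall status together with per-component entries. The type names `ComponentStatus` and `Response` come from the package, but every field name, JSON key, and the string representation of latency shown here are assumptions, as is the `render` helper.

```go
package main

import "encoding/json"

// ComponentStatus holds one component's result.
// Field names and JSON keys are assumed, not taken from the package.
type ComponentStatus struct {
	Status  string `json:"status"`            // "UP", "DEGRADED", or "DOWN"
	Latency string `json:"latency,omitempty"` // assumed human-readable duration
}

// Response is the assumed top-level shape of the JSON body.
type Response struct {
	Status     string                     `json:"status"`
	Components map[string]ComponentStatus `json:"components"`
}

// render is a hypothetical helper that encodes a Response as a JSON string.
func render(r Response) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

With this shape, a dashboard can read the top-level `status` for the aggregate view and the entries under `components` to see which dependency is responsible.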

Consequences

  • Positive: Infrastructure (load balancers, Kubernetes readiness probes) gets an honest 503 only when the service is genuinely non-functional.
  • Positive: Degraded state is surfaced in the response body for observability without triggering traffic removal.
  • Positive: Infra modules (postgres, mysql, etc.) can declare their own priority by implementing Priority() Level — typically LevelCritical.
  • Negative: The two-level model does not support finer-grained priorities (e.g., "warn but do not degrade"). Additional levels can be added in future ADRs without breaking existing implementations.