docs/adr/ADR-002-critical-vs-warning-levels.md

# ADR-002: Critical vs Warning Levels

**Status:** Accepted
**Date:** 2026-03-18

## Context

Not all infrastructure components are equally essential. A relational database that stores primary application state is a hard requirement; if it is down, the service cannot function and callers should stop sending traffic. A read-through cache or a non-essential third-party integration may be important for performance or full feature availability, but the service can still handle requests without it.

A health endpoint that returns 503 whenever any non-critical dependency is unavailable will cause load balancers and orchestrators to pull healthy service instances out of rotation unnecessarily, amplifying an outage.

Conversely, a health endpoint that always returns 200 regardless of component state provides no useful signal to the infrastructure.

## Decision

Two levels are defined as a typed integer `Level`:

- **`LevelCritical` (0)**: The component is essential. If it reports an error, its per-component status is `DOWN`, the overall status is `DOWN`, and the HTTP response is `503 Service Unavailable`. `LevelCritical` is the zero value of the `Level` type, so it is the default when a struct is constructed without explicitly setting the field.
- **`LevelDegraded` (1)**: The component is non-essential. If it reports an error, its per-component status is `DEGRADED` and the overall status is `DEGRADED`, but the HTTP response is `200 OK`.
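
A minimal sketch of how the two levels might be declared. The `Level`, `LevelCritical`, and `LevelDegraded` names come from this ADR; the `checkConfig` struct is hypothetical and exists only to illustrate the zero-value default.

```go
package main

import "fmt"

// Level expresses how essential a component is to the service.
type Level int

const (
	// LevelCritical is the zero value: a failure takes the service DOWN (503).
	LevelCritical Level = iota
	// LevelDegraded marks a non-essential component: a failure yields DEGRADED (200).
	LevelDegraded
)

// checkConfig is a hypothetical struct, not part of the package API.
type checkConfig struct {
	Name  string
	Level Level
}

func main() {
	// Level is not set, so it takes the zero value of the Level type.
	cfg := checkConfig{Name: "postgres"}
	fmt.Println(cfg.Level == LevelCritical) // prints "true"
}
```

Making the stricter level the zero value means a component whose level was never set is treated as critical rather than silently ignored.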

Aggregation rules:
1. Start with overall status `UP` and HTTP `200`.
2. Any `DOWN` component flips overall to `DOWN` and HTTP to `503`. This state cannot be overridden by a `DEGRADED` result.
3. Any `DEGRADED` component, if the overall is still `UP`, flips it to `DEGRADED` (200 is preserved).
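
The three rules above can be sketched as a single fold over the per-component statuses. The `aggregate` helper and the status constants are hypothetical names chosen for illustration; the actual handler may structure this differently.

```go
package main

import (
	"fmt"
	"net/http"
)

// Status strings as named in this ADR; the constant names are assumptions.
const (
	StatusUp       = "UP"
	StatusDegraded = "DEGRADED"
	StatusDown     = "DOWN"
)

// aggregate is a hypothetical helper that folds per-component statuses
// into an overall status and an HTTP status code.
func aggregate(components []string) (string, int) {
	// Rule 1: start with UP and 200.
	overall, code := StatusUp, http.StatusOK
	for _, s := range components {
		switch s {
		case StatusDown:
			// Rule 2: DOWN always wins and maps to 503.
			overall, code = StatusDown, http.StatusServiceUnavailable
		case StatusDegraded:
			// Rule 3: DEGRADED only applies while the overall is still UP.
			if overall == StatusUp {
				overall = StatusDegraded
			}
		}
	}
	return overall, code
}

func main() {
	status, code := aggregate([]string{StatusUp, StatusDegraded})
	fmt.Println(status, code) // DEGRADED 200

	status, code = aggregate([]string{StatusDown, StatusDegraded})
	fmt.Println(status, code) // DOWN 503
}
```

Because rule 3 is guarded by the `overall == StatusUp` check, the result does not depend on the order in which check results arrive.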

The per-component status strings (`UP`, `DEGRADED`, `DOWN`) are included in the JSON response regardless of level, allowing monitoring dashboards to distinguish the state of individual components.
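
As an illustration of what such a response body could look like, the sketch below encodes one healthy and one degraded component. The field names, JSON keys, and the `encode` helper are assumptions made for this example; this ADR does not specify the schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ComponentStatus carries the status of one component.
// The field and its JSON key are assumptions.
type ComponentStatus struct {
	Status string `json:"status"`
}

// Response is the overall body. The layout is an assumption.
type Response struct {
	Status     string                     `json:"status"`
	Components map[string]ComponentStatus `json:"components"`
}

// encode is a hypothetical helper that renders a Response as JSON text.
func encode(r Response) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	body, err := encode(Response{
		Status: "DEGRADED",
		Components: map[string]ComponentStatus{
			"postgres": {Status: "UP"},
			"redis":    {Status: "DEGRADED"},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

A dashboard reading this body can tell that the cache is the degraded part even though the endpoint answered `200 OK`.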

## Consequences

- **Positive**: Infrastructure (load balancers, Kubernetes readiness probes) gets an honest `503` only when the service is genuinely non-functional.
- **Positive**: Degraded state is surfaced in the response body for observability without triggering traffic removal.
- **Positive**: Infra modules (postgres, mysql, etc.) can declare their own priority by implementing `Priority() Level`, typically returning `LevelCritical`.
- **Negative**: The two-level model does not support finer-grained priorities (e.g., "warn but do not degrade"). Additional levels can be added in future ADRs without breaking existing implementations.

feat(health): initial stable release v0.9.0

HTTP health check handler with parallel goroutine-per-check execution, 5 s request-derived timeout, and two-level criticality (LevelCritical → 503, LevelDegraded → 200).

What's included:

- `Checkable` interface (HealthCheck / Name / Priority) and `Level` type with LevelCritical and LevelDegraded constants
- `NewHandler(logger, checks...)` returning http.Handler; runs all checks concurrently via buffered channel, returns JSON with per-component status and latency
- `ComponentStatus` and `Response` types for the JSON response body

Tested-via: todo-api POC integration
Reviewed-against: docs/adr/

2026-03-18 14:06:17 -06:00
