feat(health): initial stable release v0.9.0

HTTP health check handler with parallel goroutine-per-check execution, 5 s request-derived timeout, and two-level criticality (LevelCritical → 503, LevelDegraded → 200).

What's included:
- `Checkable` interface (HealthCheck / Name / Priority) and `Level` type with LevelCritical and LevelDegraded constants
- `NewHandler(logger, checks...)` returning http.Handler; runs all checks concurrently via buffered channel, returns JSON with per-component status and latency
- `ComponentStatus` and `Response` types for the JSON response body

Tested-via: todo-api POC integration
Reviewed-against: docs/adr/
2026-03-18 14:06:17 -06:00
commit e1b6b7ddd7
14 changed files with 685 additions and 0 deletions


@@ -0,0 +1,26 @@
# ADR-001: Parallel Checks with 5-Second Timeout
**Status:** Accepted
**Date:** 2026-03-18
## Context
A health endpoint must interrogate all registered components (database, cache, queue, etc.) and aggregate their results before responding to the caller. The naive approach — running checks sequentially — means the total response time is the sum of all individual check latencies. Under degraded conditions this can be several seconds, making the health endpoint itself a slow, unreliable probe.
Additionally, health checks must be bounded. A component that hangs indefinitely must not cause the health handler to hang indefinitely. There must be a hard wall-clock limit.
## Decision
All registered `Checkable` components are checked concurrently using one goroutine per check. A `context.WithTimeout` of 5 seconds is derived from the incoming request context and passed to every goroutine. Results are collected from a buffered channel sized to the number of checks; the aggregation loop blocks until all goroutines have delivered exactly one result.
The 5-second timeout is applied at the `ServeHTTP` level, not per check, so it is the ceiling for the entire health response including JSON encoding.
The request's own context is used as the parent for the timeout derivation. If the caller cancels its request before 5 seconds (e.g., a probe with a 50 ms deadline), the context cancellation propagates to all running goroutines, and the handler returns before the 5-second ceiling.
## Consequences
- **Positive**: Total response time is bounded by the slowest single check (or 5 s), not the sum of all checks. A test with three 100 ms checks completes in ~100 ms, not ~300 ms.
- **Positive**: Hanging checks do not cause the handler to hang indefinitely.
- **Positive**: Caller-side timeouts are respected via context propagation.
- **Negative**: All checks consume resources simultaneously; there is no back-pressure or concurrency limit. For large numbers of checks this could be a concern, but typical services have a small, bounded number of infrastructure components.
- **Note**: The buffered channel of size `len(h.checks)` ensures no goroutine leaks even if the aggregation loop returns early due to panic or timeout — goroutines can still write to the channel without blocking.


@@ -0,0 +1,33 @@
# ADR-002: Critical vs Warning Levels
**Status:** Accepted
**Date:** 2026-03-18
## Context
Not all infrastructure components are equally essential. A relational database that stores primary application state is existentially required; if it is down, the service cannot function and callers should stop sending traffic. A read-through cache or a non-essential third-party integration may be important for performance or full feature availability, but the service can still handle requests without them.
A health endpoint that returns 503 whenever any non-critical dependency is unavailable will cause load balancers and orchestrators to pull healthy service instances out of rotation unnecessarily, amplifying an outage.
Conversely, a health endpoint that always returns 200 regardless of component state provides no useful signal to the infrastructure.
## Decision
Two levels are defined as a typed integer `Level`:
- **`LevelCritical` (0)**: The component is essential. If it reports an error, the overall status is `DOWN` and the HTTP response is `503 Service Unavailable`. `LevelCritical` is the zero value of the `Level` type, so it is the default when a struct is constructed without explicitly setting the field.
- **`LevelDegraded` (1)**: The component is non-essential. If it reports an error, its per-component status is `DEGRADED` and the overall status is `DEGRADED`, but the HTTP response is `200 OK`.
Aggregation rules:
1. Start with overall status `UP` and HTTP `200`.
2. Any `DOWN` component flips overall to `DOWN` and HTTP to `503`. This state cannot be overridden by a `DEGRADED` result.
3. Any `DEGRADED` component, if the overall is still `UP`, flips it to `DEGRADED` (200 is preserved).
The per-component status strings (`UP`, `DEGRADED`, `DOWN`) are included in the JSON response regardless of level, allowing monitoring dashboards to distinguish the states of individual components.
## Consequences
- **Positive**: Infrastructure (load balancers, Kubernetes readiness probes) gets an honest `503` only when the service is genuinely non-functional.
- **Positive**: Degraded state is surfaced in the response body for observability without triggering traffic removal.
- **Positive**: Infra modules (postgres, mysql, etc.) can declare their own priority by implementing `Priority() Level` — typically `LevelCritical`.
- **Negative**: The two-level model does not support finer-grained priorities (e.g., "warn but do not degrade"). Additional levels can be added in future ADRs without breaking existing implementations.


@@ -0,0 +1,39 @@
# ADR-003: Checkable Interface
**Status:** Accepted
**Date:** 2026-03-18
## Context
The health handler needs to interrogate arbitrary infrastructure components without knowing their concrete types. The options were:
1. Pass `func(ctx context.Context) error` callbacks directly.
2. Require a shared `Checkable` interface that infrastructure modules must implement.
3. Accept an external registry where components register themselves by name.
The health module also needs a way to know what to call a component in the JSON output (`name`) and how to treat its failure (`priority`). Without these pieces of metadata, every caller would have to pass them as separate arguments alongside the check function.
## Decision
A `Checkable` interface is defined in the `health` package with three methods:
```go
type Checkable interface {
	HealthCheck(ctx context.Context) error
	Name() string
	Priority() Level
}
```
Infrastructure modules (`postgres`, `mysql`, etc.) embed `health.Checkable` in their own `Component` interface and implement all three methods. The `health` package does not import any infrastructure module — the dependency flows inward only: infra → health.
`Name()` returns a stable string used as the JSON key in the `components` map. `Priority()` returns the `Level` value that governs the HTTP status code logic (ADR-002). `HealthCheck(ctx)` performs the actual probe (e.g., `pool.Ping(ctx)`).
The handler accepts `...Checkable` as a variadic parameter, so callers can register zero or more components at construction time. No dynamic registration or remove-after-register is supported.
## Consequences
- **Positive**: Infrastructure components carry their own health metadata — no out-of-band registration with name strings and level constants at the call site.
- **Positive**: Compile-time safety: if a component does not implement all three methods, the guard declaration `var _ health.Checkable = myComponent{}` fails to compile.
- **Positive**: The interface is minimal (three methods) and stable; adding a fourth method would be a breaking change and should be versioned.
- **Negative**: Any new type that wants to participate in health checking must implement three methods, not just a single function. For trivial cases (one-off checks) this is more boilerplate than a bare function callback. However, the named interface is preferred because metadata (`Name`, `Priority`) cannot be forgotten.