feat(health): initial stable release v0.9.0

HTTP health check handler with parallel goroutine-per-check execution, 5 s request-derived timeout, and two-level criticality (LevelCritical → 503, LevelDegraded → 200). What's included: - `Checkable` interface (HealthCheck / Name / Priority) and `Level` type with LevelCritical and LevelDegraded constants - `NewHandler(logger, checks...)` returning http.Handler; runs all checks concurrently via buffered channel, returns JSON with per-component status and latency - `ComponentStatus` and `Response` types for the JSON response body Tested-via: todo-api POC integration Reviewed-against: docs/adr/
2026-03-18 14:06:17 -06:00
commit e1b6b7ddd7
14 changed files with 685 additions and 0 deletions
--- a/docs/adr/ADR-001-parallel-checks-with-timeout.md
+++ b/docs/adr/ADR-001-parallel-checks-with-timeout.md
@@ -0,0 +1,26 @@
+# ADR-001: Parallel Checks with 5-Second Timeout
+
+**Status:** Accepted
+**Date:** 2026-03-18
+
+## Context
+
+A health endpoint must interrogate all registered components (database, cache, queue, etc.) and aggregate their results before responding to the caller. The naive approach — running checks sequentially — means the total response time is the sum of all individual check latencies. Under degraded conditions this can be several seconds, making the health endpoint itself a slow, unreliable probe.
+
+Additionally, health checks must be bounded. A component that hangs indefinitely must not cause the health handler to hang indefinitely. There must be a hard wall-clock limit.
+
+## Decision
+
+All registered `Checkable` components are checked concurrently using one goroutine per check. A `context.WithTimeout` of 5 seconds is derived from the incoming request context and passed to every goroutine. Results are collected from a buffered channel sized to the number of checks; the aggregation loop blocks until all goroutines have delivered exactly one result.
+
+The 5-second timeout is applied at the `ServeHTTP` level, not per check, so it is the ceiling for the entire health response including JSON encoding.
+
+The request's own context is used as the parent for the timeout derivation. If the caller cancels its request before 5 seconds (e.g., a probe with a 50 ms deadline), the context cancellation propagates to all running goroutines, and the handler returns before the 5-second ceiling.
+
+## Consequences
+
+- **Positive**: Total response time is bounded by the slowest single check (or 5 s), not the sum of all checks. A test with three 100 ms checks completes in ~100 ms, not ~300 ms.
+- **Positive**: Hanging checks do not cause the handler to hang indefinitely.
+- **Positive**: Caller-side timeouts are respected via context propagation.
+- **Negative**: All checks consume resources simultaneously; there is no back-pressure or concurrency limit. For large numbers of checks this could be a concern, but typical services have a small, bounded number of infrastructure components.
+- **Note**: The buffered channel of size `len(h.checks)` ensures no goroutine leaks even if the aggregation loop returns early due to panic or timeout — goroutines can still write to the channel without blocking.