Files
health/docs/adr/ADR-001-parallel-checks-with-timeout.md
Rene Nochebuena e1b6b7ddd7 feat(health): initial stable release v0.9.0
HTTP health check handler with parallel goroutine-per-check execution, 5 s request-derived timeout, and two-level criticality (LevelCritical → 503, LevelDegraded → 200).

What's included:
- `Checkable` interface (HealthCheck / Name / Priority) and `Level` type with LevelCritical and LevelDegraded constants
- `NewHandler(logger, checks...)` returning http.Handler; runs all checks concurrently via buffered channel, returns JSON with per-component status and latency
- `ComponentStatus` and `Response` types for the JSON response body

Tested-via: todo-api POC integration
Reviewed-against: docs/adr/
2026-03-18 14:06:17 -06:00

2.3 KiB

ADR-001: Parallel Checks with 5-Second Timeout

Status: Accepted Date: 2026-03-18

Context

A health endpoint must interrogate all registered components (database, cache, queue, etc.) and aggregate their results before responding to the caller. The naive approach — running checks sequentially — means the total response time is the sum of all individual check latencies. Under degraded conditions this can be several seconds, making the health endpoint itself a slow, unreliable probe.

Additionally, health checks must be bounded. A component that hangs indefinitely must not cause the health handler to hang indefinitely. There must be a hard wall-clock limit.

Decision

All registered Checkable components are checked concurrently using one goroutine per check. A context.WithTimeout of 5 seconds is derived from the incoming request context and passed to every goroutine. Results are collected from a buffered channel sized to the number of checks; the aggregation loop blocks until all goroutines have delivered exactly one result.

The 5-second timeout is applied at the ServeHTTP level, not per check, so it is the ceiling for the entire health response including JSON encoding.

The request's own context is used as the parent for the timeout derivation. If the caller cancels its request before 5 seconds (e.g., a probe with a 50 ms deadline), the context cancellation propagates to all running goroutines, and the handler returns before the 5-second ceiling.

Consequences

  • Positive: Total response time is bounded by the slowest single check (or 5 s), not the sum of all checks. A test with three 100 ms checks completes in ~100 ms, not ~300 ms.
  • Positive: Hanging checks do not cause the handler to hang indefinitely.
  • Positive: Caller-side timeouts are respected via context propagation.
  • Negative: All checks consume resources simultaneously; there is no back-pressure or concurrency limit. For large numbers of checks this could be a concern, but typical services have a small, bounded number of infrastructure components.
  • Note: The buffered channel of size len(h.checks) ensures no goroutine leaks even if the aggregation loop returns early due to panic or timeout — goroutines can still write to the channel without blocking.