feat(health): initial stable release v0.9.0
HTTP health check handler with parallel goroutine-per-check execution, 5 s request-derived timeout, and two-level criticality (LevelCritical → 503, LevelDegraded → 200). What's included: - `Checkable` interface (HealthCheck / Name / Priority) and `Level` type with LevelCritical and LevelDegraded constants - `NewHandler(logger, checks...)` returning http.Handler; runs all checks concurrently via buffered channel, returns JSON with per-component status and latency - `ComponentStatus` and `Response` types for the JSON response body Tested-via: todo-api POC integration Reviewed-against: docs/adr/
This commit is contained in:
26
docs/adr/ADR-001-parallel-checks-with-timeout.md
Normal file
26
docs/adr/ADR-001-parallel-checks-with-timeout.md
Normal file
@@ -0,0 +1,26 @@
|
||||
# ADR-001: Parallel Checks with 5-Second Timeout
|
||||
|
||||
**Status:** Accepted
|
||||
**Date:** 2026-03-18
|
||||
|
||||
## Context
|
||||
|
||||
A health endpoint must interrogate all registered components (database, cache, queue, etc.) and aggregate their results before responding to the caller. The naive approach — running checks sequentially — means the total response time is the sum of all individual check latencies. Under degraded conditions this can be several seconds, making the health endpoint itself a slow, unreliable probe.
|
||||
|
||||
Additionally, health checks must be bounded. A component that hangs indefinitely must not cause the health handler to hang indefinitely. There must be a hard wall-clock limit.
|
||||
|
||||
## Decision
|
||||
|
||||
All registered `Checkable` components are checked concurrently using one goroutine per check. A `context.WithTimeout` of 5 seconds is derived from the incoming request context and passed to every goroutine. Results are collected from a buffered channel sized to the number of checks; the aggregation loop blocks until all goroutines have delivered exactly one result.
|
||||
|
||||
The 5-second timeout is applied at the `ServeHTTP` level, not per check, so it is the ceiling for the entire health response including JSON encoding.
|
||||
|
||||
The request's own context is used as the parent for the timeout derivation. If the caller cancels its request before 5 seconds (e.g., a probe with a 50 ms deadline), the context cancellation propagates to all running goroutines, and the handler returns before the 5-second ceiling.
|
||||
|
||||
## Consequences
|
||||
|
||||
- **Positive**: Total response time is bounded by the slowest single check (or 5 s), not the sum of all checks. A test with three 100 ms checks completes in ~100 ms, not ~300 ms.
|
||||
- **Positive**: Hanging checks do not cause the handler to hang indefinitely.
|
||||
- **Positive**: Caller-side timeouts are respected via context propagation.
|
||||
- **Negative**: All checks consume resources simultaneously; there is no back-pressure or concurrency limit. For large numbers of checks this could be a concern, but typical services have a small, bounded number of infrastructure components.
|
||||
- **Note**: The buffered channel of size `len(h.checks)` ensures no goroutine leaks even if the aggregation loop returns early due to panic or timeout — goroutines can still write to the channel without blocking.
|
||||
Reference in New Issue
Block a user