feat(health): initial stable release v0.9.0

HTTP health check handler with parallel goroutine-per-check execution, 5 s request-derived timeout, and two-level criticality (LevelCritical → 503, LevelDegraded → 200).

What's included:
- `Checkable` interface (HealthCheck / Name / Priority) and `Level` type with LevelCritical and LevelDegraded constants
- `NewHandler(logger, checks...)` returning http.Handler; runs all checks concurrently via buffered channel, returns JSON with per-component status and latency
- `ComponentStatus` and `Response` types for the JSON response body

Tested-via: todo-api POC integration
Reviewed-against: docs/adr/
2026-03-18 14:06:17 -06:00
commit e1b6b7ddd7
14 changed files with 685 additions and 0 deletions


@@ -0,0 +1,26 @@
# ADR-001: Parallel Checks with 5-Second Timeout
**Status:** Accepted
**Date:** 2026-03-18
## Context
A health endpoint must interrogate all registered components (database, cache, queue, etc.) and aggregate their results before responding to the caller. The naive approach — running checks sequentially — means the total response time is the sum of all individual check latencies. Under degraded conditions this can be several seconds, making the health endpoint itself a slow, unreliable probe.
Additionally, health checks must be bounded. A component that hangs indefinitely must not cause the health handler to hang indefinitely. There must be a hard wall-clock limit.
## Decision
All registered `Checkable` components are checked concurrently using one goroutine per check. A `context.WithTimeout` of 5 seconds is derived from the incoming request context and passed to every goroutine. Results are collected from a buffered channel sized to the number of checks; the aggregation loop blocks until all goroutines have delivered exactly one result.
The 5-second timeout is applied at the `ServeHTTP` level, not per check, so it is the ceiling for the entire health response including JSON encoding.
The request's own context is used as the parent for the timeout derivation. If the caller cancels its request before 5 seconds (e.g., a probe with a 50 ms deadline), the context cancellation propagates to all running goroutines, and the handler returns before the 5-second ceiling.
## Consequences
- **Positive**: Total response time is bounded by the slowest single check (or 5 s), not the sum of all checks. A test with three 100 ms checks completes in ~100 ms, not ~300 ms.
- **Positive**: Hanging checks do not cause the handler to hang indefinitely.
- **Positive**: Caller-side timeouts are respected via context propagation.
- **Negative**: All checks consume resources simultaneously; there is no back-pressure or concurrency limit. For large numbers of checks this could be a concern, but typical services have a small, bounded number of infrastructure components.
- **Note**: The buffered channel of size `len(h.checks)` ensures no goroutine leaks even if the aggregation loop returns early due to panic or timeout — goroutines can still write to the channel without blocking.


@@ -0,0 +1,33 @@
# ADR-002: Critical vs Warning Levels
**Status:** Accepted
**Date:** 2026-03-18
## Context
Not all infrastructure components are equally essential. A relational database that stores primary application state is existentially required; if it is down, the service cannot function and callers should stop sending traffic. A read-through cache or a non-essential third-party integration may be important for performance or full feature availability, but the service can still handle requests without them.
A health endpoint that returns 503 whenever any non-critical dependency is unavailable will cause load balancers and orchestrators to pull healthy service instances out of rotation unnecessarily, amplifying an outage.
Conversely, a health endpoint that always returns 200 regardless of component state provides no useful signal to the infrastructure.
## Decision
Two levels are defined as a typed integer `Level`:
- **`LevelCritical` (0)**: The component is essential. If it reports an error, the overall status is `DOWN` and the HTTP response is `503 Service Unavailable`. `LevelCritical` is the zero value of the `Level` type, so it is the default when a struct is constructed without explicitly setting the field.
- **`LevelDegraded` (1)**: The component is non-essential. If it reports an error, its per-component status is `DEGRADED` and the overall status is `DEGRADED`, but the HTTP response is `200 OK`.
Aggregation rules:
1. Start with overall status `UP` and HTTP `200`.
2. Any `DOWN` component flips overall to `DOWN` and HTTP to `503`. This state cannot be overridden by a `DEGRADED` result.
3. Any `DEGRADED` component, if the overall is still `UP`, flips it to `DEGRADED` (200 is preserved).
The per-component status strings (`UP`, `DEGRADED`, `DOWN`) are included in the JSON response regardless of level, allowing monitoring dashboards to distinguish the states of individual components.
## Consequences
- **Positive**: Infrastructure (load balancers, Kubernetes readiness probes) gets an honest `503` only when the service is genuinely non-functional.
- **Positive**: Degraded state is surfaced in the response body for observability without triggering traffic removal.
- **Positive**: Infra modules (postgres, mysql, etc.) can declare their own priority by implementing `Priority() Level` — typically `LevelCritical`.
- **Negative**: The two-level model does not support finer-grained priorities (e.g., "warn but do not degrade"). Additional levels can be added in future ADRs without breaking existing implementations.


@@ -0,0 +1,39 @@
# ADR-003: Checkable Interface
**Status:** Accepted
**Date:** 2026-03-18
## Context
The health handler needs to interrogate arbitrary infrastructure components without knowing their concrete types. The options were:
1. Pass `func(ctx context.Context) error` callbacks directly.
2. Require a shared `Checkable` interface that infrastructure modules must implement.
3. Accept an external registry where components register themselves by name.
The health module also needs a way to know what to call a component in the JSON output (`name`) and how to treat its failure (`priority`). Without these pieces of metadata, every caller would have to pass them as separate arguments alongside the check function.
## Decision
A `Checkable` interface is defined in the `health` package with three methods:
```go
type Checkable interface {
	HealthCheck(ctx context.Context) error
	Name() string
	Priority() Level
}
```
Infrastructure modules (`postgres`, `mysql`, etc.) embed `health.Checkable` in their own `Component` interface and implement all three methods. The `health` package does not import any infrastructure module — the dependency flows inward only: infra → health.
`Name()` returns a stable string used as the JSON key in the `components` map. `Priority()` returns the `Level` value that governs the HTTP status code logic (ADR-002). `HealthCheck(ctx)` performs the actual probe (e.g., `pool.Ping(ctx)`).
The handler accepts `...Checkable` as a variadic parameter, so callers can register zero or more components at construction time. No dynamic registration or remove-after-register is supported.
## Consequences
- **Positive**: Infrastructure components carry their own health metadata — no out-of-band registration with name strings and level constants at the call site.
- **Positive**: Compile-time safety: if a component does not implement all three methods, the guard declaration `var _ health.Checkable = myComponent{}` fails to compile.
- **Positive**: The interface is minimal (three methods) and stable; adding a fourth method would be a breaking change and should be versioned.
- **Negative**: Any new type that wants to participate in health checking must implement three methods, not just a single function. For trivial cases (one-off checks) this is more boilerplate than a bare function callback. However, the named interface is preferred because metadata (`Name`, `Priority`) cannot be forgotten.