feat(health): initial stable release v0.9.0
HTTP health check handler with parallel goroutine-per-check execution, a 5 s request-derived timeout, and two-level criticality (LevelCritical → 503, LevelDegraded → 200).

What's included:

- `Checkable` interface (HealthCheck / Name / Priority) and `Level` type with LevelCritical and LevelDegraded constants
- `NewHandler(logger, checks...)` returning an http.Handler; runs all checks concurrently via a buffered channel and returns JSON with per-component status and latency
- `ComponentStatus` and `Response` types for the JSON response body

Tested-via: todo-api POC integration
Reviewed-against: docs/adr/
docs/adr/ADR-001-parallel-checks-with-timeout.md (new file)
# ADR-001: Parallel Checks with 5-Second Timeout

**Status:** Accepted
**Date:** 2026-03-18

## Context
A health endpoint must interrogate all registered components (database, cache, queue, etc.) and aggregate their results before responding to the caller. The naive approach — running checks sequentially — means the total response time is the sum of all individual check latencies. Under degraded conditions this can be several seconds, making the health endpoint itself a slow, unreliable probe.
Additionally, health checks must be bounded. A component that hangs indefinitely must not cause the health handler to hang indefinitely. There must be a hard wall-clock limit.
## Decision
All registered `Checkable` components are checked concurrently using one goroutine per check. A `context.WithTimeout` of 5 seconds is derived from the incoming request context and passed to every goroutine. Results are collected from a buffered channel sized to the number of checks; the aggregation loop blocks until all goroutines have delivered exactly one result.
The 5-second timeout is applied at the `ServeHTTP` level, not per check, so it is the ceiling for the entire health response including JSON encoding.
The request's own context is used as the parent for the timeout derivation. If the caller cancels its request before 5 seconds (e.g., a probe with a 50 ms deadline), the context cancellation propagates to all running goroutines, and the handler returns before the 5-second ceiling.
## Consequences
- **Positive**: Total response time is bounded by the slowest single check (or 5 s), not the sum of all checks. A test with three 100 ms checks completes in ~100 ms, not ~300 ms.
- **Positive**: Hanging checks do not cause the handler to hang indefinitely.
- **Positive**: Caller-side timeouts are respected via context propagation.
- **Negative**: All checks consume resources simultaneously; there is no back-pressure or concurrency limit. For large numbers of checks this could be a concern, but typical services have a small, bounded number of infrastructure components.
- **Note**: The buffered channel of size `len(h.checks)` ensures no goroutine leaks even if the aggregation loop returns early due to panic or timeout — goroutines can still write to the channel without blocking.
docs/adr/ADR-002-critical-vs-warning-levels.md (new file)
# ADR-002: Critical vs Warning Levels

**Status:** Accepted
**Date:** 2026-03-18

## Context
Not all infrastructure components are equally essential. A relational database that stores primary application state is existentially required; if it is down, the service cannot function and callers should stop sending traffic. A read-through cache or a non-essential third-party integration may be important for performance or full feature availability, but the service can still handle requests without them.
A health endpoint that returns 503 whenever any non-critical dependency is unavailable will cause load balancers and orchestrators to pull healthy service instances out of rotation unnecessarily, amplifying an outage.
Conversely, a health endpoint that always returns 200 regardless of component state provides no useful signal to the infrastructure.
## Decision
Two levels are defined as a typed integer `Level`:
- **`LevelCritical` (0)**: The component is essential. If it reports an error, the overall status is `DOWN` and the HTTP response is `503 Service Unavailable`. `LevelCritical` is the zero value of the `Level` type, so it is the default when a struct is constructed without the field being set explicitly.
- **`LevelDegraded` (1)**: The component is non-essential. If it reports an error, its per-component status is `DEGRADED` and the overall status is `DEGRADED`, but the HTTP response is `200 OK`.
Aggregation rules:
1. Start with overall status `UP` and HTTP `200`.
2. Any `DOWN` component flips overall to `DOWN` and HTTP to `503`. This state cannot be overridden by a `DEGRADED` result.
3. Any `DEGRADED` component, if the overall is still `UP`, flips it to `DEGRADED` (200 is preserved).
The per-component status strings (`UP`, `DEGRADED`, `DOWN`) are included in the JSON response regardless of level, allowing monitoring dashboards to distinguish the states of individual components.
## Consequences
- **Positive**: Infrastructure (load balancers, Kubernetes readiness probes) gets an honest `503` only when the service is genuinely non-functional.
- **Positive**: Degraded state is surfaced in the response body for observability without triggering traffic removal.
- **Positive**: Infra modules (postgres, mysql, etc.) can declare their own priority by implementing `Priority() Level` — typically `LevelCritical`.
- **Negative**: The binary two-level model does not support finer-grained priorities (e.g., "warn but do not degrade"). Additional levels can be added in future ADRs without breaking existing implementations.
docs/adr/ADR-003-checkable-interface.md (new file)
# ADR-003: Checkable Interface

**Status:** Accepted
**Date:** 2026-03-18

## Context
The health handler needs to interrogate arbitrary infrastructure components without knowing their concrete types. The options were:
1. Pass `func(ctx context.Context) error` callbacks directly.
2. Require a shared `Checkable` interface that infrastructure modules must implement.
3. Accept an external registry where components register themselves by name.
The health module also needs a way to know what to call a component in the JSON output (`name`) and how to treat its failure (`priority`). Without these pieces of metadata, every caller would have to pass them as separate arguments alongside the check function.
## Decision
A `Checkable` interface is defined in the `health` package with three methods:
```go
type Checkable interface {
	HealthCheck(ctx context.Context) error
	Name() string
	Priority() Level
}
```
Infrastructure modules (`postgres`, `mysql`, etc.) embed `health.Checkable` in their own `Component` interface and implement all three methods. The `health` package does not import any infrastructure module — the dependency flows inward only: infra → health.
`Name()` returns a stable string used as the JSON key in the `components` map. `Priority()` returns the `Level` value that governs the HTTP status code logic (ADR-002). `HealthCheck(ctx)` performs the actual probe (e.g., `pool.Ping(ctx)`).
The handler accepts `...Checkable` as a variadic parameter, so callers can register zero or more components at construction time. No dynamic registration or remove-after-register is supported.
## Consequences
- **Positive**: Infrastructure components carry their own health metadata — no out-of-band registration with name strings and level constants at the call site.
- **Positive**: Compile-time safety: if a component does not implement all three methods, the assignment `var _ health.Checkable = myComponent{}` fails.
- **Positive**: The interface is minimal (three methods) and stable; adding a fourth method would be a breaking change and should be versioned.
- **Negative**: Any new type that wants to participate in health checking must implement three methods, not just a single function. For trivial cases (one-off checks) this is more boilerplate than a bare function callback. However, the named interface is preferred because metadata (`Name`, `Priority`) cannot be forgotten.