27 lines
2.3 KiB
Markdown
27 lines
2.3 KiB
Markdown
|
|
# ADR-001: Parallel Checks with 5-Second Timeout
|
||
|
|
|
||
|
|
**Status:** Accepted
|
||
|
|
**Date:** 2026-03-18
|
||
|
|
|
||
|
|
## Context
|
||
|
|
|
||
|
|
A health endpoint must interrogate all registered components (database, cache, queue, etc.) and aggregate their results before responding to the caller. The naive approach — running checks sequentially — means the total response time is the sum of all individual check latencies. Under degraded conditions this can be several seconds, making the health endpoint itself a slow, unreliable probe.
|
||
|
|
|
||
|
|
Additionally, health checks must be bounded. A component that hangs indefinitely must not cause the health handler to hang indefinitely. There must be a hard wall-clock limit.
|
||
|
|
|
||
|
|
## Decision
|
||
|
|
|
||
|
|
All registered `Checkable` components are checked concurrently using one goroutine per check. A `context.WithTimeout` of 5 seconds is derived from the incoming request context and passed to every goroutine. Results are collected from a buffered channel sized to the number of checks; the aggregation loop blocks until all goroutines have delivered exactly one result.
|
||
|
|
|
||
|
|
The 5-second timeout is applied at the `ServeHTTP` level, not per check, so it is the ceiling for the entire health response including JSON encoding.
|
||
|
|
|
||
|
|
The request's own context is used as the parent for the timeout derivation. If the caller cancels its request before 5 seconds (e.g., a probe with a 50 ms deadline), the context cancellation propagates to all running goroutines, and the handler returns before the 5-second ceiling.
|
||
|
|
|
||
|
|
## Consequences
|
||
|
|
|
||
|
|
- **Positive**: Total response time is bounded by the slowest single check (or 5 s), not the sum of all checks. A test with three 100 ms checks completes in ~100 ms, not ~300 ms.
|
||
|
|
- **Positive**: Hanging checks do not cause the handler to hang indefinitely.
|
||
|
|
- **Positive**: Caller-side timeouts are respected via context propagation.
|
||
|
|
- **Negative**: All checks consume resources simultaneously; there is no back-pressure or concurrency limit. For large numbers of checks this could be a concern, but typical services have a small, bounded number of infrastructure components.
|
||
|
|
- **Note**: The buffered channel of size `len(h.checks)` ensures no goroutine leaks even if the aggregation loop returns early due to panic or timeout — goroutines can still write to the channel without blocking.
|