# ADR-001: Circuit Breaker and Retry via gobreaker and avast/retry-go
**Status:** Accepted
**Date:** 2026-03-18
## Context
Outbound HTTP calls to external services are subject to transient failures (network blips,
brief service restarts) and sustained failures (outages, overloads). Two complementary
strategies address these cases:
- **Retry** recovers from transient failures by re-attempting the request a limited number
of times before giving up.
- **Circuit breaking** detects sustained failure patterns and stops sending requests to a
failing service, giving it time to recover and preventing the caller from accumulating
blocked goroutines.
Implementing both from scratch risks subtle bugs (backoff arithmetic, state-machine
transitions); well-tested, widely adopted libraries are preferable.
## Decision
Two external libraries are composed:
**Retry: `github.com/avast/retry-go/v4`**
- Configured via `Config.MaxRetries` and `Config.RetryDelay`.
- Uses `retry.BackOffDelay` (exponential backoff) to avoid hammering a failing service.
- `retry.LastErrorOnly(true)` ensures only the final error from the retry loop is reported.
- Only HTTP 5xx responses trigger a retry. 4xx responses are not retried (they represent
caller errors, not server instability).
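A minimal sketch of this configuration follows. The option names are actual retry-go/v4
APIs; `attemptWithRetry` and its parameters are illustrative, and the request is assumed
to have no body (or a rewindable one), since a consumed body cannot be re-sent.
```go
package httpclient

import (
	"fmt"
	"net/http"
	"time"

	"github.com/avast/retry-go/v4"
)

// attemptWithRetry issues req with exponential backoff starting from
// baseDelay. Only transport errors and HTTP 5xx responses are retried;
// 4xx responses are returned to the caller as-is.
func attemptWithRetry(client *http.Client, req *http.Request,
	maxRetries uint, baseDelay time.Duration) (*http.Response, error) {
	var resp *http.Response
	err := retry.Do(
		func() error {
			r, err := client.Do(req)
			if err != nil {
				return err // transport error: retryable
			}
			if r.StatusCode >= 500 {
				r.Body.Close()
				return fmt.Errorf("server error: %s", r.Status) // 5xx: retryable
			}
			resp = r // 2xx-4xx: stop; 4xx is a caller error, not instability
			return nil
		},
		retry.Attempts(maxRetries),          // Config.MaxRetries (retry-go counts total attempts)
		retry.Delay(baseDelay),              // Config.RetryDelay
		retry.DelayType(retry.BackOffDelay), // exponential backoff between attempts
		retry.LastErrorOnly(true),           // surface only the final error
	)
	return resp, err
}
```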
**Circuit breaker: `github.com/sony/gobreaker`**
- Configured via `Config.CBThreshold` (consecutive failures to trip) and `Config.CBTimeout`
(time in open state before transitioning to half-open).
- The retry loop runs inside the circuit breaker's `Execute` call. A full retry sequence
counts as a single request from the circuit breaker's perspective, and is recorded as a
failure only if every attempt fails.
- When the circuit opens, `Do` returns `xerrors.ErrUnavailable` immediately, without
attempting the network call.
- State changes are logged via the duck-typed `Logger` interface.
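A sketch of the breaker construction under these settings. The `Settings` fields are real
gobreaker APIs; the `Logger` shape shown here (a `Printf` method) is an assumption about
the duck-typed interface.
```go
package httpclient

import (
	"time"

	"github.com/sony/gobreaker"
)

// Logger is a stand-in for the duck-typed logging interface; the Printf
// signature is assumed for this sketch.
type Logger interface {
	Printf(format string, args ...any)
}

// newBreaker wires Config.CBThreshold and Config.CBTimeout into gobreaker.
func newBreaker(threshold uint32, timeout time.Duration, log Logger) *gobreaker.CircuitBreaker {
	return gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    "httpclient",
		Timeout: timeout, // time spent open before probing half-open
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			// Trip on consecutive failures only; interleaved successes
			// reset the count (see Consequences below).
			return counts.ConsecutiveFailures >= threshold
		},
		OnStateChange: func(name string, from, to gobreaker.State) {
			log.Printf("circuit %q: %s -> %s", name, from, to)
		},
	})
}
```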
The nesting order (circuit breaker wraps retry) is intentional: the circuit breaker
accumulates failures at the level of "did the request ultimately succeed after retries",
not at the level of individual attempts.
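Put together, `Do` would look roughly like the following sketch. The `Client` shape, the
`xerrors` import path, and the `%w` wrapping of `xerrors.ErrUnavailable` are assumptions
consistent with the description above.
```go
package httpclient

import (
	"errors"
	"fmt"
	"net/http"
	"time"

	"github.com/sony/gobreaker"

	"example.com/httpclient/xerrors" // hypothetical import path
)

// Client is a sketch of the composed client.
type Client struct {
	hc         *http.Client
	breaker    *gobreaker.CircuitBreaker
	maxRetries uint
	retryDelay time.Duration
}

// Do runs the full retry loop inside one Execute call, so the breaker
// records one success or failure per retry sequence, not per attempt.
func (c *Client) Do(req *http.Request) (*http.Response, error) {
	res, err := c.breaker.Execute(func() (any, error) {
		return attemptWithRetry(c.hc, req, c.maxRetries, c.retryDelay)
	})
	if errors.Is(err, gobreaker.ErrOpenState) {
		// Circuit is open: fail fast without touching the network.
		return nil, fmt.Errorf("%w: circuit open", xerrors.ErrUnavailable)
	}
	if err != nil {
		return nil, err
	}
	return res.(*http.Response), nil
}
```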
## Consequences
**Positive:**
- Transient failures are handled transparently from the caller's perspective; the caller
sees only the final outcome.
- Sustained outages are detected quickly and the circuit opens, returning fast errors.
- Configuration is explicit and environment-variable driven.
- Circuit state changes are observable via logs.
**Negative:**
- Retry with backoff increases total latency for failing requests: with exponential backoff,
the cumulative delay is up to `RetryDelay * (2^MaxRetries - 1)` in the worst case (e.g.,
`MaxRetries = 3` retries with `RetryDelay = 100ms` waits up to 100 + 200 + 400 = 700ms
before the final error surfaces).
- The circuit breaker counts only consecutive failures (`ConsecutiveFailures >= CBThreshold`),
not a rolling failure rate. Interleaved successes reset the counter.
- `gobreaker.ErrOpenState` is wrapped in `xerrors.ErrUnavailable`, so callers must check for
that sentinel (e.g., with `errors.Is`) to distinguish a tripped circuit from an ordinary
503 response.
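A caller-side check, assuming the `%w` wrapping shown above:
```go
resp, err := client.Do(req)
switch {
case errors.Is(err, xerrors.ErrUnavailable):
	// Circuit is open: no request was sent; back off or fail over.
case err != nil:
	// Transport failure or retries exhausted.
default:
	defer resp.Body.Close()
	// Handle resp, including any non-retried 4xx status.
}
```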