# ADR-001: Circuit Breaker and Retry via gobreaker and avast/retry-go

**Status:** Accepted

**Date:** 2026-03-18

## Context

Outbound HTTP calls to external services are subject to transient failures (network blips, brief service restarts) and sustained failures (outages, overloads). Two complementary strategies address these cases:

- **Retry** recovers from transient failures by re-attempting the request a limited number of times before giving up.
- **Circuit breaking** detects sustained failure patterns and stops sending requests to a failing service, giving it time to recover and preventing the caller from accumulating blocked goroutines.

Implementing both from scratch introduces risk of subtle bugs (backoff arithmetic, state-machine transitions). Well-tested, widely adopted libraries are preferable.

## Decision

Two external libraries are composed:

**Retry: `github.com/avast/retry-go/v4`**

- Configured via `Config.MaxRetries` and `Config.RetryDelay`.
- Uses `retry.BackOffDelay` (exponential backoff) to avoid hammering a failing service.
- `retry.LastErrorOnly(true)` ensures only the final error from the retry loop is reported.
- Only HTTP 5xx responses trigger a retry; 4xx responses are not retried (they represent caller errors, not server instability).

**Circuit breaker: `github.com/sony/gobreaker`**
- Configured via `Config.CBThreshold` (consecutive failures to trip) and `Config.CBTimeout` (time spent in the open state before transitioning to half-open).
- The retry loop runs inside the circuit breaker's `Execute` call, so a full retry sequence registers as a single failure with the circuit breaker only when every retry attempt fails.
- When the circuit opens, `Do` returns `xerrors.ErrUnavailable` immediately, without attempting the network call.
- State changes are logged via the duck-typed `Logger` interface.

The nesting order (circuit breaker wraps retry) is intentional: the circuit breaker accumulates failures at the level of "did the request ultimately succeed after retries", not at the level of individual attempts.

## Consequences

**Positive:**

- Transient failures are handled transparently on the caller's behalf.
- Sustained outages are detected quickly: the circuit opens and returns fast errors.
- Configuration is explicit and environment-variable driven.
- Circuit state changes are observable via logs.

**Negative:**
- Retry with exponential backoff increases total latency for failing requests; with the delay doubling before each re-attempt, the cumulative wait approaches `RetryDelay * (2^MaxRetries - 1)` in the worst case (a geometric sum of the backoff intervals).
- The circuit breaker counts only consecutive failures (`ConsecutiveFailures >= CBThreshold`), not a rolling failure rate; interleaved successes reset the counter.
- `gobreaker.ErrOpenState` is wrapped in `xerrors.ErrUnavailable`, so callers must check for this specific code to distinguish a circuit-open condition from an ordinary 503 response.