# ADR-001: Drain-with-Timeout Shutdown

**Status:** Accepted

**Date:** 2026-03-18

## Context

A worker pool that stops abruptly risks silently dropping tasks that were already
queued but not yet picked up by a goroutine. Conversely, waiting indefinitely for
workers to finish is unsafe in production: a stuck task would prevent the process
from exiting, blocking rolling deploys and causing orchestrators to send SIGKILL.

The `launcher` lifecycle protocol gives each component an `OnStop` hook. The worker
pool must use that hook to drain cleanly while guaranteeing a bounded exit time.

## Decision

`OnStop` performs a three-step drain sequence:

1. **Close the task queue channel** (`close(w.taskQueue)`). This signals every
   goroutine that is `range`-ing over the channel to exit once the buffer is empty.
   No new tasks can be dispatched after this point: a later `Dispatch` would panic
   on a send to a closed channel, but by the time `OnStop` runs the service is
   already shutting down.
2. **Cancel the pool context** (`w.cancel()`). Any task currently executing that
   respects its `ctx` argument will receive a cancellation signal and can return
   early.
3. **Wait with a timeout**. A goroutine calls `w.wg.Wait()` and closes a `done`
   channel. `OnStop` then selects between `done` and `time.After(ShutdownTimeout)`.
   If `ShutdownTimeout` is zero the implementation falls back to 30 seconds. On
   timeout, an error is logged but `OnStop` returns `nil` so the launcher can
   continue shutting down other components.

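The three steps can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `Task` type, the `NewWorkerPool` constructor, the parameterless `OnStop() error` signature, the log message, and the `drainDemo` helper are hypothetical. Only `taskQueue`, `cancel`, `wg`, `Dispatch`, `OnStop`, and `ShutdownTimeout` are names taken from this ADR.

```go
package main

import (
	"context"
	"log"
	"sync"
	"sync/atomic"
	"time"
)

// Task is a hypothetical task signature; the ADR only says tasks take a ctx.
type Task func(ctx context.Context)

// WorkerPool reuses the field names mentioned in the ADR.
type WorkerPool struct {
	taskQueue       chan Task
	ctx             context.Context
	cancel          context.CancelFunc
	wg              sync.WaitGroup
	ShutdownTimeout time.Duration
}

// NewWorkerPool (hypothetical) starts `workers` goroutines that range over
// the queue until it has been closed and drained.
func NewWorkerPool(workers, queueSize int, timeout time.Duration) *WorkerPool {
	ctx, cancel := context.WithCancel(context.Background())
	w := &WorkerPool{
		taskQueue:       make(chan Task, queueSize),
		ctx:             ctx,
		cancel:          cancel,
		ShutdownTimeout: timeout,
	}
	for i := 0; i < workers; i++ {
		w.wg.Add(1)
		go func() {
			defer w.wg.Done()
			for task := range w.taskQueue {
				task(w.ctx)
			}
		}()
	}
	return w
}

// Dispatch enqueues a task. Calling it after OnStop panics, because the
// channel is closed by then.
func (w *WorkerPool) Dispatch(t Task) { w.taskQueue <- t }

// OnStop follows the three-step sequence and always returns nil.
func (w *WorkerPool) OnStop() error {
	close(w.taskQueue) // step 1: workers exit once the buffer is empty
	w.cancel()         // step 2: cooperative tasks can return early

	timeout := w.ShutdownTimeout
	if timeout == 0 {
		timeout = 30 * time.Second // fallback described in step 3
	}

	// step 3: wait for the workers, but never longer than the timeout
	done := make(chan struct{})
	go func() {
		w.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(timeout):
		log.Printf("worker pool: shutdown timed out after %s", timeout)
	}
	return nil
}

// drainDemo (hypothetical) queues n quick tasks, stops the pool, and reports
// how many tasks had run by the time OnStop returned.
func drainDemo(n int) int {
	pool := NewWorkerPool(2, n, time.Second)
	var ran int64
	for i := 0; i < n; i++ {
		pool.Dispatch(func(_ context.Context) { atomic.AddInt64(&ran, 1) })
	}
	_ = pool.OnStop()
	return int(atomic.LoadInt64(&ran))
}
```

Calling `drainDemo(5)` from a `main` function returns 5 in this sketch, because the buffer holds every queued task and each task finishes well inside the timeout. Note that in this sketch the workers pass the pool context to every task, so tasks dequeued after step 2 start with an already-cancelled `ctx`; whether the real pool behaves the same way is not stated here.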
## Consequences

- Tasks already in the queue at shutdown time will execute (drain). Only tasks that
  have not been dispatched yet, or tasks that are stuck past the timeout, may be
  dropped.
- The 30-second default matches the Kubernetes default `terminationGracePeriodSeconds`
  of 30 seconds, making the behaviour predictable in containerised deployments.
- `ShutdownTimeout` is configurable via `WORKER_SHUTDOWN_TIMEOUT` so operators can
  tune it per environment without code changes.
- `OnStop` always returns `nil`; a timeout is surfaced as a logged error, not a
  returned error, so the launcher continues cleaning up other components even if
  workers are stuck.
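
One way the `WORKER_SHUTDOWN_TIMEOUT` setting could be read is sketched below. The value format is not specified in this ADR, so the sketch assumes a Go duration string such as `45s`. The helper names `parseShutdownTimeout` and `shutdownTimeoutFromEnv`, and the choice to fall back to 30 seconds on empty, invalid, zero, or negative values, are likewise assumptions made for illustration.

```go
package main

import (
	"os"
	"time"
)

// defaultShutdownTimeout is the 30-second fallback named in the ADR.
const defaultShutdownTimeout = 30 * time.Second

// parseShutdownTimeout converts a raw setting into a timeout.
// Assumption: the value is a Go duration string such as "45s" or "1m30s".
// Empty, unparsable, zero, or negative values fall back to the default.
func parseShutdownTimeout(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultShutdownTimeout
	}
	return d
}

// shutdownTimeoutFromEnv reads WORKER_SHUTDOWN_TIMEOUT from the environment.
func shutdownTimeoutFromEnv() time.Duration {
	return parseShutdownTimeout(os.Getenv("WORKER_SHUTDOWN_TIMEOUT"))
}
```

Keeping the parsing in a pure function separate from the environment lookup makes the fallback rules easy to test. An operator would then set, for example, `WORKER_SHUTDOWN_TIMEOUT=45s` in the deployment environment, assuming the duration-string format above.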