# ADR-001: Drain-with-Timeout Shutdown

**Status:** Accepted

**Date:** 2026-03-18

## Context

A worker pool that stops abruptly risks silently dropping tasks that were already queued but not yet picked up by a goroutine. Conversely, waiting indefinitely for workers to finish is unsafe in production: a stuck task would prevent the process from exiting, blocking rolling deploys and causing orchestrators to send SIGKILL.

The `launcher` lifecycle protocol gives each component an `OnStop` hook. The worker pool must use that hook to drain cleanly while guaranteeing a bounded exit time.

## Decision

`OnStop` performs a three-step drain sequence:

1. **Close the task queue channel** (`close(w.taskQueue)`). This signals every goroutine that is `range`-ing over the channel to exit once the buffer is empty. No new tasks can be dispatched after this point: `Dispatch` would panic on a send to a closed channel, but by the time `OnStop` runs the service is already shutting down.
2. **Cancel the pool context** (`w.cancel()`). Any task currently executing that respects its `ctx` argument will receive a cancellation signal and can return early.
3. **Wait with a timeout**. A goroutine calls `w.wg.Wait()` and closes a `done` channel. `OnStop` then selects between `done` and `time.After(ShutdownTimeout)`. If `ShutdownTimeout` is zero, the implementation falls back to 30 seconds. On timeout, an error is logged but `OnStop` returns `nil` so the launcher can continue shutting down other components.

## Consequences

- Tasks already in the queue at shutdown time will execute (drain). Only tasks that have not been dispatched yet, or tasks that are stuck past the timeout, may be dropped.
- The 30-second default matches the Kubernetes default for `terminationGracePeriodSeconds` (30 seconds), making the behaviour predictable in containerised deployments.
- `ShutdownTimeout` is configurable via `WORKER_SHUTDOWN_TIMEOUT` so operators can tune it per environment without code changes.
- `OnStop` always returns `nil`; a timeout is surfaced as a logged error, not a returned error, so the launcher continues cleaning up other components even if workers are stuck.