worker depends on launcher (now correctly Tier 2) and logz (Tier 1), placing it at Tier 3. The previous docs cited launcher as Tier 1 and logz as Tier 0, both of which were wrong.
ADR-001: Drain-with-Timeout Shutdown
Status: Accepted
Date: 2026-03-18
Context
A worker pool that stops abruptly risks silently dropping tasks that were already queued but not yet picked up by a goroutine. Conversely, waiting indefinitely for workers to finish is unsafe in production: a stuck task would prevent the process from exiting, blocking rolling deploys and causing orchestrators to send SIGKILL.
The launcher lifecycle protocol gives each component an OnStop hook. The worker
pool must use that hook to drain cleanly while guaranteeing a bounded exit time.
Decision
OnStop performs a three-step drain sequence:
- Close the task queue channel (close(w.taskQueue)). This signals every goroutine that is range-ing over the channel to exit once the buffer is empty. No new tasks can be dispatched after this point: Dispatch would panic on a send to a closed channel, but by the time OnStop runs the service is already shutting down.
- Cancel the pool context (w.cancel()). Any task currently executing that respects its ctx argument will receive a cancellation signal and can return early.
- Wait with a timeout. A goroutine calls w.wg.Wait() and closes a done channel. OnStop then selects between done and time.After(ShutdownTimeout). If ShutdownTimeout is zero the implementation falls back to 30 seconds. On timeout, an error is logged but OnStop returns nil so the launcher can continue shutting down other components.
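The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the Task type, the NewWorker constructor, the OnStop() signature, and the drainDemo and stuckDemo helpers are assumptions made for the example. Only taskQueue, cancel, wg, and ShutdownTimeout are names taken from this ADR.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sync"
	"sync/atomic"
	"time"
)

// Task is a hypothetical task signature; the real one is not shown in this ADR.
type Task func(ctx context.Context)

// Worker is an illustrative stand-in for the pool described above.
type Worker struct {
	taskQueue       chan Task
	ctx             context.Context
	cancel          context.CancelFunc
	wg              sync.WaitGroup
	ShutdownTimeout time.Duration
}

// NewWorker (hypothetical) starts n goroutines that range over the queue.
func NewWorker(n, queueSize int, timeout time.Duration) *Worker {
	ctx, cancel := context.WithCancel(context.Background())
	w := &Worker{taskQueue: make(chan Task, queueSize), ctx: ctx, cancel: cancel, ShutdownTimeout: timeout}
	for i := 0; i < n; i++ {
		w.wg.Add(1)
		go func() {
			defer w.wg.Done()
			// The loop ends once the channel is closed and its buffer is empty.
			for task := range w.taskQueue {
				task(w.ctx)
			}
		}()
	}
	return w
}

// OnStop follows the three-step drain sequence and always returns nil.
func (w *Worker) OnStop() error {
	close(w.taskQueue) // step 1: stop accepting tasks, let workers drain the buffer
	w.cancel()         // step 2: ask in-flight tasks to return early

	timeout := w.ShutdownTimeout
	if timeout == 0 {
		timeout = 30 * time.Second // fallback described in the Decision
	}

	// Step 3: wait for the workers, but never longer than the timeout.
	done := make(chan struct{})
	go func() {
		w.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(timeout):
		log.Printf("worker: shutdown timed out after %s", timeout)
	}
	return nil
}

// drainDemo queues four quick tasks, stops the pool, and reports how many ran.
func drainDemo() int {
	var ran int64
	w := NewWorker(2, 4, time.Second)
	for i := 0; i < 4; i++ {
		w.taskQueue <- func(ctx context.Context) { atomic.AddInt64(&ran, 1) }
	}
	w.OnStop()
	return int(atomic.LoadInt64(&ran))
}

// stuckDemo runs a task that ignores ctx and outlives a 50ms timeout.
func stuckDemo() error {
	block := make(chan struct{})
	w := NewWorker(1, 1, 50*time.Millisecond)
	w.taskQueue <- func(ctx context.Context) { <-block }
	err := w.OnStop() // logs a timeout, still returns nil
	close(block)      // release the stuck task so the goroutine can exit
	return err
}

func main() {
	fmt.Println("tasks executed during drain:", drainDemo())
	fmt.Println("OnStop result after timeout:", stuckDemo())
}
```

In this sketch the stuck task is abandoned rather than interrupted: the timeout bounds how long OnStop waits, not how long the task runs.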
Consequences
- Tasks already in the queue at shutdown time will execute (drain). Only tasks that have not been dispatched yet, or tasks that are stuck past the timeout, may be dropped.
- The 30-second default matches common Kubernetes terminationGracePeriodSeconds defaults, making the behaviour predictable in containerised deployments.
- ShutdownTimeout is configurable via WORKER_SHUTDOWN_TIMEOUT so operators can tune it per environment without code changes.
- OnStop always returns nil; a timeout is surfaced as a logged error, not a returned error, so the launcher continues cleaning up other components even if workers are stuck.
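As an illustration of the configuration path, the sketch below reads WORKER_SHUTDOWN_TIMEOUT and applies the 30-second fallback. The value format is an assumption: this ADR does not say whether the variable holds a Go duration string (such as "45s") or a plain number, and the example assumes the former. The function names are hypothetical, and treating unparsable or negative values like zero is a choice made for this sketch only.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultShutdownTimeout mirrors the 30-second fallback from the Decision.
const defaultShutdownTimeout = 30 * time.Second

// parseShutdownTimeout (hypothetical) converts a raw setting to a duration.
// Assumption: the value is a Go duration string such as "45s" or "2m".
// Empty, unparsable, zero, or negative values fall back to the default.
func parseShutdownTimeout(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultShutdownTimeout
	}
	return d
}

// shutdownTimeoutFromEnv (hypothetical) reads WORKER_SHUTDOWN_TIMEOUT.
// os.Getenv returns "" when the variable is unset, which yields the default.
func shutdownTimeoutFromEnv() time.Duration {
	return parseShutdownTimeout(os.Getenv("WORKER_SHUTDOWN_TIMEOUT"))
}

func main() {
	fmt.Println("effective shutdown timeout:", shutdownTimeoutFromEnv())
}
```

Keeping the parsing in a pure function makes the fallback easy to unit-test without touching the process environment.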