worker/docs/adr/ADR-001-drain-with-timeout-shutdown.md
Rene Nochebuena 631c98396e docs(worker): correct tier from 2 to 3 and fix dependency tier refs
worker depends on launcher (now correctly Tier 2) and logz (Tier 1),
placing it at Tier 3. The previous docs cited launcher as Tier 1 and
logz as Tier 0, both of which were wrong.
2026-03-19 13:13:41 +00:00

ADR-001: Drain-with-Timeout Shutdown

Status: Accepted
Date: 2026-03-18

Context

A worker pool that stops abruptly risks silently dropping tasks that were already queued but not yet picked up by a goroutine. Conversely, waiting indefinitely for workers to finish is unsafe in production: a stuck task would prevent the process from exiting, blocking rolling deploys and causing orchestrators to send SIGKILL.

The launcher lifecycle protocol gives each component an OnStop hook. The worker pool must use that hook to drain cleanly while guaranteeing a bounded exit time.

Decision

OnStop performs a three-step drain sequence:

  1. Close the task queue channel (close(w.taskQueue)). This signals every goroutine that is ranging over the channel to exit once the buffer is empty. No new tasks can be dispatched after this point: Dispatch would panic on a send to a closed channel, but by the time OnStop runs the service is already shutting down, so no further dispatches are expected.
  2. Cancel the pool context (w.cancel()). Any task currently executing that respects its ctx argument will receive a cancellation signal and can return early.
  3. Wait with a timeout. A goroutine calls w.wg.Wait() and closes a done channel. OnStop then selects between done and time.After(ShutdownTimeout). If ShutdownTimeout is zero the implementation falls back to 30 seconds. On timeout, an error is logged but OnStop returns nil so the launcher can continue shutting down other components.

Consequences

  • Tasks already in the queue at shutdown time will execute (drain), provided the workers finish within ShutdownTimeout. Only tasks that have not been dispatched yet, or tasks that are still stuck or waiting when the timeout expires, may be dropped.
  • The 30-second default matches the Kubernetes default terminationGracePeriodSeconds of 30 seconds, making the behaviour predictable in containerised deployments.
  • ShutdownTimeout is configurable via WORKER_SHUTDOWN_TIMEOUT so operators can tune it per environment without code changes.
  • OnStop always returns nil; a timeout is surfaced as a logged error, not a returned error, so the launcher continues cleaning up other components even if workers are stuck.
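
As an illustration of the configurable timeout, the sketch below resolves WORKER_SHUTDOWN_TIMEOUT into a duration. The ADR does not specify the value format, so this assumes a Go duration string such as "45s"; parseShutdownTimeout and defaultShutdownTimeout are hypothetical names, and treating unparsable or non-positive values as "use the default" is an assumption rather than documented behaviour.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultShutdownTimeout mirrors the 30-second fallback described in the ADR.
const defaultShutdownTimeout = 30 * time.Second

// parseShutdownTimeout is a hypothetical helper. It assumes the variable
// holds a Go duration string and falls back to the default when the value
// is empty, unparsable, or not positive.
func parseShutdownTimeout(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultShutdownTimeout
	}
	return d
}

func main() {
	timeout := parseShutdownTimeout(os.Getenv("WORKER_SHUTDOWN_TIMEOUT"))
	fmt.Println("shutdown timeout:", timeout)
}
```

When choosing a value, it helps to keep it below the orchestrator's grace period so that the logged timeout path has a chance to run before the process is killed.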