docs(worker): correct tier from 2 to 3 and fix dependency tier refs

worker depends on launcher (now correctly Tier 2) and logz (Tier 1), placing it at Tier 3. The previous docs cited launcher as Tier 1 and logz as Tier 0, both of which were wrong.
2026-03-19 13:13:41 +00:00
commit 631c98396e
14 changed files with 713 additions and 0 deletions
--- a/docs/adr/ADR-001-drain-with-timeout-shutdown.md
+++ b/docs/adr/ADR-001-drain-with-timeout-shutdown.md
@@ -0,0 +1,45 @@
+# ADR-001: Drain-with-Timeout Shutdown
+
+**Status:** Accepted
+**Date:** 2026-03-18
+
+## Context
+
+A worker pool that stops abruptly risks silently dropping tasks that were already
+queued but not yet picked up by a goroutine. Conversely, waiting indefinitely for
+workers to finish is unsafe in production: a stuck task would prevent the process
+from exiting, blocking rolling deploys and causing orchestrators to send SIGKILL.
+
+The `launcher` lifecycle protocol gives each component an `OnStop` hook. The worker
+pool must use that hook to drain cleanly while guaranteeing a bounded exit time.
+
+## Decision
+
+`OnStop` performs a three-step drain sequence:
+
+1. **Close the task queue channel** (`close(w.taskQueue)`). This signals every
+   goroutine that is `range`-ing over the channel to exit once the buffer is empty.
+   After this point no new tasks can be dispatched: `Dispatch` would panic on a
+   send to a closed channel. This is accepted because the service is already
+   shutting down by the time `OnStop` runs.
+2. **Cancel the pool context** (`w.cancel()`). Any task currently executing that
+   respects its `ctx` argument will receive a cancellation signal and can return
+   early.
+3. **Wait with a timeout**. A goroutine calls `w.wg.Wait()` and closes a `done`
+   channel. `OnStop` then selects between `done` and `time.After(ShutdownTimeout)`.
+   If `ShutdownTimeout` is zero the implementation falls back to 30 seconds. On
+   timeout, an error is logged but `OnStop` returns `nil` so the launcher can
+   continue shutting down other components.
+
+## Consequences
+
+- Tasks already in the queue at shutdown time are drained and executed, provided
+  the workers finish within the timeout. Tasks that have not been dispatched yet,
+  and tasks still stuck when the timeout expires, may be dropped.
+- The 30-second default matches the Kubernetes default for
+  `terminationGracePeriodSeconds` (30 seconds), making the behaviour predictable
+  in containerised deployments.
+- `ShutdownTimeout` is configurable via `WORKER_SHUTDOWN_TIMEOUT` so operators can
+  tune it per environment without code changes.
+- `OnStop` always returns `nil`; a timeout is surfaced as a logged error, not a
+  returned error, so the launcher continues cleaning up other components even if
+  workers are stuck.