docs(worker): correct tier from 2 to 3 and fix dependency tier refs
worker depends on launcher (now correctly Tier 2) and logz (Tier 1), placing it at Tier 3. The previous docs cited launcher as Tier 1 and logz as Tier 0, both of which were wrong.
docs/adr/ADR-001-drain-with-timeout-shutdown.md (new file, 45 lines)

@@ -0,0 +1,45 @@
# ADR-001: Drain-with-Timeout Shutdown

**Status:** Accepted
**Date:** 2026-03-18

## Context

A worker pool that stops abruptly risks silently dropping tasks that were already
queued but not yet picked up by a goroutine. Conversely, waiting indefinitely for
workers to finish is unsafe in production: a stuck task would prevent the process
from exiting, blocking rolling deploys and causing orchestrators to send SIGKILL.

The `launcher` lifecycle protocol gives each component an `OnStop` hook. The worker
pool must use that hook to drain cleanly while guaranteeing a bounded exit time.
## Decision

`OnStop` performs a three-step drain sequence:

1. **Close the task queue channel** (`close(w.taskQueue)`). This signals every
   goroutine that is `range`-ing over the channel to exit once the buffer is empty.
   No new tasks can be dispatched after this point — `Dispatch` would panic on a
   send to a closed channel, but by the time `OnStop` runs the service is already
   shutting down.
2. **Cancel the pool context** (`w.cancel()`). Any task currently executing that
   respects its `ctx` argument will receive a cancellation signal and can return
   early.
3. **Wait with a timeout.** A goroutine calls `w.wg.Wait()` and closes a `done`
   channel. `OnStop` then selects between `done` and `time.After(ShutdownTimeout)`.
   If `ShutdownTimeout` is zero the implementation falls back to 30 seconds. On
   timeout, an error is logged but `OnStop` returns `nil` so the launcher can
   continue shutting down other components.
## Consequences

- Tasks already in the queue at shutdown time will execute (drain). Only tasks that
  have not been dispatched yet — or tasks that are stuck past the timeout — may be
  dropped.
- The 30-second default matches common Kubernetes `terminationGracePeriodSeconds`
  defaults, making the behaviour predictable in containerised deployments.
- `ShutdownTimeout` is configurable via `WORKER_SHUTDOWN_TIMEOUT` so operators can
  tune it per environment without code changes.
- `OnStop` always returns `nil`; a timeout is surfaced as a logged error, not a
  returned error, so the launcher continues cleaning up other components even if
  workers are stuck.
docs/adr/ADR-002-per-task-timeout.md (new file, 46 lines)

@@ -0,0 +1,46 @@
# ADR-002: Per-Task Timeout via Child Context

**Status:** Accepted
**Date:** 2026-03-18

## Context

Worker tasks can call external services, run database queries, or perform other
operations with unpredictable latency. A single slow or hung task occupying a
goroutine indefinitely degrades overall pool throughput. Without a bounded
execution time, one bad task can block a worker slot for the lifetime of the
process.

At the same time, a blanket timeout should not be imposed when callers have not
requested one — a zero timeout (polling or batch jobs) is a legitimate use case.
## Decision

`Config` exposes a `TaskTimeout time.Duration` field (env `WORKER_TASK_TIMEOUT`,
default `0s`). Each worker goroutine checks this value before calling a task:

- If `TaskTimeout > 0`, a `context.WithTimeout(ctx, w.cfg.TaskTimeout)` child
  context is created and its `cancel` function is deferred after the call.
- If `TaskTimeout == 0`, the pool root context is passed through unchanged and a
  no-op cancel function is used.

The task receives the (possibly deadline-bearing) context as its only
`context.Context` argument. It is the task's responsibility to respect
cancellation; the pool does not forcibly terminate goroutines.

`cancel()` is called immediately after the task returns, regardless of whether the
task succeeded or failed, to release the timer resource promptly.
## Consequences

- Tasks that respect `ctx.Done()` or pass `ctx` to downstream calls are
  automatically bounded by `TaskTimeout`.
- Tasks that ignore their context will not be forcibly killed; the timeout becomes
  a best-effort signal only. This is a deliberate trade-off — Go provides no way to
  kill a goroutine from the outside.
- Setting `TaskTimeout = 0` is a safe default: no deadline is added, and no timer
  resource is allocated per task.
- `TaskTimeout` is independent of `ShutdownTimeout`. A task may have a 5-second
  execution timeout while the pool allows 30 seconds to drain during shutdown.
- The timeout context is a child of the pool root context, so cancelling the pool
  (via `OnStop`) also cancels any running task context, regardless of `TaskTimeout`.
docs/adr/ADR-003-channel-task-queue.md (new file, 53 lines)

@@ -0,0 +1,53 @@
# ADR-003: Channel-Based Buffered Task Queue

**Status:** Accepted
**Date:** 2026-03-18

## Context

A worker pool requires a mechanism to hand off work from callers to goroutines.
Common options include a mutex-protected slice, a ring buffer, or a Go channel.
The pool must support multiple concurrent producers (callers of `Dispatch`) and
multiple concurrent consumers (worker goroutines), while providing a simple
backpressure signal when capacity is exhausted.
## Decision

The task queue is a buffered `chan Task` with capacity `Config.BufferSize` (env
`WORKER_BUFFER_SIZE`, default 100). All worker goroutines receive from the same
channel using `for task := range w.taskQueue`. Producers call `Dispatch`, which
uses a non-blocking `select` with a `default` branch:

```go
select {
case w.taskQueue <- task:
	return true
default:
	// queue full — log and return false
	return false
}
```

`Dispatch` returns `bool`: `true` if the task was enqueued, `false` if the queue
was full. The caller decides what to do with a rejected task (retry, log, discard).

Closing the channel in `OnStop` is the drain signal: `range` over a closed channel
drains buffered items and then exits naturally, so no separate "stop" message is
needed.
## Consequences

- The channel scheduler distributes tasks across all `PoolSize` goroutines without
  any additional synchronisation code.
- Backpressure is explicit: a full queue returns `false` rather than blocking the
  caller or growing unboundedly. Callers that must not drop tasks should implement
  retry logic at their layer.
- Channel capacity is fixed at construction time. There is no dynamic resizing; if
  the workload consistently fills the buffer, `BufferSize` or `PoolSize` must be
  tuned in config.
- Closing the channel is a one-way signal: once `OnStop` closes it, `Dispatch` must
  not be called again. This is safe in practice because `launcher` ensures `OnStop`
  is only called after the application has stopped dispatching work, but there is no
  runtime guard against misuse.
- The `for range` pattern requires no sentinel values and is idiomatic Go for
  fan-out worker pools.