Why per-step SLOs are hard in DAG-based internal workflows
When internal workflows become DAGs (data pipelines, approval flows, enrichment jobs, backfills, report builds), teams often track only the overall runtime and failure rate. That’s useful, but it hides the real problem: one “slow” step can quietly eat most of the budget and degrade the whole workflow even if the final status is technically “success.”
Per-step SLOs solve that by making each node accountable. The catch is that many teams assume they need a custom orchestrator or a large platform-engineering project to enforce step budgets. In practice, you can get most of the value by using OpenTelemetry spans as the enforcement primitive: every step emits spans with consistent naming and attributes, and your tooling evaluates those spans against budgets.
Model a workflow step as an enforceable span
The key idea is simple: treat each DAG node as a span boundary, not just a log boundary. That means:
- One trace per workflow run (the entire DAG execution).
- One span per step (each node in the DAG).
- Span attributes that carry the enforcement context (step name, workflow name, environment, retry count, tenant, etc.).
In OpenTelemetry terms, a step span should have a stable, queryable identity. A practical convention is:
- Root span name: workflow identifier (e.g., workflow.invoice_reconcile); OpenTelemetry traces themselves are unnamed, so the root span carries the workflow identity.
- Step span name: step identifier (e.g., step.fetch_ledger).
- Attributes: workflow.name, step.name, run.id, attempt, dag.node_id, service.name, deployment.environment.
This gives you a durable contract: dashboards, alerts, and SLO evaluation can key off attributes instead of brittle string parsing.
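For concreteness, here is a minimal sketch of that convention using the OpenTelemetry Python API. The workflow name, run ID, and attribute values are illustrative placeholders, not a required schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("workflows")

def run_workflow(run_id: str):
    # Root span: one per workflow run, named after the workflow.
    with tracer.start_as_current_span(
        "workflow.invoice_reconcile",
        attributes={"workflow.name": "invoice_reconcile", "run.id": run_id},
    ):
        # Child span: one per DAG node, named after the step.
        with tracer.start_as_current_span(
            "step.fetch_ledger",
            attributes={
                "workflow.name": "invoice_reconcile",
                "step.name": "fetch_ledger",
                "dag.node_id": "fetch_ledger",
                "run.id": run_id,
                "attempt": 1,
                "deployment.environment": "production",
            },
        ):
            pass  # step body goes here
```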
Define per-step SLOs as budgets, not vibes
Per-step SLOs should be written like budgets you can enforce:
- Latency objective: e.g., p95 < 2s for step.fetch_ledger in production.
- Error objective: e.g., < 0.5% span error rate for step.enrich_customer.
- Freshness/timeout objective (optional): e.g., hard timeout at 30s for a supplier API call step.
Two practical rules keep this from turning into an unmaintainable spreadsheet:
- Start with the steps that dominate runtime or incident load, not every node.
- Separate “steady-state” SLOs from “backfill” or “bulk” modes using attributes like workflow.mode.
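One way to make these budgets machine-readable is a small lookup keyed by step name and workflow mode. The schema below is an illustrative sketch, not a standard format; the backfill entry simply shows how the same step can carry a looser budget in bulk mode.

```python
# Illustrative per-step budgets, keyed by (step name, workflow mode).
STEP_SLOS = {
    ("step.fetch_ledger", "steady"):    {"p95_latency_s": 2.0,  "max_error_rate": 0.005},
    ("step.enrich_customer", "steady"): {"p95_latency_s": 5.0,  "max_error_rate": 0.005},
    ("step.fetch_ledger", "backfill"):  {"p95_latency_s": 30.0, "max_error_rate": 0.01},
}

def budget_for(step_name: str, mode: str):
    """Look up the budget for a step, falling back to the steady-state entry."""
    return STEP_SLOS.get((step_name, mode)) or STEP_SLOS.get((step_name, "steady"))
```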
Instrumenting steps with OpenTelemetry spans
You do not need a custom orchestrator to emit spans. You need consistent instrumentation in the code that runs each node. If your workflow engine runs scripts, containers, or functions, each unit of work can create a span at the top of the step and close it at completion.
At a minimum, each step span should capture:
- Start/end time (automatic in spans).
- Status: OK vs ERROR.
- Failure reason as an attribute (sanitized), plus an event for the exception type.
- Retry metadata: attempt number, whether it’s a retry, and upstream dependency info.
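A minimal sketch of a generic step wrapper that captures all of the above; the wrapper name and signature are hypothetical, but record_exception and set_status are the standard OpenTelemetry Python calls.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("workflows")

def run_step(step_name: str, fn, *, attempt: int = 1, **attrs):
    # Exception handling is explicit here, so disable the automatic hooks.
    with tracer.start_as_current_span(
        f"step.{step_name}",
        attributes={"step.name": step_name, "attempt": attempt, "retry": attempt > 1, **attrs},
        record_exception=False,
        set_status_on_exception=False,
    ) as span:
        try:
            result = fn()
            span.set_status(Status(StatusCode.OK))
            return result
        except Exception as exc:
            # Exception as a span event, plus a sanitized reason attribute.
            span.record_exception(exc)
            span.set_attribute("step.failure_reason", type(exc).__name__)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```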
One subtle but important point: retries can distort percentiles if you don’t model them carefully. Consider two approaches:
- Single span per logical step with events for retries (good for “user-visible” latency).
- One span per attempt with an attempt attribute (good for diagnosing flaky dependencies).
You can support both by nesting: a parent step span and child attempt spans.
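Here is a sketch of that nesting, assuming a simple fixed-backoff retry loop; the retry policy and span naming are illustrative.

```python
import time
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("workflows")

def run_with_retries(step_name: str, fn, max_attempts: int = 3):
    # Parent span covers the user-visible latency of the logical step.
    with tracer.start_as_current_span(f"step.{step_name}") as step_span:
        last_exc = None
        for attempt in range(1, max_attempts + 1):
            # One child span per attempt, for diagnosing flaky dependencies.
            with tracer.start_as_current_span(
                f"step.{step_name}.attempt",
                attributes={"attempt": attempt},
                record_exception=False,
                set_status_on_exception=False,
            ) as attempt_span:
                try:
                    result = fn()
                    attempt_span.set_status(Status(StatusCode.OK))
                    return result
                except Exception as exc:
                    last_exc = exc
                    attempt_span.record_exception(exc)
                    attempt_span.set_status(Status(StatusCode.ERROR, str(exc)))
            time.sleep(1)  # fixed backoff between attempts (placeholder)
        step_span.set_status(Status(StatusCode.ERROR, "all attempts failed"))
        raise last_exc
```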
Enforcement patterns without building a custom orchestrator
“Enforce” can mean several things operationally. OpenTelemetry spans let you implement enforcement as policy around execution, not as new scheduling software.
1) Fast feedback during execution with timeouts
The simplest enforcement is a hard timeout per step. If a step’s SLO is “must finish in 10s,” the code can enforce a 10s timeout and mark the span as ERROR on timeout. That prevents slow degradation from consuming downstream capacity.
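A minimal sketch of a per-step hard timeout, assuming the step body can run in a worker thread; note that Python cannot forcibly kill that thread, so truly hard cutoffs still need cooperative cancellation inside the step.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("workflows")

def run_with_timeout(step_name: str, fn, timeout_s: float = 10.0):
    pool = ThreadPoolExecutor(max_workers=1)
    with tracer.start_as_current_span(f"step.{step_name}") as span:
        span.set_attribute("step.timeout_s", timeout_s)
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except FuturesTimeout:
            # Mark the budget breach; the worker thread is not killed.
            span.set_status(Status(StatusCode.ERROR, "step timeout exceeded"))
            raise
        finally:
            pool.shutdown(wait=False)
```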
2) Post-run gating and automatic triage
Some workflows shouldn’t fail just because a step missed its p95 budget once. Instead, evaluate spans after the run and choose an action:
- Open an incident or page if the burn rate is high for that step.
- Quarantine the workflow version if a new deploy caused systematic regression.
- Create an auto-ticket with the top slow spans and their attributes.
This is where span attributes pay off: you can automatically bucket regressions by step, tenant, dependency, or environment.
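A post-run evaluation can stay very small. The sketch below assumes finished step spans have already been exported somewhere queryable and arrive here as plain dicts with a duration and status; the thresholds and action names are illustrative.

```python
from statistics import quantiles

def evaluate_step(spans: list, budget: dict) -> str:
    """Map one step's recent spans to an action: 'page', 'open_ticket', or 'ok'."""
    if not spans:
        return "ok"  # nothing to evaluate
    durations = [s["duration_s"] for s in spans]
    errors = sum(1 for s in spans if s["status"] == "ERROR")
    p95 = quantiles(durations, n=20)[18] if len(durations) >= 2 else durations[0]
    error_rate = errors / len(spans)

    if error_rate > 10 * budget["max_error_rate"]:
        return "page"          # burning budget fast: wake someone up
    if p95 > budget["p95_latency_s"] or error_rate > budget["max_error_rate"]:
        return "open_ticket"   # steady degradation: triage asynchronously
    return "ok"
```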
3) Dependency-aware budgets in DAGs
DAGs introduce a unique problem: some steps are allowed to be slow only if upstream steps are fast (or vice versa). You can express that through trace structure:
- Critical path analysis: use spans to compute which steps dominate the end-to-end runtime.
- Queue vs execution time: split spans into “queued” and “running” so you don’t punish a step for worker saturation.
This avoids blaming the wrong node when the real issue is scheduling or resource contention.
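The queue/execution split is easy to express with child spans. In the sketch below, wait_for_worker is a hypothetical hook for whatever your engine does while a node waits for a worker slot.

```python
from opentelemetry import trace

tracer = trace.get_tracer("workflows")

def run_step_with_queue_split(step_name: str, wait_for_worker, fn):
    with tracer.start_as_current_span(f"step.{step_name}"):
        # Time spent waiting for a worker slot (scheduling/saturation).
        with tracer.start_as_current_span(f"step.{step_name}.queued"):
            wait_for_worker()
        # Time spent actually executing the node.
        with tracer.start_as_current_span(f"step.{step_name}.running"):
            return fn()
```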
How Windmill fits naturally into this approach
Windmill is designed around DAG workflows and production monitoring, so it’s a natural place to standardize step instrumentation and SLO evaluation without building yet another orchestration layer. With a code-first workflow model and deep observability, you can keep the enforcement logic close to the steps themselves while still centralizing how you view and alert on spans.
If your team already exports traces and metrics, integrating workflow execution with OpenTelemetry-friendly conventions makes it much easier to build consistent SLO reporting across many scripts and services. Windmill also supports exporting to OpenTelemetry and Prometheus, which helps you keep vendor choice open while still enforcing step-level budgets in one workflow system. The project home is windmill.dev.
Operational details that make per-step SLOs actually work
Naming and cardinality discipline
Span attributes are powerful, but high-cardinality fields can explode cost and reduce signal. Keep step.name and workflow.name stable, and be cautious with raw user IDs or unbounded payload identifiers. If you need per-tenant breakdowns, use a controlled tenant.id drawn from a bounded, known set of values.
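A tiny illustrative guard is often enough: map raw tenant identifiers onto a bounded, known set before attaching them to spans.

```python
KNOWN_TENANTS = {"acme", "globex", "initech"}  # placeholder allow-list

def tenant_attribute(raw_tenant: str) -> str:
    # Bucket unknown tenants so the attribute stays low-cardinality.
    return raw_tenant if raw_tenant in KNOWN_TENANTS else "other"
```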
Separate “step correctness” from “step performance”
A step can be correct but slow, or fast but wrong. Use spans for performance and reliability signals, and pair them with application-level checks (counts, invariants, row deltas) when correctness matters.
Alerts tied to actions
A per-step SLO is only useful if it changes what happens next. Make the action explicit: page, auto-rollback, throttle, route to a different worker group, or open a ticket with trace links. If the action is unclear, the SLO will become noise.
Two workflow maintainability patterns worth borrowing
Per-step SLOs also influence how you design the DAG. Branching steps with different performance characteristics should be explicit and named, not hidden in one “do_everything” node. If you’re refining how you structure these branches, the article on branching logic patterns to keep no-code workflows maintainable is a useful companion.
And if you’re dealing with urgent operational work triggered by step regressions, having a lightweight triage system matters as much as the tracing. The post on avoiding the priority inversion backlog trap maps well to how SLO burn alerts can otherwise hijack your roadmap.
What you get by treating spans as the enforcement layer
By elevating spans from “nice to have” observability to “the contract” for step budgets, you gain a shared language across teams: performance regressions are traceable to specific DAG nodes, SLO ownership becomes concrete, and enforcement can be implemented with timeouts, gates, and targeted alerts—without building a custom orchestrator just to answer “which step broke the budget?”



