Faults

How to model operating conditions your workflow should survive.

Faults describe operating conditions around your workload. Use them to answer: “Does this workflow still behave correctly when the world around it is slow, unreliable, or interrupted?”

Fault models are how you test recovery paths without building a custom environment for every failure mode.

Availability

Network: available today.
Storage: coming soon.
Service shutdown and restart: coming soon.
Resource pressure: coming soon.
Clock and scheduler pressure: coming soon.

Faults vs mocks

Use mocks for what a service returns.

Use faults for the operating condition around the workflow.

Example:

Mock: the payment API returns 200, 503, or a declined charge.
Fault: requests are delayed, packets are lost, throughput is limited, or a dependency path is blocked.

Keep those concerns separate. When a run fails, you can then tell whether the cause was what a dependency returned or the conditions the call ran under.
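To make the split concrete, here is a minimal Python sketch of the mock side only. The function name mock_payment_response and the scenario names are hypothetical and not part of wio, and the 402 status for a declined charge is an assumption. Delay, loss, and blocked paths are deliberately absent from this code because they belong in a fault file.

```python
import json

# Hypothetical mock for a payment API: it only decides WHAT the service returns.
# Operating conditions (delay, loss, blocked paths) are not modeled here.
SCENARIOS = {
    "ok": (200, {"status": "charged"}),
    "unavailable": (503, {"error": "service unavailable"}),
    # Assumption: a declined charge is reported with HTTP 402.
    "declined": (402, {"status": "declined"}),
}


def mock_payment_response(scenario: str) -> tuple[int, str]:
    """Return (HTTP status, JSON body) for a named scenario."""
    status, body = SCENARIOS[scenario]
    return status, json.dumps(body)
```

The same mock can then be run under different fault files without changing a line of it.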

Network faults

For CLI runs, named network faults live under:

.workers/fault/net/

For example:

.workers/fault/net/slow-network.json
.workers/fault/net/intermittent-loss.json
.workers/fault/net/payment-api-blocked.json

Validate a fault model before committing or running it:

wio validate fault .workers/fault/net/slow-network.json
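If you keep several fault files, a small script can validate all of them before a commit. The sketch below is one possible approach, not a wio feature: the function names are hypothetical, and it assumes that wio validate fault exits with a non-zero status when a file is invalid.

```python
import subprocess
from pathlib import Path


def validate_commands(fault_dir: str = ".workers/fault/net") -> list[list[str]]:
    """Build one `wio validate fault <file>` command per JSON file in fault_dir."""
    files = sorted(Path(fault_dir).glob("*.json"))
    return [["wio", "validate", "fault", f.as_posix()] for f in files]


def validate_all(fault_dir: str = ".workers/fault/net") -> bool:
    """Run every command and report whether all of them succeeded.

    Assumption: wio returns a non-zero exit code for an invalid fault model.
    An empty or missing directory yields no commands and therefore True.
    """
    results = [subprocess.run(cmd).returncode == 0 for cmd in validate_commands(fault_dir)]
    return all(results)
```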

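Run a named fault by passing its name to --faults. In this example, slow-network corresponds to the file slow-network.json: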

wio simulate create <project-id> \
  --command "python3 .workers/workloads/checkout.py" \
  --workload-path ".workers/workloads/checkout.py" \
  --faults slow-network \
  --depth 20

Start with one simple fault

Begin with a mild fault that should be survivable:

{
  "version": 1,
  "lo": {
    "default": {
      "delay": {
        "time_ms": 100,
        "jitter_ms": 25
      }
    }
  }
}

Start here rather than with a severe outage. Mild faults expose bugs in retry timing, timeout budgets, and code that assumes fast responses.
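A quick calculation shows why even 100 ms of delay matters for a timeout budget. The Python sketch below is illustrative only. It assumes each delayed traversal adds at most time_ms + jitter_ms, and the traversals parameter (how many times a request and its response cross the delayed path) is a hypothetical knob, since this page does not say how the delay is applied.

```python
def worst_case_added_ms(time_ms: int, jitter_ms: int, traversals: int) -> int:
    """Upper bound on added latency per request.

    Assumption: each traversal is delayed by at most time_ms + jitter_ms.
    """
    return (time_ms + jitter_ms) * traversals


def attempts_within_budget(budget_ms: int, baseline_ms: int, added_ms: int) -> int:
    """Number of sequential attempts that fit in a total budget, ignoring backoff sleeps."""
    return budget_ms // (baseline_ms + added_ms)
```

With a hypothetical 40 ms baseline and a 1000 ms budget, worst_case_added_ms(100, 25, 2) is 250, so attempts_within_budget(1000, 40, 250) drops to 3 from 25 without the fault. A retry policy that quietly relied on many fast attempts would show up here.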

Add packet loss

Use loss to check retry and idempotency behavior:

{
  "version": 1,
  "lo": {
    "default": {
      "delay": {
        "time_ms": 100,
        "jitter_ms": 25
      },
      "loss": {
        "kind": "random",
        "percent": 1
      }
    }
  }
}

Keep the first loss model small. A 1% loss rate is often enough to expose broken retries without making every run fail for the same obvious reason.
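The kind of bug that loss tends to expose is a retry that repeats a side effect. The Python sketch below simulates that in-process, without wio: the request reaches a hypothetical PaymentLedger, the response is sometimes "lost", and the client retries. All names are illustrative, and the loss here is a random draw in code, not a real network fault.

```python
import random


class PaymentLedger:
    """Hypothetical server side: applies each idempotency key at most once."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}

    def charge(self, key: str, amount: int) -> str:
        # A repeated key returns the original result instead of charging again.
        if key not in self.charges:
            self.charges[key] = amount
        return "charged"


def charge_with_retry(ledger: PaymentLedger, key: str, amount: int,
                      loss_percent: float, rng: random.Random,
                      max_attempts: int = 5) -> bool:
    """Retry until a response arrives or attempts run out.

    Each attempt reaches the ledger; only the response may be lost,
    so the server has done the work even when the client sees nothing.
    """
    for _ in range(max_attempts):
        ledger.charge(key, amount)
        if rng.random() * 100 >= loss_percent:  # response made it back
            return True
    return False
```

Because the ledger deduplicates by key, the order is charged once no matter how many retries happen. Without the key check, every lost response would produce a duplicate charge.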

Target a service boundary

When a fault should only affect one service path, use a rule:

{
  "version": 1,
  "lo": {
    "rules": [
      {
        "match": {
          "dst": "127.0.0.1",
          "proto": "tcp",
          "dport": 8500
        },
        "delay": {
          "time_ms": 300,
          "jitter_ms": 50
        },
        "loss": {
          "kind": "random",
          "percent": 2
        }
      }
    ]
  }
}

Rules are evaluated in order. Use them when you want one dependency to be slow while the rest of the workflow stays normal.
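If you maintain several single-dependency faults, generating them from a helper keeps the files consistent. The Python sketch below builds a model using only the fields shown in the example above. The helper names rule_fault and write_fault are hypothetical, and the sketch does not assume any fields beyond those shown.

```python
import json


def rule_fault(dst: str, dport: int, time_ms: int, jitter_ms: int,
               loss_percent: int) -> dict:
    """Build a single-rule fault model for one TCP destination."""
    return {
        "version": 1,
        "lo": {
            "rules": [
                {
                    "match": {"dst": dst, "proto": "tcp", "dport": dport},
                    "delay": {"time_ms": time_ms, "jitter_ms": jitter_ms},
                    "loss": {"kind": "random", "percent": loss_percent},
                }
            ]
        },
    }


def write_fault(path: str, model: dict) -> None:
    """Write the model as indented JSON with a trailing newline."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f, indent=2)
        f.write("\n")
```

Calling rule_fault("127.0.0.1", 8500, 300, 50, 2) reproduces the model above. Validate any generated file with wio validate fault before running it.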

Useful fault models

These are the fault model families that tend to find useful reliability bugs:

Network: latency, jitter, loss, corruption, duplication, reordering, rate limits, and blocked paths.
Storage: slow reads or writes, failed writes, full disks, partial persistence, and delayed flushes.
Service shutdown and restart: a worker, API server, queue consumer, or dependency exits and comes back.
Resource pressure: CPU contention, memory pressure, file descriptor exhaustion, and low worker capacity.
Clock and scheduler pressure: timer jumps, delayed jobs, lease expiry, and long gaps between scheduled work.

Network faults are available today. The other families are coming soon.

Common network fault types

Use these when they match a real user-facing risk:

Latency: requests take longer than expected.
Jitter: request timing varies.
Loss: some packets never arrive.
Rate limits: throughput is lower than usual.
Duplication: the same packet appears more than once.
Reordering: packets arrive out of order.
Blocked path: communication is blocked for a dependency or path.

Do not combine every fault at once. A focused fault gives you a failure you can understand and fix.

Good vs bad fault models

Good:

Add 100 ms latency and 1% loss to the checkout dependency.

Bad:

Break all network traffic in every run.

The good fault tests a realistic degraded path. The bad fault proves only that the workflow cannot run without a network, which tells you nothing about its recovery paths.

Good:

Start with mild delay, then create a separate stronger fault after the workload passes.

Bad:

Use one huge fault model that mixes latency, loss, rate limits, duplication, and blocked paths.

Small fault models make debugging faster. Stronger models are useful after the simple cases pass.

Best practices

Run a baseline first.
Give each fault model one clear purpose.
Start with realistic, survivable faults.
Increase severity gradually.
Target service boundaries when possible.
Keep fault files small and named by behavior.
Use the workload invariant to decide whether the system behaved correctly.
Inspect a failed run before changing the fault model.
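A workload invariant is easier to apply when it is written as a check that returns specific violations. The Python sketch below is a hypothetical example for a checkout workload: no order is charged more than once, and every acknowledged order has a charge. The function name and the shape of its inputs are assumptions for illustration.

```python
from collections import Counter


def check_invariants(acknowledged_orders: list[str], charges: list[str]) -> list[str]:
    """Return a list of violations; an empty list means the run behaved correctly.

    acknowledged_orders: order IDs the workload told the user had succeeded.
    charges: one order ID per charge actually recorded.
    """
    violations = []
    counts = Counter(charges)
    for order_id, n in sorted(counts.items()):
        if n > 1:
            violations.append(f"{order_id}: charged {n} times")
    for order_id in acknowledged_orders:
        if counts[order_id] == 0:
            violations.append(f"{order_id}: acknowledged but never charged")
    return violations
```

Named violations like these point at the failed behavior directly, which helps when you inspect a run before touching the fault model.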

The best fault model is not the most dramatic one. It is the one that finds a bug your users could hit and leaves enough evidence to fix it.