# Faults

How to model operating conditions your workflow should survive.

Faults describe operating conditions around your workload. Use them to answer: "Does this workflow still behave correctly when the world around it is slow, unreliable, or interrupted?"

Fault models are how you test recovery paths without building a custom environment for every failure mode.

## Availability

- **Network**: available today.
- **Storage**: coming soon.
- **Service shutdown and restart**: coming soon.
- **Resource pressure**: coming soon.
- **Clock and scheduler pressure**: coming soon.

## Faults vs mocks

Use mocks for what a service returns.

Use faults for the operating condition around the workflow.

Example:

- Mock: the payment API returns `200`, `503`, or a declined charge.
- Fault: requests are delayed, packets are lost, throughput is limited, or a dependency path is blocked.

Keep those concerns separate. When a run fails, you can then tell whether the workflow mishandled a response or could not tolerate the conditions around it.
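To make the mock side of that split concrete, here is a minimal sketch of a mock that only decides what the payment API returns. The names, response bodies, and the `402` status for a declined charge are hypothetical and not taken from any real payment API. The mock carries no timing or delivery behavior; delay and loss belong in a fault model.

```python
from typing import NamedTuple


class Response(NamedTuple):
    status: int
    body: dict


# Hypothetical canned responses; a real mock would mirror the real API's schema.
CANNED = {
    "ok": Response(200, {"charge": "accepted"}),
    "unavailable": Response(503, {"error": "service unavailable"}),
    "declined": Response(402, {"charge": "declined"}),
}


def mock_payment_api(scenario: str) -> Response:
    """Return the canned response for a scenario. No delay or loss is modeled here."""
    return CANNED[scenario]
```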

## Network faults

For CLI runs, named network faults live under:

```text
.workers/fault/net/
```

For example:

```text
.workers/fault/net/slow-network.json
.workers/fault/net/intermittent-loss.json
.workers/fault/net/payment-api-blocked.json
```

Validate a fault model before committing or running it:

```bash
wio validate fault .workers/fault/net/slow-network.json
```
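If you keep several models in the directory, a small wrapper can run the same command once per file. This is a sketch, not part of the CLI: the helper names are hypothetical, and it assumes `wio` is on your `PATH` and exits non-zero when a model is invalid, which this page does not state.

```python
import subprocess
from pathlib import Path

FAULT_DIR = Path(".workers/fault/net")


def validate_commands(fault_dir: Path = FAULT_DIR) -> list[list[str]]:
    """Build one `wio validate fault <file>` command per JSON file, sorted by name."""
    return [
        ["wio", "validate", "fault", path.as_posix()]
        for path in sorted(fault_dir.glob("*.json"))
    ]


def validate_all(fault_dir: Path = FAULT_DIR) -> list[str]:
    """Run each command and return the files whose validation exited non-zero.

    Assumption: a non-zero exit code means the model is invalid.
    """
    failed = []
    for cmd in validate_commands(fault_dir):
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            failed.append(cmd[-1])
    return failed
```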

Run a named fault with:

```bash
wio simulate create <project-id> \
  --command "python3 .workers/workloads/checkout.py" \
  --workload-path ".workers/workloads/checkout.py" \
  --faults slow-network \
  --depth 20
```

## Start with one simple fault

Begin with a mild fault that should be survivable:

```json
{
  "version": 1,
  "lo": {
    "default": {
      "delay": {
        "time_ms": 100,
        "jitter_ms": 25
      }
    }
  }
}
```

This is better than starting with a severe outage. Mild faults find bugs in retry timing, timeout budgets, and assumptions about fast responses.
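As an illustration of the kind of code a mild delay exercises, here is a minimal sketch of a retry loop that respects an overall timeout budget. The function name, its parameters, and the injectable clock are hypothetical and not part of any workload API. The point is that added latency consumes budget on every attempt, so each attempt's timeout must shrink as the deadline approaches.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_budget(
    operation: Callable[[float], T],
    budget_s: float,
    per_attempt_timeout_s: float,
    backoff_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` until it succeeds or the total budget is spent.

    `operation` receives the timeout it may use for this attempt, capped by
    the remaining budget so retries never overrun the caller's deadline.
    """
    deadline = clock() + budget_s
    last_error: Optional[Exception] = None
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("timeout budget exhausted") from last_error
        try:
            return operation(min(per_attempt_timeout_s, remaining))
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Back off, but never sleep past the deadline.
            sleep(max(0.0, min(backoff_s, deadline - clock())))
```

A version of this loop that reuses the full per-attempt timeout on every retry can pass on a fast network and still overrun its deadline once each call takes 100 ms longer.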

## Add packet loss

Use loss to check retry and idempotency behavior:

```json
{
  "version": 1,
  "lo": {
    "default": {
      "delay": {
        "time_ms": 100,
        "jitter_ms": 25
      },
      "loss": {
        "kind": "random",
        "percent": 1
      }
    }
  }
}
```

Keep the first loss model small. A 1% loss rate is often enough to expose broken retries without making every run fail for the same obvious reason.
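Loss often shows up as a request that was processed but whose response never arrived, so the client retries work that already happened. The sketch below shows the idempotency pattern this fault is meant to check. `PaymentStore` and `charge` are hypothetical names, and the in-memory dictionary stands in for whatever durable store a real service would use.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentStore:
    """In-memory stand-in for a payment service that deduplicates by key."""

    charges: dict[str, int] = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount_cents: int) -> int:
        """Record the charge once; a retry with the same key is a no-op.

        Returns the amount recorded for the key, so a retried request gets
        the same answer as the original.
        """
        return self.charges.setdefault(idempotency_key, amount_cents)

    def total_charged(self) -> int:
        """Sum of all recorded charges, in cents."""
        return sum(self.charges.values())
```

If the workload calls `charge` twice with the same key because the first response was lost, the total stays the same. A workload that generates a fresh key on every retry would double-charge, which is the kind of bug a 1% loss model can surface.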

## Target a service boundary

When a fault should only affect one service path, use a rule:

```json
{
  "version": 1,
  "lo": {
    "rules": [
      {
        "match": {
          "dst": "127.0.0.1",
          "proto": "tcp",
          "dport": 8500
        },
        "delay": {
          "time_ms": 300,
          "jitter_ms": 50
        },
        "loss": {
          "kind": "random",
          "percent": 2
        }
      }
    ]
  }
}
```

Rules are evaluated in order. Use them when you want one dependency to be slow while the rest of the workflow stays normal.

## Useful fault models

These are the fault model families that tend to find useful reliability bugs:

- **Network**: latency, jitter, loss, corruption, duplication, reordering, rate limits, and blocked paths.
- **Storage**: slow reads or writes, failed writes, full disks, partial persistence, and delayed flushes.
- **Service shutdown and restart**: a worker, API server, queue consumer, or dependency exits and comes back.
- **Resource pressure**: CPU contention, memory pressure, file descriptor exhaustion, and low worker capacity.
- **Clock and scheduler pressure**: timer jumps, delayed jobs, lease expiry, and long gaps between scheduled work.

Network faults are available today. The other families are coming soon.

## Common network fault types

Use these when they match a real user-facing risk:

- **Latency**: requests take longer than expected.
- **Jitter**: latency varies from one request to the next.
- **Loss**: some packets never arrive.
- **Rate limits**: throughput is lower than usual.
- **Duplication**: the same packet appears more than once.
- **Reordering**: packets arrive out of order.
- **Blocked path**: communication is blocked for a dependency or path.

Do not combine every fault at once. A focused fault gives you a failure you can understand and fix.

## Good vs bad fault models

Good:

```text
Add 100ms latency and 1% loss to the checkout dependency.
```

Bad:

```text
Break all network traffic in every run.
```

The good fault tests a realistic degraded path. The bad fault may only prove that the workflow cannot run without any communication.

Good:

```text
Start with mild delay, then create a separate stronger fault after the workload passes.
```

Bad:

```text
Use one huge fault model that mixes latency, loss, rate limits, duplication, and blocked paths.
```

Small fault models make debugging faster. Stronger models are useful after the simple cases pass.
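One way to keep the mild and the stronger models separate is to generate one small file per severity level. The sketch below uses only the delay fields shown earlier on this page. The function names, the file naming scheme, and the 25% jitter ratio are hypothetical choices, not conventions of the tool.

```python
import json
from pathlib import Path


def delay_model(time_ms: int, jitter_ms: int) -> dict:
    """Return a delay-only fault model in the shape used on this page."""
    return {
        "version": 1,
        "lo": {"default": {"delay": {"time_ms": time_ms, "jitter_ms": jitter_ms}}},
    }


def severity_ladder(levels_ms: list[int], jitter_ratio: float = 0.25) -> dict[str, dict]:
    """Map a behavior-based file name to one model per delay level, mildest first."""
    ladder = {}
    for time_ms in sorted(levels_ms):
        name = f"delay-{time_ms}ms.json"
        ladder[name] = delay_model(time_ms, round(time_ms * jitter_ratio))
    return ladder


def write_ladder(ladder: dict[str, dict], directory: Path) -> None:
    """Write each model to its own file so each run uses exactly one severity."""
    directory.mkdir(parents=True, exist_ok=True)
    for name, model in ladder.items():
        (directory / name).write_text(json.dumps(model, indent=2) + "\n", encoding="utf-8")
```

Run the mildest file first and move up a level only after the workload passes, so each failure points at a single change in conditions.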

## Best practices

- Run a baseline first.
- Give each fault model one clear purpose.
- Start with realistic, survivable faults.
- Increase severity gradually.
- Target service boundaries when possible.
- Keep fault files small and named by behavior.
- Use the workload invariant, the property that must hold under any fault (for example, "no order is charged twice"), to decide whether the system behaved correctly.
- Inspect a failed run before changing the fault model.
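As an example of an invariant a checkout workload might assert after a faulted run, here is a minimal sketch. The function name and the two input lists are hypothetical; how your workload records confirmed and charged orders will differ.

```python
from collections import Counter


def check_checkout_invariant(confirmed_orders: list[str],
                             charged_orders: list[str]) -> list[str]:
    """Return a list of violations; an empty list means the invariant held.

    Invariant: every confirmed order was charged exactly once, and no
    order was charged without being confirmed.
    """
    charges = Counter(charged_orders)
    violations = []
    for order in confirmed_orders:
        if charges[order] != 1:
            violations.append(f"{order}: confirmed with {charges[order]} charges")
    for order in sorted(set(charged_orders) - set(confirmed_orders)):
        violations.append(f"{order}: charged but never confirmed")
    return violations
```

Returning the violations instead of a bare boolean leaves evidence in the run output, which helps when you inspect a failure before touching the fault model.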

The best fault model is not the most dramatic one. It is the one that finds a bug your users could hit and leaves enough evidence to fix it.