Workers IO workers.io
← blog

september 2, 2025 · memo

Software Correctness on Autopilot

How formal and semi-formal methods, combined with AI agents, deliver mathematical guarantees for software correctness — making verification automatic, exhaustive, and built into the workflow.

by Chaitanya · founder, workers io


What we do

Workers IO agents help teams build critical software with significantly higher correctness. By applying formal and semi-formal methods, we rigorously verify code behavior — making our approach ideal for designing and delivering reliable, safe, and high-integrity systems.

These approaches have ensured reliability in safety-critical systems for decades and are credited by large cloud providers for their operational stability. Unlike general software testing, our use of mathematical proofs addresses the most critical behaviors with a degree of confidence that testing alone cannot provide.

Why software fails

Software failures typically result from the accumulation of minor, tolerated imperfections rather than a single dramatic error. These issues often converge at critical moments.

Tolerated failure

Minor defects are often tolerated because they seem insignificant on their own. Over time and across components, these risks accumulate and can result in major outages.

Creation is easy; correctness is hard

LLMs have reduced the cost of code generation to near zero, enabling rapid development. However, ensuring correctness remains expensive. We are entering a world of running software that no human has read. Critical systems in aviation, finance, and healthcare demand software that is reliable under pressure and at scale. The main challenge is no longer writing code, but defining and proving the essential properties that must always hold.

No natural constraints

Physical systems are constrained by the laws of physics, while software operates within an effectively unlimited state space. To maintain reliability, software behavior must be governed by explicit invariants, contracts, and types. Without these controls, systems can evolve beyond our ability to understand and manage them — a problem that AI-generated code dramatically accelerates.

LLMs are reshaping software itself

LLMs are not just a faster way to write software. They are reshaping the economics and architecture of software itself. When rewriting and understanding large codebases becomes cheap, the incentive to rely on deep dependency trees collapses. Legacy code loses its moat — entire codebases can be studied, rewritten, and upgraded in situations where humans would have given up long ago. The same shift is visible at the product layer — teams are embedding AI at progressively deeper levels in their SaaS, moving from simple input boxes toward fully generated, per-customer applications.

LLMs are especially effective at code translation compared to de-novo generation, because the original codebase acts as a highly detailed prompt and provides a concrete reference to test against. This is already visible in the momentum behind porting C to Rust and upgrading legacy COBOL systems. It is likely that large fractions of all software ever written will be rewritten many times over.

As the human factor in programming diminishes, selection pressure shifts toward languages that favor formal verifiability, type safety, and performance — properties often harder for humans to learn but well-suited to LLMs, which thrive in environments with verifiable rewards. Even Rust, despite its impressive type system, was designed around human cognitive constraints. What an optimal target language for LLM-generated, formally verified code looks like remains an open and exciting question.

In this landscape, formal verification is not optional — it is essential. The catch is that unknown unknowns remain unknown, and verification can only prove what you specify. But the alternative — unverified AI-generated code at scale — is far worse.

Cost of software failure

Software failure is a business and human-life risk, not just a technical risk. In July 2024, CrowdStrike’s update caused multi-day global disruptions; one airline alone reported $380M in lost revenue over five days, and the human lives lost during that incident are non-zero.

Such failures will become more common with LLMs writing most of the code without strict guardrails that can be verified with a high degree of accuracy.

Calculate your failure cost

If you have not estimated each category, you may be underestimating the total impact.

Lost revenue

$330K–$5M+ per hour for large enterprises and regulated verticals.

Cost to fix

Engineer days diverted from roadmap work — whether priced as comp or revenue-per-engineer.

Regulatory & legal

Fines, SLA penalties, and settlements routinely reach eight figures annually for Global 2000 firms.

Reputation & trust

Lower LTV, churn, brand repair spend often exceeding $10M/year. Big customers pause or leave.

Organizational disruption

War rooms, context switching, and attrition — especially among your best engineers.

Investor volatilityPublic companies often experience immediate drawdowns of 1–9%.

Verification vs traditional testing

Traditional Testing

  • Covers only scenarios you think to write
  • Samples a small subset of the state space
  • Blind spots where bugs hide
  • Confidence in some cases

Software Verification

  • Explores all possible execution paths
  • Exhaustive against defined invariants
  • Catches subtle logic and concurrency bugs
  • Strong assurance across all specified cases

Verification takes a fundamentally different approach. Instead of sampling a few paths, it mathematically explores all possible execution paths against the rules and invariants you define. You don’t just gain confidence — you obtain a strong assurance, or a precise counterexample showing exactly where and how it fails.

Delivering correctness

We capture required software behavior as executable specifications at different layers of your codebase.

System Correctness

We build a model of your system and use a model checker to explore it. You either get confirmation that your defined invariants hold, or you’re shown counterexample traces that reveal exactly how it fails. These traces make flaws obvious early and keep design discussions grounded in concrete reality.

Code Correctness

We apply formal proofs directly to the code itself. Every change is checked automatically. If a rule is broken, you receive a precise failure trace and a minimal patch to fix it. If it holds, you get a mathematical proof of correctness for the specified properties.

Maintaining Correctness

Specifications live alongside your code and evolve naturally with your system. They gate merges and catch regressions. In production, we continuously monitor for drift by comparing execution logs against the specifications — giving you immediate visibility when reality diverges from intent.

Open challenges

We believe in being transparent about the limits of what verification can do today.

Specifications don’t capture everything.

You are only proving what you specify. A program proven correct in some aspects should not be assumed correct in all aspects. This is why we combine verification with monitoring and testing — defense in depth.

Proofs can be fragile.

When code changes, proofs can break. A small change to a spec can sometimes require significant rework. AI agents are making this dramatically more tractable — they have infinite patience for re-proving and can be parallelized — but it remains an active area of improvement.

The toolchain is still maturing.

There is a meaningful gap between academic verification tools and mainstream programming languages. Domain-specific solutions exist today, but general-purpose verified programming at scale is still emerging.

Where we are going

Software engineering is changing rapidly. We believe there is a future in which the primary artifact is not code but specifications — succinct, version-controlled statements of intent that capture what the software must do, independent of how it is implemented. When that happens, software correctness and verification will become the central discipline.

We are in a phase where the cost of software creation is approaching zero, and the next challenge is ensuring software correctness. We are building the tools necessary for agents to generate code and verify the desired behavior with high rigor. When we achieve this, we will be able to deliver software more quickly and with greater impact, solving the tough problems that matter to people’s lives.

Frequently asked questions

Aren’t these techniques only meant for critical systems?

No. The benefits extend to general software correctness, performance, and reliability. AWS teams use formal methods to accelerate delivery and catch subtle bugs in ordinary cloud services.

What benefits can I expect?

Significantly fewer failures, backed by mathematical assurances for specified properties. Additional benefits include lower resource usage, enhanced performance, and faster deployment cycles.

Do I need to maintain specifications?

No. We manage specification writing, ongoing maintenance as your codebase evolves, and enforcement during both design and production.

How many engineering hours are required?

None. Our AI correctness workers operate fully autonomously:

Generate Specs
Monitor Changes
Verify Software
Suggest Fixes
Do I need existing testing infrastructure?

No. We manage setup, and the system is self-sustaining and maintenance-free.


more posts