Why Specs Ship with Tests: The Rejection Log as Audit Trail
Most software has a trust problem. You deploy code to production. Days later, something breaks in ways your tests didn’t catch. You add more tests. The coverage gets better. But the gap never closes. There’s always a blind spot between what your tests check and what the code actually does in the world.
Specs solve this differently. They don’t try to close the gap. They flip the model.
Instead of “write code first, test second,” specs say: “test cases ARE the spec.” The tests don’t verify the code. The tests define what the code must do. When a spec runs, the success rate is public, deterministic, and impossible to fake. That is the ground truth.
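A toy sketch of what that means in practice, with invented names: the test file is the contract, and whatever implementation eventually replaces the stub has to satisfy it.

```python
# Toy sketch of "the tests ARE the spec": the assertion fixes what any
# implementation must do. extract_signer is a placeholder stub standing in
# for whatever code eventually gets written against this spec.
def extract_signer(pdf_path: str) -> str:
    raise NotImplementedError("the spec defines the behaviour; the implementation comes later")

def test_extracts_signer_from_signed_pdf():
    # This assertion is the contract: correct code returns the signer's name.
    assert extract_signer("fixtures/signed_contract.pdf") == "A. Example"
```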
This is why every spec in SpecMarket ships with a test suite. It is not an afterthought. It is not “best practice.” It is foundational.
The problem with SaaS trust
When you subscribe to DocuSign, you trust that:
- Their e-signature verification works
- Your documents are stored securely
- The API will be up tomorrow
- They won’t change pricing or shut down
You don’t verify any of this yourself. You rely on DocuSign’s reputation, their compliance certifications, their track record. If something goes wrong, they fix it on their schedule. You wait.
The trust model is relational. It depends on the vendor’s honesty, competence, and continued existence. It also depends on alignment — DocuSign optimizes for SaaS retention, which means they don’t optimize for your specific use case. They build features for the average customer. You are not average.
This creates two failure modes:
- Hidden failures: Bugs that only affect your specific workflow
- Trust drift: Vendors change features, pricing, or shut down entirely
Both are invisible until they hurt.
How specs invert the model
A spec is a Docker container with four parts (a minimal sketch follows this list):
- Functional specification: What the code must do
- Test cases: Machine-evaluable success criteria
- Tech stack requirements: Dependencies, versions, constraints
- Code patterns: How the implementation should be structured
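For concreteness, here is a minimal sketch of what such a container’s contents might look like, expressed as a Python structure. The file names and fields (`spec.md`, `tests/`, `patterns.md`, and so on) are invented for illustration; they are not SpecMarket’s actual schema.

```python
# Hypothetical layout of a spec container, expressed as a Python dict.
# All names and fields are illustrative, not SpecMarket's actual format.
spec_manifest = {
    "name": "esignature-extraction",
    "specification": "spec.md",           # functional specification: what the code must do
    "tests": ["tests/test_extract.py"],   # machine-evaluable success criteria
    "stack": {                            # tech stack requirements
        "python": ">=3.11",
        "dependencies": ["pypdf>=4.0"],
    },
    "patterns": "patterns.md",            # how the implementation should be structured
}
```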
When you run a spec, you get four things (a sample result is sketched after the list):
- Binary outcome: Pass or fail (on every run, on every machine)
- Success rate: Percentage of runs that pass
- Failure log: Every test that failed, with the input and output
- Deterministic proof: The same spec and inputs produce the same test output, so the result is reproducible evidence of what the spec does
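And a minimal sketch of what one run’s output could look like, again with invented field names:

```python
# Hypothetical result of a single spec run. Field names are illustrative.
run_result = {
    "passed": False,          # binary outcome for this run
    "success_rate": 0.97,     # share of test cases that passed
    "failures": [             # failure log: every failed case, with input and output
        {
            "case": "rotated_signature_scan",
            "input": "fixtures/rotated.pdf",
            "expected": "A. Example",
            "actual": None,
        },
    ],
    "digest": "sha256:<hash of spec + inputs + outputs>",  # lets anyone re-run and compare
}
```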
This is not reputation-based trust. This is verification-based trust.
If you run a DocuSign spec 100 times and it works 97 times, you now know:
- The spec solves e-signatures with 97% reliability in your environment
- The 3 failures are documented — you can see the inputs that cause failure
- You don’t need DocuSign to trust the result — the test output proves it
You can audit the spec. You can fork it. You can run it offline. You can modify it. The tests remain the contract.
Why test-first design matters at scale
Traditional testing is detective work. You write code, then you write tests to find bugs. You run the tests. Some pass, some fail. You fix the code. You run again. Repeat until coverage reaches some magic number (80%? 95%?).
The assumption is: once you hit N% test coverage, the code is “good enough.”
But “enough” is context-dependent. A scheduling tool failing 1% of the time is acceptable. A medical device failing 1% of the time is catastrophic. A payment system moving $1,000 in transactions per second that fails 1% of the time loses roughly $864,000 a day.
Test-first specs flip the burden. Instead of asking “how much testing is enough?”, you ask “what is the failure rate for my specific use case?”
A spec that runs e-signature extraction on 1,000 documents and succeeds 994 times tells you:
- Failure rate is 0.6%
- Those 6 failures are documented (PDF type, signature format, etc.)
- You know if 0.6% is acceptable for your workflow
This is not theoretical confidence. This is measured evidence.
When you chain specs together (spec A extracts signatures, spec B validates document state, spec C records the result in your database), each link in the chain has a documented success rate. The system-level reliability emerges from the mathematics: if each step is 99% reliable and the steps fail independently, the full pipeline is 99% × 99% × 99% ≈ 97% reliable. You can calculate it.
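That multiplication is easy to automate. A quick sketch, assuming the specs fail independently of one another and using made-up rates:

```python
def pipeline_reliability(success_rates):
    """Multiply per-spec success rates to estimate end-to-end reliability,
    assuming each spec fails independently of the others."""
    result = 1.0
    for rate in success_rates:
        result *= rate
    return result

# Three chained specs at 99% each -> roughly 0.9703, i.e. about 97% for the full pipeline.
print(pipeline_reliability([0.99, 0.99, 0.99]))
```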
SaaS vendors don’t publish their failure rates. They publish uptime (99.9% uptime is “the service is available,” not “the feature works correctly for your use case”). The actual success rate for your specific workflow is invisible.
The rejection log is the audit trail
Here is a subtle but critical insight: rejection is transparency.
In traditional software, failed tests are noise. “Oh, the tests failed in CI. That developer will fix it.” Failures are private, internal, a sign that someone made a mistake.
In spec-based systems, failed test runs are public data. When Agent A runs a spec and 8% of cases fail, that 8% is recorded. Other agents can see it. You can see it. The spec creator can see it.
This creates an unexpected benefit: specs that fail transparently are more trustworthy than systems that claim perfection but hide their gaps.
A spec that says “97% success rate, and here are the 3% failure modes” is more useful than a SaaS tool that says “99.99% uptime” while your actual integration succeeds 92% of the time.
The rejection log is the audit trail. Every failure is documented. You don’t need to trust the vendor’s explanation — you have evidence.
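One way to put that evidence to work, sketched in Python with a hypothetical log format: group the documented failures by cause so the dominant failure modes are visible at a glance.

```python
from collections import Counter

def failure_modes(rejection_log):
    """Count documented failures by cause. The 'cause' field is a hypothetical log attribute."""
    return Counter(entry["cause"] for entry in rejection_log)

log = [
    {"case": "doc-014", "cause": "scanned_pdf"},
    {"case": "doc-231", "cause": "scanned_pdf"},
    {"case": "doc-502", "cause": "unusual_signature_format"},
]
print(failure_modes(log))
# Counter({'scanned_pdf': 2, 'unusual_signature_format': 1})
```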
Why this changes autonomy
Here is where it gets really interesting: test-first specs enable autonomous agent execution without human bottlenecks.
Traditionally, when an agent does work, you verify it. The agent extracts data, you review it. The agent makes a decision, you approve it. The agent runs code, you test it. Every action requires human review because you don’t trust the agent’s self-assessment.
Test-first specs invert this. The agent runs the spec. The tests run. The success rate is output. You don’t review the agent’s judgment — you review the test results. And test results are deterministic.
This is why agent autonomy becomes safe with specs:
- Agent does work without approval
- Tests verify the output
- Success rate is public
- You can audit the failures
No human in the loop slowing down the process. No trust required. Just measurable evidence.
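A sketch of what that gate could look like. The threshold and names are hypothetical; the point is that acceptance depends on measured test results, not on the agent’s self-assessment.

```python
# Hypothetical acceptance gate: the threshold is whatever failure rate your workflow tolerates.
ACCEPTANCE_THRESHOLD = 0.96

def accept_or_escalate(run_result):
    """Accept automatically when the measured success rate clears the threshold;
    otherwise hand the documented failures to a human for review."""
    if run_result["success_rate"] >= ACCEPTANCE_THRESHOLD:
        return "accepted"
    return f"escalated: {len(run_result['failures'])} failing cases need review"

print(accept_or_escalate({"success_rate": 0.97, "failures": []}))         # accepted
print(accept_or_escalate({"success_rate": 0.91, "failures": [1, 2, 3]}))  # escalated: 3 failing cases need review
```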
The spectrum of specs
Specs on SpecMarket span different reliability tiers:
- Emerging specs (beta): 40-60% success rate. Useful for exploratory work, but not production.
- Stable specs: 85-95% success rate. Good for integration with manual review of edge cases.
- Hardened specs: 96%+ success rate, documented failure modes. Production-ready for most use cases.
- Specialized specs: Domain-specific solutions (audit extraction, form parsing, etc.) with 99%+ success rate in their niche.
The spec marketplace sorts by success rate and user volume. A spec with 1,000 runs at 97% success is more credible than a spec with 50 runs at 100% success. The numbers are transparent.
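One way to make that comparison precise is a confidence-adjusted score such as the Wilson lower bound, sketched below. SpecMarket’s actual ranking method isn’t described here, so treat this as an illustration of why run volume matters.

```python
import math

def wilson_lower_bound(successes, runs, z=1.96):
    """Lower bound of the 95% Wilson score interval for a success proportion."""
    if runs == 0:
        return 0.0
    p = successes / runs
    denom = 1 + z**2 / runs
    centre = p + z**2 / (2 * runs)
    margin = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return (centre - margin) / denom

print(wilson_lower_bound(970, 1000))  # ~0.958 for 1,000 runs at 97%
print(wilson_lower_bound(50, 50))     # ~0.929 for 50 runs at 100%
```

Under a score like this, the heavily exercised 97% spec outranks the lightly exercised 100% one, which matches the intuition above.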
SaaS vendors don’t publish this data because they can’t — the definition of “success” is too context-dependent. A CRM success rate depends on your data model, your workflows, your integrations. Each customer sees different reliability.
Specs eliminate this ambiguity. The spec author defines what success means (passing test cases). Users measure it. The community votes with their confidence (high-success specs get adopted, low-success specs get forked and improved).
What this means for builders
If you are building a spec, ship it with tests. The tests are not validation. They are your spec’s identity. They are what define success.
If you are choosing a spec, check the success rate. Not the vendor’s promises. Not the feature list. The actual measured reliability across real runs.
If you are building against specs (using them in your workflow), you can compose them without fear of hidden failures. Each component has a documented reliability. The system-level risk is calculable.
This is the shift: from trust-based (believe the vendor) to measurement-based (verify the output).
Specs ship with tests because tests are not a feature of specs. Tests are what make specs specs.