Failure Discovery Action Plan

Status: Design sketch

Calciforge already has a lot of tests. Recent staging bugs show that count is not the same thing as confidence. The recurring problem is narrower: tests often cover one observed shape, while production breaks on another valid shape emitted by a real agent, gateway, package manager, filesystem, service manager, or channel.

This page records the lesson and turns it into work. The aim is not only more regression tests. Calciforge needs aggression tests: checks that deliberately break protocols, runtime state, config paths, and trust assumptions in ways we can already foresee.

Post-mortem: Helicone Streaming Response Failure

What happened

OpenClaw sent a normal chat-completions request through Calciforge with stream=true and tools enabled. Helicone returned a valid text/event-stream response. Calciforge treated the upstream body as a single JSON chat-completion object and failed to decode it.

From the user’s view, the local OpenClaw agent timed out, retried, and then looked wedged. The services were mostly alive; the contract between two live components was wrong.

Why the tests missed it

Fix

Calciforge now accepts upstream text/event-stream chat-completion responses from Helicone and folds them into the existing internal ChatCompletionResponse type. Regression tests cover streamed content and streamed tool-call argument chunks while preserving stream=true in the outbound request.

This is a compatibility fix, not the final streaming design. Today the adapter still aggregates the upstream stream before Calciforge emits its response. True token-through streaming needs a wider gateway trait and handler change.

Pattern: The Bugs We Keep Finding Late

Recent failures tend to fall into a few buckets:

Aggression Testing

Boundary-contract rule

Every surface that accepts data, bytes, process output, filesystem state, time, or network behavior Calciforge does not control must declare two contracts:

This is deliberately not a test of a single client’s behavior. The unit under test is the boundary contract. A test that only proves one mocked client response still works is not enough for a boundary.

Bounded and unbounded behavior

For each boundary, classify behavior before writing tests:

The first category gets table-driven regressions. The second gets fuzzing and chaos. The third gets golden fixtures and structured generators. The fourth gets property tests, differential checks, and integration simulators.

Two required test tiers

Every boundary must eventually have both tiers:

  1. Invalid-data containment tests. Feed malformed or adversarial inputs and assert the system remains bounded: no panic, no hang, no secret exposure, no durable corruption, no authorization bypass, clear failure classification, and bounded resource use.
  2. Valid-data correctness tests. Generate or replay valid inputs with many legal shapes and assert the normalized user-visible result is correct. These tests should use protocol-level or domain-level oracles, not private helper state.

Property tests such as Hegel/proptest are one layer. They do not replace raw byte fuzzing, real fixture corpora, mock upstream servers, subprocess simulators, filesystem chaos, or deterministic and fault-injected integration tests. Use Hegel where the input space is typed and the invariant is crisp; use cargo-fuzz/arbitrary where bytes cross a parser boundary; use simulators where ordering, timing, retries, and partial failure matter.

Property quality is the gating issue. A generated test only earns its keep when the generator covers the boundary’s real input grammar and the oracle checks the boundary’s contract instead of a private implementation detail. Every boundary test should state which layer it belongs to:

For each external boundary, the invalid generator and valid generator are different artifacts. Invalid-data tests should aggressively cover nonsense and near-miss inputs while proving containment. Valid-data tests should generate the entire legal shape space and prove the user-visible result is correct.

Integration boundary inventory

Surface Bounded behavior Unbounded behavior Tier 1 invalid containment Tier 2 valid correctness
OpenAI-compatible model gateway and provider adapters Request schema, auth headers, model selectors, content type, provider status class SSE frame boundaries, extra fields, provider-specific error bodies, huge/non-UTF8 bodies, transport timing Fuzz JSON/SSE parsers, provider error envelopes, headers, timeout/read errors Generate valid JSON/SSE/tool-call streams, replay provider corpora, assert normalized ChatCompletionResponse and failure classification
Agent adapters: OpenClaw, ZeroClaw, IronClaw, Hermes, OpenAI-compatible HTTP Callback tokens, request IDs, session keys, attachment limits, response status Duplicate/stale callbacks, partial responses, callback races, malformed attachments, HTML error pages Generate malformed callbacks and agent responses; assert reject/drop without waking wrong request Simulate valid callbacks and responses across ordering/latency variants; assert one reply maps to the correct pending request
Process adapters: CLI, artifact CLI, Codex/Claude/Kimi/Dirac, ACP/ACPX Command path, argv/env templates, stdin protocol, timeout budget, artifact root Non-UTF8 stdout/stderr, broken stdin, hung child, partial JSON-RPC frames, symlink artifact trees, huge outputs PTY/subprocess simulator with random stdout/stderr/exit/timing; artifact filesystem fuzz Valid protocol frame generators and golden CLI transcript corpora; assert normalized agent output/artifacts
Channel adapters: Telegram, Matrix, Signal, WhatsApp, SMS/Linq, mock Sender identity, auth/HMAC, event type, message text, group/reply target, timestamp Duplicate/out-of-order events, missing reply target, non-text attachments, confusable IDs, provider rate-limit bodies Generate malformed webhooks/updates/events and normalized ChannelMessage variants; assert unauthorized or unusable input drops safely Golden corpora and structured generators for valid channel events; assert correct identity, routing, reply target, command state, and outbound response
Config, identity, auth, routing, model selectors TOML/JSON schema, unique IDs, alias format, route graph, allowlists, context-window declarations Duplicate/cyclic aliases, numeric overflows, unknown legacy fields, unsafe globs, huge configs Fuzz TOML/JSON loaders and validators; assert typed errors and no partial unsafe config Generate valid graph variations; assert same auth/routing decisions across aliases, shortcuts, roles, alloys, cascades, and dispatchers
Security proxy, adversary detector, scanner policy URL policy, secret reference grammar, scanner result schema, upstream method/header/body contracts Header smuggling, compressed/chunked/binary bodies, redirects, hostile HTML/JSON, malformed Starlark returns Fuzz URL/header/body substitution, response scanning, policy loaders; assert block/allow decisions fail closed where required Replay valid browser/provider traffic through mock upstreams; assert secrets substitute only for allowed destinations and blocked content never reaches model context
Secrets: fnox, MCP server, paste server, metadata, .env ingestion Secret names, destination metadata, token expiry, Origin/Referer policy, MCP params Missing/hung/non-UTF8 fnox, replayed paste tokens, malformed .env, racey submissions Fake fnox process and HTTP fuzz for paste/MCP; assert no secret leak and correct auth failure Generate valid secret refs, metadata, and .env entries; assert correct vault operation and destination policy
Filesystem and persistence Config paths, artifact roots, context DB schema, cache/log locations, package layout Symlinks, permission errors, missing dirs, corrupt SQLite, concurrent writes, stale service files Filesystem simulator/tempdir chaos; corrupt DB/config files; assert no data loss outside owned paths Valid path/layout generators for source/Homebrew/Docker/user/system installs; assert doctor/runtime use the same files
Installer, doctor, service managers, local model lifecycle, hooks OS/service template schema, SSH target, remote config path, hook env vars, switch timeout Stale launchd/systemd units, missing binaries, SSH partial failures, hook hangs, concurrent switch storms Mock SSH/service/process faults; assert rollback or explicit degraded result Generate valid install/update layouts and hook envs; assert doctor checks the same path the user action will use
Host-agent RPC, approval, ZFS/git/pct/systemd adapters mTLS identity, operation schema, approval token TTL, dataset/snapshot naming Spoofed webhook, replayed approval, path/shell metacharacters, RPC JSON drift, adapter rate-limit variants Generate malformed RPC/approval payloads; assert auth rejection and no host mutation Valid operation sequence generators and mock host adapters; assert approved operations mutate only intended resources
Clashd/domain lists/policy hooks Hook payload schema, domain grammar, policy result shape, remote list format Huge lists, redirects, punycode/wildcards, malformed hooks, policy recursion/timeout Fuzz domain normalization and policy payloads; assert deny/fail-closed rules hold Replay valid hook/domain-list corpora; assert deterministic policy decisions
Time, concurrency, retries, and background state TTLs, retry budgets, progress cadence, request correlation, cancellation semantics Clock jumps, duplicate delivery, retry storms, partial cancellation, stale numbered-choice state Loom/deterministic simulation/fault injection for shared state and retry paths Valid operation sequences with randomized timing; assert exactly-once or at-most-once semantics where promised

1. Scenario catalog before broad test growth

Keep a small checked-in catalog of high-risk product scenarios. Each scenario must name:

Examples:

Every new adapter, provider, channel, or installer path should add or update at least one scenario.

The first catalog lives at tests/scenarios/high-risk-scenarios.json and is validated by scripts/check-scenarios.py in CI. It is intentionally not a marketing roadmap. It is a list of assumptions we expect future tests and live smokes to attack.

The integration-boundary registry lives at tests/boundaries/integration-surfaces.json and is validated by scripts/check-boundary-surfaces.py in CI. New files under the adapter, channel, proxy, install, security, secret, host-agent, and policy boundary directories must be registered there with:

This makes adding a new integration surface a visible test-design act instead of an implicit code-only change.

2. Contract tests at every external boundary

For each provider adapter and first-class agent adapter, keep tests that use wire-level fixtures from real services:

Mocks should imitate real captures, not idealized structs.

3. Differential smoke tests

For release candidates, send the same small prompt through:

The assertion is not that latency or wording matches. The assertion is that failures classify correctly: provider failure, Calciforge failure, agent failure, or channel failure.

4. Property, fuzz, and simulation tests where boundaries make decisions

Use property tests for typed invariants:

Use fuzzing where malformed bytes can cross a trust boundary:

The first cargo-fuzz harnesses live in fuzz/ and are intentionally scoped to library surfaces that can already be linked by libFuzzer:

scripts/boundary-fuzz.sh smoke runs short local fuzz bursts, and scripts/boundary-fuzz.sh nightly runs a longer fuzz sweep from the scheduled staging workflow. Calciforge runtime adapters are still mostly inside the binary crate, so putting their exact parsers under cargo-fuzz will require extracting small boundary modules into linkable library APIs instead of only testing them through unit-test modules.

For deeper exploration outside PR CI, use scripts/boundary-explore-long.sh one-hour all or scope it to a boundary such as gateway, agents, security, secrets, clashd, or install. The script cycles high-case property tests and fuzz targets until the wall-clock budget expires, writing logs under boundary-artifacts/. A one-hour run is the minimum useful local sweep; the same runner is intended to support 24-72 hour campaigns after more boundary generators are added.

Patterns to copy from mature property/fuzz ecosystems:

Use simulation or chaos where the boundary is temporal or stateful:

These tests should assert invariants, not snapshots. Examples:

5. Mutation tests on the small set of critical modules

Run mutation tests selectively. Whole-workspace mutation testing is too slow and too noisy right now.

Start with:

If a mutation survives in one of these modules, either improve the test or decide that the branch is dead code and remove it.

6. Doctor must test the path users actually take

calciforge doctor should not stop at “port is open” or “config parses.” For first-class support, it should execute the same high-level path the user will use:

When a check would be expensive, doctor should mark it as a skipped live check with the exact command needed to run it.

7. Read the upstream docs when behavior depends on them

Before shipping an adapter or changing a protocol path, capture the upstream contract in the PR:

This is not a paperwork exercise. It is how we avoid learning basic protocol facts from a user’s failed test message.

Near-term Work

  1. Add real-shape fixtures for streaming chat completions and tool calls.
  2. Create a checked-in boundary corpus for provider, channel, agent callback, config, filesystem, and subprocess fixtures.
  3. Add per-boundary two-tier tests: invalid containment first, valid generalized correctness second.
  4. Extend provider-adapter tests to cover non-retryable failures and alias resolution through the same code used at runtime.
  5. Add a doctor --live path for first-class agent smoke tests.
  6. Run a tiny mutation pass on selector resolution and security-proxy policy.
  7. Add a release-candidate checklist item: one manually observed failure must become either an automated regression test or a documented impossible-to-test gap before the PR merges.
  8. Tie new aggression tests back to an explicit contract, scenario, or invariant from the AI-assisted engineering discipline page, so test growth stays tied to the promises Calciforge actually makes.

Test Quality Standard

A useful regression test should answer three questions:

If the answer to the first question is “no,” the test may still be useful, but it is not a regression test. Label it honestly.

For AI-assisted changes, the same rule applies one level higher: a generated test only earns trust when a human can name the contract it protects and the failure it would catch. Count fewer, sharper tests before counting files or lines.