Status: Design sketch
Calciforge already has a lot of tests. Recent staging bugs show that count is not the same thing as confidence. The recurring problem is narrower: tests often cover one observed shape, while production breaks on another valid shape emitted by a real agent, gateway, package manager, filesystem, service manager, or channel.
This page records the lesson and turns it into work. The aim is not only more regression tests. Calciforge needs aggression tests: checks that deliberately break protocols, runtime state, config paths, and trust assumptions in ways we can already foresee.
OpenClaw sent a normal chat-completions request through Calciforge with
stream=true and tools enabled. Helicone returned a valid
text/event-stream response. Calciforge treated the upstream body as a single
JSON chat-completion object and failed to decode it.
From the user’s view, the local OpenClaw agent timed out, retried, and then looked wedged. The services were mostly alive; the contract between two live components was wrong.
stream=true.Calciforge now accepts upstream text/event-stream chat-completion responses
from Helicone and folds them into the existing internal ChatCompletionResponse
type. Regression tests cover streamed content and streamed tool-call argument
chunks while preserving stream=true in the outbound request.
This is a compatibility fix, not the final streaming design. Today the adapter still aggregates the upstream stream before Calciforge emits its response. True token-through streaming needs a wider gateway trait and handler change.
Recent failures tend to fall into a few buckets:
doctor can pass while the next real user action
fails because the check did not exercise the same path.Every surface that accepts data, bytes, process output, filesystem state, time, or network behavior Calciforge does not control must declare two contracts:
This is deliberately not a test of a single client’s behavior. The unit under test is the boundary contract. A test that only proves one mocked client response still works is not enough for a boundary.
For each boundary, classify behavior before writing tests:
The first category gets table-driven regressions. The second gets fuzzing and chaos. The third gets golden fixtures and structured generators. The fourth gets property tests, differential checks, and integration simulators.
Every boundary must eventually have both tiers:
Property tests such as Hegel/proptest are one layer. They do not replace raw
byte fuzzing, real fixture corpora, mock upstream servers, subprocess
simulators, filesystem chaos, or deterministic and fault-injected integration
tests. Use Hegel where the input space is typed and the invariant is crisp; use
cargo-fuzz/arbitrary where bytes cross a parser boundary; use simulators
where ordering, timing, retries, and partial failure matter.
Property quality is the gating issue. A generated test only earns its keep when the generator covers the boundary’s real input grammar and the oracle checks the boundary’s contract instead of a private implementation detail. Every boundary test should state which layer it belongs to:
arbitrary/libFuzzer-generated valid-ish domain
values. The oracle must check normalization and correctness across many valid
shapes, such as duplicate fields, legal extra fields, interleaved tool-call
chunks, alias graphs, or alternate channel event forms.For each external boundary, the invalid generator and valid generator are different artifacts. Invalid-data tests should aggressively cover nonsense and near-miss inputs while proving containment. Valid-data tests should generate the entire legal shape space and prove the user-visible result is correct.
| Surface | Bounded behavior | Unbounded behavior | Tier 1 invalid containment | Tier 2 valid correctness |
|---|---|---|---|---|
| OpenAI-compatible model gateway and provider adapters | Request schema, auth headers, model selectors, content type, provider status class | SSE frame boundaries, extra fields, provider-specific error bodies, huge/non-UTF8 bodies, transport timing | Fuzz JSON/SSE parsers, provider error envelopes, headers, timeout/read errors | Generate valid JSON/SSE/tool-call streams, replay provider corpora, assert normalized ChatCompletionResponse and failure classification |
| Agent adapters: OpenClaw, ZeroClaw, IronClaw, Hermes, OpenAI-compatible HTTP | Callback tokens, request IDs, session keys, attachment limits, response status | Duplicate/stale callbacks, partial responses, callback races, malformed attachments, HTML error pages | Generate malformed callbacks and agent responses; assert reject/drop without waking wrong request | Simulate valid callbacks and responses across ordering/latency variants; assert one reply maps to the correct pending request |
| Process adapters: CLI, artifact CLI, Codex/Claude/Kimi/Dirac, ACP/ACPX | Command path, argv/env templates, stdin protocol, timeout budget, artifact root | Non-UTF8 stdout/stderr, broken stdin, hung child, partial JSON-RPC frames, symlink artifact trees, huge outputs | PTY/subprocess simulator with random stdout/stderr/exit/timing; artifact filesystem fuzz | Valid protocol frame generators and golden CLI transcript corpora; assert normalized agent output/artifacts |
| Channel adapters: Telegram, Matrix, Signal, WhatsApp, SMS/Linq, mock | Sender identity, auth/HMAC, event type, message text, group/reply target, timestamp | Duplicate/out-of-order events, missing reply target, non-text attachments, confusable IDs, provider rate-limit bodies | Generate malformed webhooks/updates/events and normalized ChannelMessage variants; assert unauthorized or unusable input drops safely |
Golden corpora and structured generators for valid channel events; assert correct identity, routing, reply target, command state, and outbound response |
| Config, identity, auth, routing, model selectors | TOML/JSON schema, unique IDs, alias format, route graph, allowlists, context-window declarations | Duplicate/cyclic aliases, numeric overflows, unknown legacy fields, unsafe globs, huge configs | Fuzz TOML/JSON loaders and validators; assert typed errors and no partial unsafe config | Generate valid graph variations; assert same auth/routing decisions across aliases, shortcuts, roles, alloys, cascades, and dispatchers |
| Security proxy, adversary detector, scanner policy | URL policy, secret reference grammar, scanner result schema, upstream method/header/body contracts | Header smuggling, compressed/chunked/binary bodies, redirects, hostile HTML/JSON, malformed Starlark returns | Fuzz URL/header/body substitution, response scanning, policy loaders; assert block/allow decisions fail closed where required | Replay valid browser/provider traffic through mock upstreams; assert secrets substitute only for allowed destinations and blocked content never reaches model context |
Secrets: fnox, MCP server, paste server, metadata, .env ingestion |
Secret names, destination metadata, token expiry, Origin/Referer policy, MCP params | Missing/hung/non-UTF8 fnox, replayed paste tokens, malformed .env, racey submissions |
Fake fnox process and HTTP fuzz for paste/MCP; assert no secret leak and correct auth failure | Generate valid secret refs, metadata, and .env entries; assert correct vault operation and destination policy |
| Filesystem and persistence | Config paths, artifact roots, context DB schema, cache/log locations, package layout | Symlinks, permission errors, missing dirs, corrupt SQLite, concurrent writes, stale service files | Filesystem simulator/tempdir chaos; corrupt DB/config files; assert no data loss outside owned paths | Valid path/layout generators for source/Homebrew/Docker/user/system installs; assert doctor/runtime use the same files |
| Installer, doctor, service managers, local model lifecycle, hooks | OS/service template schema, SSH target, remote config path, hook env vars, switch timeout | Stale launchd/systemd units, missing binaries, SSH partial failures, hook hangs, concurrent switch storms | Mock SSH/service/process faults; assert rollback or explicit degraded result | Generate valid install/update layouts and hook envs; assert doctor checks the same path the user action will use |
| Host-agent RPC, approval, ZFS/git/pct/systemd adapters | mTLS identity, operation schema, approval token TTL, dataset/snapshot naming | Spoofed webhook, replayed approval, path/shell metacharacters, RPC JSON drift, adapter rate-limit variants | Generate malformed RPC/approval payloads; assert auth rejection and no host mutation | Valid operation sequence generators and mock host adapters; assert approved operations mutate only intended resources |
| Clashd/domain lists/policy hooks | Hook payload schema, domain grammar, policy result shape, remote list format | Huge lists, redirects, punycode/wildcards, malformed hooks, policy recursion/timeout | Fuzz domain normalization and policy payloads; assert deny/fail-closed rules hold | Replay valid hook/domain-list corpora; assert deterministic policy decisions |
| Time, concurrency, retries, and background state | TTLs, retry budgets, progress cadence, request correlation, cancellation semantics | Clock jumps, duplicate delivery, retry storms, partial cancellation, stale numbered-choice state | Loom/deterministic simulation/fault injection for shared state and retry paths | Valid operation sequences with randomized timing; assert exactly-once or at-most-once semantics where promised |
Keep a small checked-in catalog of high-risk product scenarios. Each scenario must name:
Examples:
stream=true with tool calls through the configured
model provider.”doctor validates.”Every new adapter, provider, channel, or installer path should add or update at least one scenario.
The first catalog lives at tests/scenarios/high-risk-scenarios.json and is
validated by scripts/check-scenarios.py in CI. It is intentionally not a
marketing roadmap. It is a list of assumptions we expect future tests and live
smokes to attack.
The integration-boundary registry lives at
tests/boundaries/integration-surfaces.json and is validated by
scripts/check-boundary-surfaces.py in CI. New files under the adapter,
channel, proxy, install, security, secret, host-agent, and policy boundary
directories must be registered there with:
missing or partial,This makes adding a new integration surface a visible test-design act instead of an implicit code-only change.
For each provider adapter and first-class agent adapter, keep tests that use wire-level fixtures from real services:
Mocks should imitate real captures, not idealized structs.
For release candidates, send the same small prompt through:
The assertion is not that latency or wording matches. The assertion is that failures classify correctly: provider failure, Calciforge failure, agent failure, or channel failure.
Use property tests for typed invariants:
Use fuzzing where malformed bytes can cross a trust boundary:
The first cargo-fuzz harnesses live in fuzz/ and are intentionally scoped to
library surfaces that can already be linked by libFuzzer:
security_substitution_bytes fuzzes arbitrary text crossing the secret
reference and placeholder substitution boundary.security_substitution_valid_refs uses structured fuzzing to generate valid
secret-reference documents and asserts exact rendered output.secret_metadata_destinations fuzzes destination-policy input normalization.clashd_domain_lists fuzzes domain-list parsing and matcher construction.scripts/boundary-fuzz.sh smoke runs short local fuzz bursts, and
scripts/boundary-fuzz.sh nightly runs a longer fuzz sweep from the scheduled
staging workflow. Calciforge runtime adapters are still mostly inside the binary
crate, so putting their exact parsers under cargo-fuzz will require extracting
small boundary modules into linkable library APIs instead of only testing them
through unit-test modules.
For deeper exploration outside PR CI, use
scripts/boundary-explore-long.sh one-hour all or scope it to a boundary such
as gateway, agents, security, secrets, clashd, or install. The
script cycles high-case property tests and fuzz targets until the wall-clock
budget expires, writing logs under boundary-artifacts/. A one-hour run is the
minimum useful local sweep; the same runner is intended to support 24-72 hour
campaigns after more boundary generators are added.
Patterns to copy from mature property/fuzz ecosystems:
cargo-fuzz parser harnesses should start from existing example or unit-test
code, move file I/O into memory, and treat parser Err as acceptable
containment for invalid inputs.arbitrary should be used when raw fuzzer bytes need to become structured
legal-ish domain values. This is how byte fuzzing graduates from “random
strings” into realistic model/channel/config shapes.Use simulation or chaos where the boundary is temporal or stateful:
These tests should assert invariants, not snapshots. Examples:
Run mutation tests selectively. Whole-workspace mutation testing is too slow and too noisy right now.
Start with:
If a mutation survives in one of these modules, either improve the test or decide that the branch is dead code and remove it.
calciforge doctor should not stop at “port is open” or “config parses.” For
first-class support, it should execute the same high-level path the user will
use:
When a check would be expensive, doctor should mark it as a skipped live check with the exact command needed to run it.
Before shipping an adapter or changing a protocol path, capture the upstream contract in the PR:
This is not a paperwork exercise. It is how we avoid learning basic protocol facts from a user’s failed test message.
doctor --live path for first-class agent smoke tests.A useful regression test should answer three questions:
If the answer to the first question is “no,” the test may still be useful, but it is not a regression test. Label it honestly.
For AI-assisted changes, the same rule applies one level higher: a generated test only earns trust when a human can name the contract it protects and the failure it would catch. Count fewer, sharper tests before counting files or lines.