Status: implemented for the core runtime. Alloy context-window
safety, named [[cascades]], named [[dispatchers]], and the shared
estimator trait are in code. char_ratio, byte_ratio, and optional
tiktoken-rs estimators are implemented through [proxy.token_estimator].
capacity_fraction and per-model/per-primitive estimator overrides remain
future work.
| # | Question | Resolution |
|---|---|---|
| 1 | Name for the size-routing primitive | dispatcher (“router” too generic) |
| 2 | Cascade as a named primitive | Yes — own [[cascades]] table |
| 3 | Safety margin default | Two knobs — estimator safety_margin (default 1.10) AND per-model capacity_fraction (default 1.0; users lower to e.g. 0.85 when a model degrades near its ceiling). Composition formula and rationale in the updated section below. |
| 4 | Per-primitive tokenizer override + second tokenizer impl | Global estimator config ships first. CharRatioEstimator, ByteRatioEstimator, and optional TiktokenEstimator are wired into routing. SentencePieceEstimator and per-model/per-primitive overrides are deferred. |
| 5 | Re-evaluation default for dispatchers | per_turn (re-evaluate each message — never dies from size). sticky as opt-in for flows where model-voice continuity matters. sticky_escalate as a middle-ground convenience (sticky, permit one auto-promotion on ceiling, then sticky at the new tier). worst_case advanced opt-in with required growth prior. |
| 6 | Back-compat: allow missing context_window on alloy constituents | No; required field. Prototype phase, all installations owned in-house. Forcing size declaration at config load prevents silent truncation forever; trivial one-time config edit. |
| 7 | Dispatcher rule semantics + capacity_fraction interaction | Default: “first target whose effective ceiling fits the request.” No max_input_tokens thresholds needed for the common case. capacity_fraction lives on each model individually and feeds the effective-ceiling computation. Explicit when.max_input_tokens rules remain available for non-size routing (cost tier, agent-id, etc.). |
Today calciforge has one model-blending primitive — alloy — and an implicit on-error fallback behavior. That’s not enough:
This RFC proposes:
- [[alloys]]: blend between equivalent models (implemented, with context-window safety)
- [[cascades]]: try in order, fall through on error (implemented as a named primitive)
- [[dispatchers]]: pick by request shape (implemented for size-first routing)
- TokenEstimator trait used by all three primitives to reason about whether a request “fits” a model. Default: configurable chars-per-token heuristic. Pluggable: real tokenizers (tiktoken, sentencepiece).
- min_context_window safety assertion at alloy-build time, so silent-truncation footguns fail loudly.

The user wrote recently:
I think we also want an alloy that hybridizes local and kimi 2.6 models to see if we can best of both and avoid hitting limits on kimi while still leveraging local compute and hopefully getting results nearly as good as kimi… though not sure how context window sizes will work in alloys, maybe we never addressed that?
This is a real use case — “most requests are small and fit local; occasional large ones need Kimi” — but forcing it into an alloy is a dead-end. Alloys do random weighted sampling between constituents. That’s meaningful when constituents are ~equivalent and the blend expresses a cost/quality preference. It breaks when one constituent can’t serve certain requests at all.
The real abstraction the user wants is: “choose smallest-sufficient model per request.” That’s not sampling — it’s routing.
A cascade tries the primary and falls through to the secondary on failure. If the primary is 200K-context Claude and the secondary is 32K Qwen, a 100K request that the primary cannot serve (say Claude is rate-limited on the second call) falls through to Qwen and silently loses roughly two-thirds of its context.
All three primitives need to respect context-window math.
Purpose: cost/quality blending via sampling. 80% fast + 20% smart = blended average.
Assumption: constituents are interchangeable. New requirement: they must have compatible context windows.
New config fields:
[[alloys]]
id = "fast-smart-blend"
name = "Fast + Smart Blend"
strategy = "weighted"
# Effective ceiling for the alloy. If not set, auto-computed as min(constituents.context_window).
# Requests above this ceiling are rejected at alloy level (loudly, not silently).
min_context_window = 200000
[[alloys.constituents]]
model = "gemini-2.5-flash"
context_window = 1048576
weight = 80
[[alloys.constituents]]
model = "claude-haiku-4-6"
context_window = 200000
weight = 20
Validation: at AlloyProvider::from_config(), error if any constituent’s declared context_window < min_context_window. Catches the “I didn’t mean to put a 32K and a 262K in the same alloy” footgun at config-load time.
Runtime check: when a request arrives, estimate its tokens (via TokenEstimator) and reject with a clear error if the estimate exceeds min_context_window. Never truncate silently.
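The load-time validation described above can be sketched roughly as follows. This is a minimal illustration, not the actual AlloyProvider::from_config code: the `Constituent` struct, the `AlloyConfigError` enum, and the `resolve_min_context_window` function are hypothetical names introduced only for this example.

```rust
/// Hypothetical, simplified stand-in for an alloy constituent entry.
#[derive(Debug, Clone)]
pub struct Constituent {
    pub model: String,
    pub context_window: usize,
}

/// Hypothetical error type for config-load failures.
#[derive(Debug, PartialEq)]
pub enum AlloyConfigError {
    NoConstituents,
    ZeroContextWindow { model: String },
    ConstituentTooSmall {
        model: String,
        context_window: usize,
        min_context_window: usize,
    },
}

/// Returns the alloy's effective ceiling, or the first validation error.
/// When `declared_min` is None the ceiling is auto-computed as the
/// smallest constituent window.
pub fn resolve_min_context_window(
    constituents: &[Constituent],
    declared_min: Option<usize>,
) -> Result<usize, AlloyConfigError> {
    // A zero window is rejected explicitly rather than treated as "unset".
    for c in constituents {
        if c.context_window == 0 {
            return Err(AlloyConfigError::ZeroContextWindow { model: c.model.clone() });
        }
    }
    let smallest = constituents
        .iter()
        .map(|c| c.context_window)
        .min()
        .ok_or(AlloyConfigError::NoConstituents)?;
    let min = declared_min.unwrap_or(smallest);
    // Every constituent must be able to hold a request of `min` tokens.
    for c in constituents {
        if c.context_window < min {
            return Err(AlloyConfigError::ConstituentTooSmall {
                model: c.model.clone(),
                context_window: c.context_window,
                min_context_window: min,
            });
        }
    }
    Ok(min)
}
```

With the two constituents from the example config, a declared minimum of 200000 passes, while raising it to 262144 would fail on the 200K constituent at load time.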
Purpose: reliability. Try primary, on timeout/5xx/429 try secondary.
Today’s behavior: there is no fallbacks field on AlloyConfig. Fallback
is implicit — AlloyProvider::select_plan() returns an ordered_models: Vec<String>
listing every constituent as a potential fallback, and the proxy iterates them
in route_with_fallback() until one succeeds. Order within that list is
deterministic for round_robin (rotating from the last selected index)
but varies per request for weighted (weighted sampling without
replacement). That “all constituents of the alloy are also fallbacks” pattern
is what the cascade primitive promotes to its own named construct, with
the important distinction that cascade ordering is always deterministic
(declaration order):
[[cascades]]
id = "kimi-with-fallback"
# First success wins.
#
# Cascade is TRIGGERED by errors (timeout, 5xx, 429) — it does not treat
# "request too large" as a retry condition. But before each step is
# attempted, the runtime pre-checks that the request fits that step's
# context_window and SKIPS unfit steps (with a warning log) rather than
# letting the model return an error. Think of it as: ineligibility is
# cheap to detect up front, so we do; actual errors are what cascade
# retries exist for.
[[cascades.steps]]
model = "opencode-go/kimi-k2.6"
context_window = 262144
[[cascades.steps]]
model = "kimi-for-coding" # Moonshot
context_window = 128000
[[cascades.steps]]
model = "local/qwen3.5-35b" # last resort, much smaller
context_window = 32768
Runtime behavior: before trying step N, estimate the request's tokens; skip to step N+1 if the request doesn’t fit. Track which steps were attempted for telemetry. Fail with a clear error if no step can serve.
Discussion open: should cascade skip-on-size be silent, or emit a log line per downgrade? Recommendation: one log line at INFO level per skipped step; a final error if every step was skipped.
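A minimal sketch of the size pre-check, under the assumption that it runs once per request before any step is attempted. `Step`, `CascadePlan`, and `plan_cascade` are hypothetical names for illustration; logging and the actual retry loop are left out.

```rust
/// Hypothetical, simplified stand-in for a [[cascades.steps]] entry.
#[derive(Debug, Clone)]
pub struct Step {
    pub model: String,
    pub context_window: usize,
}

/// Result of the size pre-check: steps to attempt in declaration order,
/// plus the steps skipped because the request does not fit them.
#[derive(Debug, PartialEq)]
pub struct CascadePlan {
    pub eligible: Vec<String>,
    pub skipped: Vec<String>,
}

/// Partition steps by fit, preserving declaration order.
/// Errors if no step can hold the request at all.
pub fn plan_cascade(steps: &[Step], estimated_tokens: usize) -> Result<CascadePlan, String> {
    let mut eligible = Vec::new();
    let mut skipped = Vec::new();
    for step in steps {
        if estimated_tokens <= step.context_window {
            eligible.push(step.model.clone());
        } else {
            // In a real runtime this is where the per-step log line would go.
            skipped.push(step.model.clone());
        }
    }
    if eligible.is_empty() {
        return Err(format!(
            "no cascade step can serve a request of ~{} tokens",
            estimated_tokens
        ));
    }
    Ok(CascadePlan { eligible, skipped })
}
```

For the kimi-with-fallback cascade above, a request estimated at 100,000 tokens would keep the two Kimi steps and skip the 32K local model; the error-driven retry then only walks the eligible list.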
Purpose: route requests to the smallest-sufficient model (or, future: by other properties).
Default behavior: ordered list of targets; first target whose effective ceiling can hold the request wins. No thresholds to maintain — the size check uses each target’s own declared context_window × capacity_fraction.
[[dispatchers]]
id = "kimi-smart"
reevaluate = "per_turn"
# Try in order. First target whose effective ceiling fits the request wins.
# Error if no target fits.
targets = [
"local/qwen3.5-35b", # effective ≈ 24,576
"opencode-go/kimi-k2.6", # effective ≈ 222,822
"gemini-2.5-flash", # effective ≈ 996,147
]
The effective ceiling is context_window × capacity_fraction, computed per-model. Adding or removing a model from the list doesn’t require re-computing thresholds — the model’s own declaration drives the fit check.
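The default first-fit rule can be sketched as below. `Target` and `dispatch` are hypothetical names, and two details are assumptions of this sketch rather than stated behavior: the effective ceiling is rounded down, and the capacity_fraction of 0.95 for gemini-2.5-flash is inferred from the 996,147 figure in the example comment.

```rust
/// Hypothetical, simplified stand-in for a dispatcher target.
#[derive(Debug, Clone)]
pub struct Target {
    pub id: String,
    pub context_window: usize,
    pub capacity_fraction: f64,
}

impl Target {
    /// context_window × capacity_fraction, rounded down (assumption:
    /// rounding down is the conservative direction for a ceiling).
    pub fn effective_ceiling(&self) -> usize {
        (self.context_window as f64 * self.capacity_fraction).floor() as usize
    }
}

/// First target, in declared order, whose effective ceiling holds the
/// request. None means no target fits and the caller should return an error.
pub fn dispatch(targets: &[Target], estimated_tokens: usize) -> Option<&Target> {
    targets
        .iter()
        .find(|t| estimated_tokens <= t.effective_ceiling())
}
```

Because each target carries its own window and fraction, reordering or extending the list needs no threshold maintenance; only the declaration order expresses preference.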
Targets can be models OR other primitives:
targets = [
"local/qwen3.5-35b",
"alloy/claude-gemini-200k", # for requests that fit the alloy's effective ceiling
"gemini-2.5-flash",
]
Explicit rules for non-size decisions (advanced — cost tier, agent id, request content, time of day):
[[dispatchers]]
id = "cost-aware"
# When explicit rules are present, they override the default fit-first-target behavior.
# Rules evaluated in declared order; first match wins.
[[dispatchers.rules]]
when.max_input_tokens = 10000 # hand-set cap, tighter than capacity_fraction would imply
target = "cheap-model"
[[dispatchers.rules]]
fits_target = true # fall back to implicit fit check against the target's effective ceiling
target = "expensive-model"
This composition is how the “kimi + local hybrid” goal is expressed safely:
“Router” is overloaded in software (content routing, URL routing, network routing). “Dispatcher” is more specific: picks which backend handles this request.
Alternatives considered:
| Name | Pro | Con |
|---|---|---|
| router | Familiar | Too generic; “routing” overloaded |
| dispatcher ⭐ | Clear “pick target per request” semantics | A little programmer-jargony |
| tier / tiers | Captures size-laddering | Presumes size is the only axis |
| fit / fitter | Short, clear for size case | Obscure for non-size rules |
| selector | Generic | Also used in k8s/CSS; overloaded |
| picker | Folksy, clear | Informal; unconventional |
| bucket-router | Explicit | Compound, awkward |
Going with dispatcher pending feedback.
TokenEstimator trait (shared)

All three primitives need to answer: “does this request’s input fit in model X’s context window?” That requires a token estimate. We want:
/// Estimate the token count of a prompt for the purpose of fit-checking.
///
/// Implementations SHOULD be conservative (over-estimate slightly) so that
/// fit-checks have headroom — under-estimation is a silent-truncation risk,
/// over-estimation at worst forces a fallthrough to a bigger model.
pub trait TokenEstimator: Send + Sync {
/// Estimate tokens for a plain-text prompt (excludes tool definitions).
fn estimate_text(&self, text: &str) -> usize;
/// Estimate tokens for a chat request (messages + optional tool definitions).
/// Default impl sums per-message + fixed overhead; override for accuracy.
fn estimate_chat(&self, messages: &[Message], tools: &[ToolDef]) -> usize {
// naive default: sum of text estimates + per-message framing overhead
let msg_tokens: usize = messages.iter().map(|m| self.estimate_text(&m.content)).sum();
let tool_tokens: usize = tools.iter().map(|t| self.estimate_text(&t.schema_json)).sum();
let framing = messages.len() * 4; // role markers, separators
msg_tokens + tool_tokens + framing
}
}
CharRatioEstimator

pub struct CharRatioEstimator {
pub chars_per_token: f32, // default 3.5 (English-prose-biased)
pub safety_margin: f32, // default 1.10 (overstate by 10%)
}
impl Default for CharRatioEstimator {
fn default() -> Self {
Self { chars_per_token: 3.5, safety_margin: 1.10 }
}
}
impl TokenEstimator for CharRatioEstimator {
fn estimate_text(&self, text: &str) -> usize {
let chars = text.chars().count() as f32;
(chars / self.chars_per_token * self.safety_margin).ceil() as usize
}
}
Rationale for default values:
chars_per_token to a denser ratio (e.g., 1.8), (b) raise safety_margin substantially (2.0+), or (c) use the Tiktoken estimator (feature flag) where an exact BPE count eliminates the guesswork. The CharRatio defaults are safe for the English-first deployments this RFC targets; anything heavier should tune.

Both fields are configurable, see “Config surface” below.
The original draft conflated two things. They’re separate concerns with different defaults and scopes:
Knob A: estimator safety_margin (multiplier on the estimate):
“I might under-count tokens because my heuristic is approximate.”
- 1.10 for char-ratio; a real tokenizer like tiktoken can use 1.02 since it’s accurate to ~1%

Knob B: model capacity_fraction (multiplier on the declared window):
“Even if I knew the exact count, some models degrade near their ceiling. Don’t push them there.”
- context_window
- 1.0 (use the full declared window). Users lower it per-model when they see quality drop-off.
- capacity_fraction = 0.7: clean separation from the estimator concern.

Fit-check composition:
Convention: TokenEstimator::estimate_* returns a conservative count
with safety_margin already applied (see the CharRatioEstimator::estimate_text
impl above — the * self.safety_margin happens inside). Callers never
multiply by safety_margin a second time. The per-model capacity_fraction
is applied once, on the declared context_window, to derive an “effective
ceiling”. The fit check is then a direct comparison:
estimate = TokenEstimator::estimate_*(...) // already margin-applied
ceiling = model.context_window * model.capacity_fraction
Rejected if: estimate > ceiling
Worked example (prose so the formula stays the single source of truth):
capacity_fraction to 0.85, giving an effective ceiling near 222,822.

Dispatcher rule language uses “effective ceiling” to mean context_window × capacity_fraction consistently.
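To make the composition concrete, here is a small numeric sketch using the Kimi figures from this RFC (context_window 262144, capacity_fraction 0.85) and the char-ratio defaults (3.5 chars per token, safety_margin 1.10). The function names are hypothetical, the character counts are made-up inputs, and rounding the ceiling down is an assumption of the sketch.

```rust
/// Estimate with the safety margin applied exactly once, inside the estimator.
pub fn estimate_tokens(char_count: usize, chars_per_token: f64, safety_margin: f64) -> usize {
    (char_count as f64 / chars_per_token * safety_margin).ceil() as usize
}

/// Effective ceiling: capacity_fraction applied exactly once, to the window.
pub fn effective_ceiling(context_window: usize, capacity_fraction: f64) -> usize {
    (context_window as f64 * capacity_fraction).floor() as usize
}

/// The fit check is a direct comparison; neither knob is re-applied here.
pub fn fits(estimate: usize, ceiling: usize) -> bool {
    estimate <= ceiling
}

/// Worked numbers: returns (ceiling, estimate for 700k chars, estimate for 720k chars).
pub fn worked_example() -> (usize, usize, usize) {
    let ceiling = effective_ceiling(262_144, 0.85); // 222,822
    // 700,000 chars / 3.5 = 200,000 raw tokens, about 220,000 after the margin.
    let small = estimate_tokens(700_000, 3.5, 1.10);
    // 720,000 chars is about 205,714 raw tokens, about 226,286 after the margin.
    let large = estimate_tokens(720_000, 3.5, 1.10);
    (ceiling, small, large)
}
```

The 700,000-character prompt lands just under the 222,822 ceiling and is accepted; the 720,000-character prompt lands above it and is rejected, even though its raw count (before the margin) would have fit.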
Config:
[proxy.token_estimator]
strategy = "auto" # auto, char_ratio, byte_ratio, or tiktoken
# tokenizer = "o200k_base" # optional tiktoken base override
chars_per_token = 3.5
safety_margin = 1.10 # estimator knob
[[models]]
id = "kimi-k2.6"
context_window = 262144
capacity_fraction = 0.85 # avoid top 15% where Kimi reportedly degrades
[[models]]
id = "claude-sonnet-4-6"
context_window = 200000
capacity_fraction = 0.95 # Claude holds up closer to ceiling
[[models]]
id = "local/qwen3.5-35b"
context_window = 32768
capacity_fraction = 0.75 # user has observed noticeable drop past 24K
// OpenAI-compatible BPE count via optional `tiktoken-rs` feature.
pub struct TiktokenEstimator {
bpe: &'static tiktoken_rs::CoreBPE,
}
// SentencePiece for Llama-family models
pub struct SentencePieceEstimator { /* ... */ }
Users opt in by configuring a non-default estimator. We ship
CharRatioEstimator and ByteRatioEstimator in the default build; the
OpenAI-compatible tokenizer is gated behind
--features tiktoken-estimator to keep default build dependencies light.
Different models have wildly different token/char ratios:
| Model family | Rough chars/token |
|---|---|
| GPT-4 (English prose) | ~4 |
| GPT-4 (code) | ~2.5 |
| Claude | ~3.7 |
| Llama/Qwen | ~3.0 |
| Chinese text | ~1.5 |
So chars_per_token should be overridable per-model:
[model_defaults]
chars_per_token = 3.5
safety_margin = 1.10
[[models]]
id = "qwen3.5-35b"
context_window = 32768
chars_per_token = 3.0 # Qwen tokenizer tends denser
[[models]]
id = "kimi-k2.6"
context_window = 262144
chars_per_token = 2.8 # Chinese-English mixed, code-heavy
Note: the [tokenizer], [model_defaults], and [[models]] sections below are proposed additions to CalciforgeConfig. They do not exist in the current schema and will be added as part of the implementation of this RFC. A schema version bump is expected; existing configs stay valid without them (resolution falls through to built-in defaults).
Global default in top-level config:
[tokenizer]
kind = "char_ratio" # "char_ratio" | "tiktoken" | "sentencepiece"
chars_per_token = 3.5
safety_margin = 1.10
Per-primitive override (in an alloy / cascade / dispatcher):
[[dispatchers]]
id = "smart"
[dispatchers.tokenizer]
kind = "tiktoken"
encoding = "cl100k_base"
Per-model override: as shown above. Takes precedence over per-primitive and global.
Resolution order: per-model > per-primitive > global > built-in default.
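The resolution order can be sketched with plain Option chaining. This assumes each field resolves independently (a per-model chars_per_token does not force a per-model safety_margin), which the RFC does not state explicitly; `Overrides`, `EstimatorParams`, and `resolve` are hypothetical names.

```rust
/// Fully resolved estimator parameters.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct EstimatorParams {
    pub chars_per_token: f64,
    pub safety_margin: f64,
}

/// One configuration layer; None means "not set at this level".
#[derive(Debug, Clone, Copy, Default)]
pub struct Overrides {
    pub chars_per_token: Option<f64>,
    pub safety_margin: Option<f64>,
}

/// Built-in defaults used when no layer sets a field.
pub const BUILT_IN: EstimatorParams = EstimatorParams {
    chars_per_token: 3.5,
    safety_margin: 1.10,
};

/// per-model > per-primitive > global > built-in default, field by field.
pub fn resolve(per_model: Overrides, per_primitive: Overrides, global: Overrides) -> EstimatorParams {
    EstimatorParams {
        chars_per_token: per_model
            .chars_per_token
            .or(per_primitive.chars_per_token)
            .or(global.chars_per_token)
            .unwrap_or(BUILT_IN.chars_per_token),
        safety_margin: per_model
            .safety_margin
            .or(per_primitive.safety_margin)
            .or(global.safety_margin)
            .unwrap_or(BUILT_IN.safety_margin),
    }
}
```

An empty config therefore resolves to the built-in defaults, which matches the note that existing configs stay valid without the new sections.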
The primitives are composable. Common patterns:
Pattern 1: size-tier that blends within tiers
[[alloys]]
id = "claude-gemini-200k"
min_context_window = 200000
# … 200K-safe blend
[[dispatchers]]
id = "smart"
[[dispatchers.rules]]
when.max_input_tokens = 30000
target = "local/qwen3.5-35b"
[[dispatchers.rules]]
when.max_input_tokens = 180000
target = "alloy/claude-gemini-200k" # small-enough for our 200K blend
[[dispatchers.rules]]
when.max_input_tokens = 900000
target = "gemini-2.5-flash"
Pattern 2: dispatcher in front of cascade
[[cascades]]
id = "kimi-or-fallback"
# Assumes caller fits within narrowest member — paired with a dispatcher for safety
[[cascades.steps]]
model = "opencode-go/kimi-k2.6"
[[cascades.steps]]
model = "kimi-for-coding"
[[dispatchers]]
id = "with-safety"
[[dispatchers.rules]]
when.max_input_tokens = 125000 # narrowest cascade member is 128K Moonshot
target = "cascade/kimi-or-fallback"
[[dispatchers.rules]]
when.max_input_tokens = 250000
target = "opencode-go/kimi-k2.6" # direct, Moonshot can't fit
Rule: cascades are not size-safe on their own — they must be used at a level where all members can serve the incoming request, OR wrapped in a dispatcher.
A dispatcher picks at request time. By message 20, the cumulative context may have grown past the initially-chosen model’s ceiling.
Options:
Decision: default to per_turn. Chat APIs are stateless; re-picking per message mechanically works. For task-completion flows (calciforge’s main use case) the cost of an occasional model swap is lower than the cost of a session that dies at a ceiling.
[[dispatchers]]
id = "smart"
reevaluate = "per_turn" # default — re-pick each message
# reevaluate = "sticky" # pick once, error on ceiling (for voice-continuity flows)
# reevaluate = "sticky_escalate" # sticky, auto-promote once on ceiling, then sticky at new tier
# reevaluate = "worst_case" # advanced — requires growth prior below
# assume_session_max_tokens = 100000
Pragmatic default for most flows: sticky_escalate is arguably the sweet spot (stability most of the session, graceful upgrade when needed). Listed as an opt-in for now since per_turn is simplest and always works.
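The three simpler modes can be sketched as a small state machine over the targets' effective ceilings. This is one possible reading of the semantics described here, not the shipped logic: `Reevaluate`, `Session`, and `pick` are hypothetical names, targets are represented only by their ceilings in declared order, and worst_case is omitted because it depends on a growth prior.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Reevaluate {
    PerTurn,
    Sticky,
    StickyEscalate,
}

/// Per-session dispatcher state (hypothetical).
#[derive(Debug, Default)]
pub struct Session {
    pub pinned: Option<usize>, // index of the target chosen earlier
    pub escalated: bool,       // whether the one allowed promotion was used
}

/// Index of the first ceiling at or after `from` that holds `tokens`.
fn first_fit(ceilings: &[usize], from: usize, tokens: usize) -> Option<usize> {
    ceilings
        .iter()
        .enumerate()
        .skip(from)
        .find(|(_, c)| tokens <= **c)
        .map(|(i, _)| i)
}

/// Pick a target index for this turn, updating session state as needed.
pub fn pick(
    mode: Reevaluate,
    ceilings: &[usize],
    tokens: usize,
    session: &mut Session,
) -> Result<usize, String> {
    let too_big = || format!("request of ~{} tokens exceeds the available ceiling", tokens);
    match (mode, session.pinned) {
        // per_turn: stateless first-fit on every message.
        (Reevaluate::PerTurn, _) => first_fit(ceilings, 0, tokens).ok_or_else(too_big),
        // First message of a sticky session: pick once and remember it.
        (_, None) => {
            let i = first_fit(ceilings, 0, tokens).ok_or_else(too_big)?;
            session.pinned = Some(i);
            Ok(i)
        }
        // Still fits the pinned target: stay put.
        (_, Some(i)) if tokens <= ceilings[i] => Ok(i),
        // sticky_escalate: one promotion to a later target, then sticky there.
        (Reevaluate::StickyEscalate, Some(i)) if !session.escalated => {
            let j = first_fit(ceilings, i + 1, tokens).ok_or_else(too_big)?;
            session.pinned = Some(j);
            session.escalated = true;
            Ok(j)
        }
        // sticky, or sticky_escalate with its promotion already spent.
        (_, Some(_)) => Err(too_big()),
    }
}
```

Under this reading, per_turn can move back down to a smaller target when a later message is small, while sticky_escalate only ever moves up, and only once.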
Function definitions + tool results add tokens not present in the user’s message. A list_files tool definition is ~100 tokens; a directory listing result can be thousands.
Addressed by the estimate_chat method taking tools explicitly, and by default safety margin. Power users with tool-heavy flows should bump safety_margin (a multiplier — try 1.15 or 1.20, i.e. +15–20%, rather than the default 1.10).
Correction to an earlier draft: for most providers the context window bounds input_tokens + output_tokens, not just input. A request that fits on input can still overflow at generation time if max_tokens is large. Our TokenEstimator measures input only, so the fit-check must reserve headroom for the output budget. Implementation detail: primitive runtime compares estimate(input) + max_tokens_for_request against effective_ceiling, not just estimate(input). Callers who don’t set max_tokens explicitly must supply a default output budget (e.g., 4K) so the check isn’t silently bypassed.
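A minimal sketch of the input-plus-output check described above. The function name and the constant are hypothetical; the 4096-token default is just the "e.g., 4K" figure from the text, not a decided value.

```rust
/// Assumed default output budget when the caller does not set max_tokens.
pub const DEFAULT_OUTPUT_BUDGET: usize = 4096;

/// True if the estimated input plus the reserved output budget fits under
/// the model's effective ceiling (context_window × capacity_fraction).
pub fn fits_with_output(
    estimated_input: usize,
    max_tokens: Option<usize>,
    effective_ceiling: usize,
) -> bool {
    let output_budget = max_tokens.unwrap_or(DEFAULT_OUTPUT_BUDGET);
    // saturating_add avoids wrap-around on absurdly large max_tokens values.
    estimated_input.saturating_add(output_budget) <= effective_ceiling
}
```

For a 32,768-token ceiling, a 30,000-token input with no explicit max_tokens is rejected (30,000 + 4,096 exceeds the ceiling), while the same input with max_tokens = 1000 is accepted.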
Reasoning tokens add output cost but are usually not in the input context. Estimator doesn’t need to account for them.
All sizes are in tokens. Not characters, not bytes. Config authors can use K suffix for readability; K means * 1024 (binary convention):
context_window = 262144 # tokens (2^18)
# equivalently:
context_window = "256K" # parsed as 256 * 1024 = 262144
A literal “262K” would parse as 262 * 1024 = 268288, not 262144 — use "256K" if you want the 262144 value.
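The suffix rule is small enough to sketch directly. `parse_token_size` is a hypothetical helper name; this sketch accepts only an uppercase K and treats anything else as a plain integer.

```rust
/// Parse a token size such as "262144" or "256K" (K = × 1024).
pub fn parse_token_size(raw: &str) -> Result<usize, String> {
    let s = raw.trim();
    // Split off an optional trailing "K" and pick the multiplier.
    let (digits, multiplier) = match s.strip_suffix('K') {
        Some(prefix) => (prefix.trim(), 1024usize),
        None => (s, 1usize),
    };
    let n: usize = digits
        .parse()
        .map_err(|e| format!("invalid token size {:?}: {}", raw, e))?;
    n.checked_mul(multiplier)
        .ok_or_else(|| format!("token size {:?} overflows", raw))
}
```

With this rule "256K" yields 262144 and "262K" yields 268288, matching the warning above about the two values being easy to confuse.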
context_window (validated at AlloyProvider::from_config, and 0 is rejected explicitly). No silent-truncation path.

Alloy constituents now REQUIRE context_window. No back-compat for missing fields. Prototype phase, all installations owned in-house: fixing existing config files is a one-time edit, and the upside (no silent truncation, ever) is worth breaking the schema. min_context_window on the alloy stays optional and auto-computes as min(constituent.context_window) when not specified.
Existing config files needing updates:
- /etc/calciforge/config.toml using [[alloys]]
- [[alloys]]

Migration done in the same PR that introduces the required field.
Cascades today are implicit inside alloy’s fallbacks behavior. This RFC promotes them to a named [[cascades]] primitive. Transition: existing alloy-with-fallbacks behavior becomes sugar for alloy-wrapped-in-cascade; we keep the sugar so existing configs don’t break.
Dispatchers are new. Opt-in.
TokenEstimator is implemented. Added as a global config under
[proxy.token_estimator]. If unconfigured, strategy = "auto" uses
tiktoken-rs when compiled in and recognized, otherwise the char-ratio
fallback.
In-scope:
Out of scope (follow-up work):
- tiktoken-rs or other real tokenizers (separate PR; trait is there, default is enough for now)
- docs/model-gateway.md when implementation lands

All initial review questions have been resolved. Remaining open items surface during implementation:
- capacity_fraction defaults per known model family: need empirical data. Initial defaults: 1.0 (no derate) until a model proves it needs less. Claude ~0.95, Kimi ~0.85 recommended starting points in docs, not hardcoded.
- cl100k_base vs o200k_base per model, or pick a sensible default? Open; likely per-model override.

- context_window to AlloyConstituentConfig, min_context_window to AlloyConfig, validation in AlloyProvider::new(). No runtime behavior change beyond rejecting bad configs at startup. (Already scoped as task #23.)
- TokenEstimator trait + CharRatioEstimator impl. Wire into alloy for runtime request-fit rejection.
- [[cascades]] named primitive with same fit-check semantics.
- [[dispatchers]] with max_input_tokens rules.
- docs/model-gateway.md authoritative reference.

Each as a focused PR. RFC is the long-form design; PRs execute the plan.