One mechanism. Five names. Five models in agreement.
R7 established that attention-through-PDS works. R7.5 produced products. R7.6 picked the world-model MVP. R7.7 picked the train-a-model option. R8 answers what none of those rounds asked: mechanically, how does the substrate actually condition inference?
Five reasoning models, asked to pick exactly one mechanism from seven candidates. Five picked the same one — Attention-Mask Conditioning — and gave it five different names. The strongest panel convergence in any round of the R7-R8 family, at the architectural layer.
The 3.04× decode speedup is a tell that Modulum's existing software-only attention modification is already implicitly doing M5. R8 didn't choose M5; it formalized what was already there. PDS is not memory injected into context. PDS is a routing program over attention.
One mechanism, five names.
When five reasoning systems with different priors converge on the same architecture and give it five different names, that is the structure pushing through. The names diverge because each model carries the convergence into a different adjacent possible — see §06.
| Model | Name given | Distinctive framing | Pick |
|---|---|---|---|
| Claude | PCHR | PDS-Conditioned Head Routing. Discrete per-head, gated by PDS schema fingerprint, computed once per dispatch, held constant across the dispatch window. | M5 |
| Codex | MaskGate | Pre-attention compiler emitting head-mask + KV-block-mask + optional sparse bias. Treats route plans as compact, diffable, replayable bitsets. | M5 |
| Gemini | Modulum-SparseGate | "Dynamic sparsity engine." PDS as the runtime conductor of attention. Frames sparsity as the central exploitable resource, not a side effect. | M5 |
| Gemma | SAS · DHTS | Sparse-Attention-Substrate via Dynamic Head-Token Sparsification. Block-sparse Triton kernel with head-group + token-importance two-stage masking. | M5 |
| Grok | Domain-Specific Head Pruning | Foreshadows M5 as a deployable artifact per domain (Modulum-7B-Legal with 25% of heads). Treats M5 as the validation step before static pruning. | M5 |
What the panel agreed on, mechanically.
Five Section-2 implementation sketches converged on a single picture, with kernel-level details. Pre-attention, per-layer, per-head. Inside the self-attention module of each transformer block. Between Q/K/V projection and softmax. Inside a custom Triton kernel forking FlashAttention-3/4 or SGLang/vLLM paged attention.
- head_mask[layer, head] → {0, 1, soft}
- kv_block_mask[layer, head, block] → {0, 1}
- fact_window_map[fact_id] → token_span | cached_span | virtual_span
- head_fact_affinity[layer, head, fact_type] → float
- attention_bias[layer, head, query_pos, kv_block] (optional)
For a 7B Llama-style model (32 layers × 32 heads), the head mask is 1024 bits. With KV cache paged into 16-token blocks at 4K context, each head needs 256 block decisions. Compact. Bitset-encodable. Cheap to pass into the attention kernel.
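A minimal sketch of how that route plan could be represented, assuming the 32-layer × 32-head, 16-token-block geometry above; the RoutePlan dataclass, its field names, and the NumPy encoding are illustrative, not Modulum code.

```python
from dataclasses import dataclass, field
import numpy as np

LAYERS, HEADS = 32, 32                 # Llama-7B-style geometry
BLOCK_TOKENS, CONTEXT = 16, 4096       # paged KV cache, 4K context
KV_BLOCKS = CONTEXT // BLOCK_TOKENS    # 256 block decisions per head

@dataclass
class RoutePlan:
    # head_mask[layer, head] -> keep (True) or skip (False); a soft variant would use floats
    head_mask: np.ndarray = field(
        default_factory=lambda: np.ones((LAYERS, HEADS), dtype=bool))
    # kv_block_mask[layer, head, block] -> attend to this 16-token block or not
    kv_block_mask: np.ndarray = field(
        default_factory=lambda: np.ones((LAYERS, HEADS, KV_BLOCKS), dtype=bool))

    def size_bits(self):
        """Bits needed to ship each mask into the attention kernel."""
        return self.head_mask.size, self.kv_block_mask.size

plan = RoutePlan()
head_bits, block_bits = plan.size_bits()
print(head_bits)                # 1024 bits for the head mask (32 x 32)
print(block_bits // 8 // 1024)  # ~32 KiB of KV-block decisions per dispatch
```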
Three implementation phases
- Phase 1 — no training (frozen base). The 7B base is patched at runtime; PyTorch/Triton reference path. Sufficient for the R7.7 week-2 gate.
- Phase 2 — LoRA / route-aware adapter. Train with random substrate masks so the model becomes robust to conditional head sparsity.
- Phase 3 — Continued pretrain. Train with PDS route plans present during substrate tasks so heads specialize cleanly. Still M5, not M7 — the architecture is unchanged; the substrate compiler controls existing attention rather than adding a new layer.
Why M5 is over-determined.
All five models gave the same three-fact case against M1, M2, M3, M4, M6, and M7: the 75%-noise observation, the 3.04× decode speedup, and HyperRemember's place in the stack. M5 is the unique mechanism that explains the empirical anchors Hypernym already owns, and the unique mechanism that promotes (rather than replaces) the existing public API surface.
On the 75%-noise anchor: M1 ignores it. M2 uses it only indirectly. M3 doesn't address it (caching, not architecture). M4 absorbs it eventually but adds work. M6 bypasses tokens but doesn't explain why attention had so much removable waste. M7 eventually integrates it but requires from-scratch training.
"Under PCHR it is the prerequisite. If attention were dense (10% noise), there would be no head budget to reallocate, the routing mask would be near-identity, and the mechanism would degenerate to baseline." — Claude
M1 cannot produce decode speedup (adds tokens). M2 adds latency. M3 speeds prefill, not decode. M4 adds work. M6/M7 may add work. M5 is the only mechanism that arithmetically predicts the observed speedup.
A mask zeroing 75% of head-token interactions yields ~4× theoretical attention speedup, partially offset by kernel overhead and variable head weight — landing at exactly 3.04×. The 3.04× is a tell that Modulum's existing software-only attention modification is already implicitly doing M5. R8 formalized what was already there.
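The arithmetic, made explicit. The kept fraction is the panel's 25% assumption and the efficiency factor is back-solved from the observed number, so this is a consistency check, not a derivation:

```python
# Back-of-envelope check of the speedup arithmetic (illustrative numbers only).
kept_fraction = 0.25                     # mask keeps ~25% of head-token interactions
theoretical_speedup = 1 / kept_fraction  # ~4x if masked work were free
observed_speedup = 3.04                  # measured decode speedup
implied_efficiency = observed_speedup / theoretical_speedup
print(theoretical_speedup, round(implied_efficiency, 2))
# 4.0, 0.76 -> roughly a quarter of the theoretical gain lost to kernel overhead and variable head weight
```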
Under M1, HyperRemember serializes facts into the prompt — a content provider. Under M5, HyperRemember stays the substrate query layer but its output becomes the input to a router, not text spliced into the prompt. The vocab-window and entity slots become a fingerprint vector; the router maps fingerprint → mask.
M5 promotes HyperRemember up the stack. The alternatives either leave it where it is (M1, M3) or replace it (M6, M7).
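A minimal sketch of such a router, assuming the fingerprint arrives as a fixed-length float vector; the FingerprintRouter name, the two-layer MLP, and its dimensions are illustrative choices, not the panel's spec:

```python
import torch
import torch.nn as nn

LAYERS, HEADS = 32, 32

class FingerprintRouter(nn.Module):
    """Maps a PDS schema fingerprint to a per-layer, per-head keep/skip mask."""

    def __init__(self, fingerprint_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(fingerprint_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, LAYERS * HEADS),
        )

    def forward(self, fingerprint: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
        # Score every head, keep the top keep_ratio fraction as a hard boolean mask.
        scores = self.mlp(fingerprint).view(LAYERS, HEADS)
        k = int(keep_ratio * LAYERS * HEADS)
        threshold = scores.flatten().topk(k).values.min()
        return scores >= threshold          # head_mask[layer, head], computed once per dispatch

router = FingerprintRouter()
fingerprint = torch.randn(256)              # stand-in for vocab-window + entity-slot features
head_mask = router(fingerprint)             # held constant across the dispatch window
print(head_mask.shape, int(head_mask.sum()))  # torch.Size([32, 32]), ~256 heads kept
```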
Crafter v1 as planned tests the wrong mechanism.
All five models flagged this in Section 1. The R7.6 architecture is HyperRemember pre-action query plus in-context substrate injection — that is the M1 baseline. Crafter v1 results, even at the planned 1.5× sample efficiency / +8 absolute points target, cannot distinguish M5 from M1. They will tell us "substrate facts help action selection." They will not tell us "PDS routes through attention."
The fix the panel converged on: Crafter v1 needs a 4-arm ablation, not a 2-arm.
| Arm | Description | What it tests |
|---|---|---|
| A | Frozen base, no PDS | Floor — pure base capability. The control. |
| B | Frozen base + serialized PDS in-context | M1 (RAG-Injection). The R7.6-as-planned arm. The thing M5 must beat. |
| C | Frozen base + PDS-conditioned head mask, no in-context PDS | M5 (the bet). Must reach C ≥ B at lower token cost to validate the mechanism. |
| D | Frozen base + corrupted/shuffled PDS-head mapping | Negative control. Must collapse on substrate-dependent decisions while preserving generic language. If D ≈ C, mask geometry is decorative — M5 is just sparse attention, not a substrate mechanism. |
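One way arm D could be constructed from arm C's compiled mask, assuming a per-layer permutation is an acceptable reading of "corrupted/shuffled":

```python
import numpy as np

def shuffled_mask(head_mask: np.ndarray, seed: int = 0) -> np.ndarray:
    """Arm-D negative control: permute which heads are kept within each layer.
    Per-layer density is preserved, so any pure-sparsity speedup survives,
    but the PDS-to-head mapping is destroyed."""
    rng = np.random.default_rng(seed)
    corrupted = head_mask.copy()
    for layer in range(corrupted.shape[0]):
        rng.shuffle(corrupted[layer])   # in-place shuffle of that layer's head bits
    return corrupted
```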
Codex's hardest falsifier still applies: if a baseline given a hand-written 20-fact static checklist reproduces the gain, the substrate is doing no runtime work and the entire architecture is wrong. Pre-register this baseline. It costs nothing.
Substrate Route Stability.
Four of five models proposed a metric that does not exist in the R7-family corpus and that becomes meaningful only under M5. The convergent question: how stable are route plans under semantic equivalence? A high-quality M5 implementation produces near-identical masks for paraphrased queries and dramatically different masks for cross-domain queries.
This metric (SRD in the build plan below) should be primary in the R8 eval suite. Crafter v1's planned reward and sample-efficiency targets remain. SRD adds: do paraphrased Crafter queries produce stable masks? Does mask Hamming distance grow with domain distance? Pre-register thresholds before week-1 kernel work begins.
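A sketch of the measurement, assuming a mask compiler is callable per query; compile_mask below is a stand-in for that compiler, and the normalization is one choice among several:

```python
import numpy as np

def hamming(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Normalized Hamming distance between two boolean route masks."""
    return float(np.mean(mask_a != mask_b))

def route_stability(compile_mask, paraphrases, cross_domain):
    """compile_mask(query) -> boolean head mask. Paraphrases of one query should
    produce near-identical masks; cross-domain queries should diverge."""
    base = compile_mask(paraphrases[0])
    return {
        "paraphrase_hamming": float(np.mean(
            [hamming(base, compile_mask(q)) for q in paraphrases[1:]])),
        "cross_domain_hamming": float(np.mean(
            [hamming(base, compile_mask(q)) for q in cross_domain])),
    }
```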
Five futures from one root.
This is where the panel diverged. Each model surfaced a different new architectural primitive that becomes possible only if M5 works. None survive a "pick one" framing — they are five different futures from the same root. Together they constitute the architectural surface that becomes investible on top of Modulum-Native.
A1 · CSA.
If a head mask is a substrate, substrates become first-class algebraic objects. You can OR, AND, XOR, and subtract them.
What it unlocks
- Cross-domain reasoning natively. A clinical-trials query is M_legal AND M_medical: heads activated by both. One boolean op replaces dual-pass adjudication.
- Privacy by subtraction. M_corpus − M_PII runs the model with PII-relevant heads explicitly disabled. A non-trivial provable-safety property.
- Substrate distance metric. ‖M_A XOR M_B‖ is a domain-distance score. Drives "shared continued pretrain vs split."
Falsifier
If M_legal AND M_medical produces gibberish on cross-domain queries (intersection empty), masks are too domain-specific to compose. Empirical signature for CSA working: legal+medical AND-mask retains ≥30% of min(|M_legal|, |M_medical|) heads and produces coherent clinical-trials responses.
Cost
~50 lines of mask algebra. The work is the cross-domain Crafter analog. Buildable on the week-12 deliverable.
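Those ~50 lines could look roughly like this, assuming boolean head masks of the shape above; csa_retention mirrors the ≥30% falsifier:

```python
import numpy as np

def mask_and(a, b): return a & b       # cross-domain: heads both substrates activate
def mask_or(a, b):  return a | b       # union substrate
def mask_sub(a, b): return a & ~b      # privacy by subtraction, e.g. corpus minus PII
def mask_xor(a, b): return a ^ b       # disagreement between two route plans

def domain_distance(a: np.ndarray, b: np.ndarray) -> int:
    """‖M_A XOR M_B‖: drives shared-continued-pretrain vs split decisions."""
    return int(mask_xor(a, b).sum())

def csa_retention(a: np.ndarray, b: np.ndarray) -> float:
    """Falsifier check: the AND-mask should retain >= 30% of min(|M_A|, |M_B|) heads."""
    return float(mask_and(a, b).sum()) / max(min(a.sum(), b.sum()), 1)
```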
A2 · Causal Route Replay.
Every substrate-conditioned generation produces a route trace. Store it. Replay it.
What it unlocks
- A new eval methodology. Not "did the model answer correctly" but "did the model use the same substrate mechanism to answer."
- Route-grounded claim accountability. A claim is grounded only if the provenance fact activated a route that causally affected the output. Tested by replay.
- Cross-model substrate science. Two models on the same PDS compared on route geometry, not just answer quality.
Falsifier
If removing the claimed route does not change behavior, the route explanation is decorative. Target: ≥80% route-specific sensitivity and ≤10% irrelevant-route sensitivity on a controlled substrate task. Failure means MaskGate is only an optimization, not an accountability primitive.
Cost
Route plans are already compact (1024-bit head mask + KV-block bitsets). Replay infrastructure is logging + ablation harness. ~500 LOC.
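A sketch of the replay check behind the ≥80% / ≤10% sensitivity targets, assuming a generate(query, head_mask) callable exists; all names here are illustrative scaffolding, not existing Modulum code:

```python
import numpy as np

def ablate_heads(head_mask: np.ndarray, heads) -> np.ndarray:
    """Return a copy of the head mask with the given (layer, head) routes switched off."""
    out = head_mask.copy()
    for layer, head in heads:
        out[layer, head] = False
    return out

def route_sensitivity(generate, query, head_mask, claimed_route, irrelevant_route):
    """Replay the same query with the claimed route ablated vs an irrelevant route ablated.
    A grounded claim should change under the first ablation and not under the second."""
    baseline = generate(query, head_mask)
    return {
        "claimed_route_changes_output":
            generate(query, ablate_heads(head_mask, claimed_route)) != baseline,
        "irrelevant_route_changes_output":
            generate(query, ablate_heads(head_mask, irrelevant_route)) != baseline,
    }
```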
A3 · The density slider.
Mask density becomes a quality/speed slider that requires no change in weights.
The three gears
- Gear 1 (10% heads) — chatbot greetings, classification. "Reflex" mode.
- Gear 2 (25% heads) — standard requests at the baseline 3× speedup.
- Gear 3 (50–70% heads) — complex multi-hop reasoning at full computational capacity.
What it unlocks
A single Modulum-Native weight file serves a continuous spectrum of cost/performance points. Load balancers can dynamically shift gears based on query complexity and system load. Inference becomes a dynamic, resource-aware computation rather than a static process.
Falsifier
If Gear 1 quality drops below the natural floor of base-model + retrieval, the mechanism is just lossy compression. If Gear 3 doesn't exceed Gear 2 by more than noise, the slider has no upper range.
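A sketch of gear selection, assuming a per-head affinity score is available (for example from the Head Affinity Table in the build plan); the exact densities, and 60% as the Gear-3 point, are placeholders:

```python
import numpy as np

GEAR_DENSITY = {1: 0.10, 2: 0.25, 3: 0.60}   # fraction of heads kept per gear

def mask_for_gear(head_affinity: np.ndarray, gear: int) -> np.ndarray:
    """Keep the top-k heads by affinity for the selected gear; same weights, different mask."""
    k = max(1, int(GEAR_DENSITY[gear] * head_affinity.size))
    threshold = np.sort(head_affinity, axis=None)[-k]
    return head_affinity >= threshold

affinity = np.random.rand(32, 32)            # stand-in for A[layer, head] for one substrate
reflex = mask_for_gear(affinity, 1)          # ~10% of heads: greetings, classification
full = mask_for_gear(affinity, 3)            # ~60% of heads: multi-hop reasoning
print(round(reflex.mean(), 2), round(full.mean(), 2))
```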
A4 · Provable-Non-Hallucination.
If the token-importance mask provably contains all required facts AND attention is structurally incapable of attending outside the mask, the model is physically prevented from hallucinating about facts outside the PDS.
What it unlocks
A shift from probabilistic truth to structural truth. The model literally cannot mention X if the mask doesn't permit X. Compliance, regulated industries, court-defensible AI outputs. The strongest property of the five adjacents.
Open question
What guarantee class can be formalized? "No tokens outside vocab-window" is provable. "No false claims" is not. The buildable form is the strongest provable subset. Likely: structural-non-fabrication-of-tokens.
Falsifier
If the model can still hallucinate within the mask (recombining permitted facts incorrectly), the property is structural-non-fabrication, not structural-non-hallucination. Useful but weaker than the strong form.
A5 · Domain-Specific Head Pruning.
After M5 validation, permanently prune the noise heads for specific domains. Modulum-7B-Legal with 25% of heads becomes a deployable artifact.
What it unlocks
Ultra-efficient model variants per domain. 3× inference speedup baked into the weights. <1% MMLU drop target. Static artifact, no runtime mask compilation overhead. Hyperscaler-grade efficiency for niche deployments.
Falsifier
If pruned models lose >5% accuracy on domain tasks or fail cross-domain generalization (MMLU drop >3%), the primitive fails. Tests whether dynamic masking (M5) can scale to static architectures.
Cost
4-week pruning experiment after M5 validation. Use head-importance maps to remove 75% of heads per domain, fine-tune on 2B tokens, deploy as a static model.
Composition
These five primitives are not mutually exclusive. They compose. A2 (Causal Route Replay) is a prerequisite for A1, A3, and A4 — all three need the route trace as an auditable object. A4 (Provable-Non-Hallucination) is sharpened by A1 (CSA) — M_corpus − M_PII is a constructive privacy proof. A3 and A5 are dual — runtime slider on the same weights vs static commitment per domain. The natural R9 round picks A2 first; it is the prerequisite primitive for three of the others.
The single thing that could kill M5.
Five of five models named the same critical doubt. The 75%-noise observation says ~75% of heads are removable on average. It does not say that the signal 25% varies with domain in a stable, learnable way.
Head importance may not stably decompose by domain. If it doesn't, the mask compiler degenerates to a trivial map, M5 becomes a sparse-attention optimizer (still valuable), and the substrate-mechanism question reopens.
The panel's variants of the failure mode:
- The same 25% are signal heads regardless of domain. PDS schema does not change which heads matter. The compiler learns a near-constant mask.
- Head importance varies with domain, but in a way a small MLP from a fingerprint cannot capture. Mask quality collapses below baseline.
- The "noise" 75% is actually distributed stabilizer heads maintaining global coherence and OOD generalization. Removing them is brittleness, not pruning.
- Training on masks makes the model a "slave to the mask": it loses general reasoning when un-masked and becomes a specialized retrieval engine.
- Heads are too context-dependent for stable mapping. Masking discards critical signal, and the accuracy loss negates the speedup.
The mitigation — Claude's threshold, 4 / 5 endorsement
Run head-importance stability tests in week 1, BEFORE kernel work. For 5–10 PDS pairs spanning low affinity (legal vs Crafter) to high affinity (legal-US vs legal-EU), compute the Hamming distance between top-25% head sets across domains. Cost: <1 week. Information value: enormous.
- If signal heads turn out to be domain-invariant: ship M5 as a sparse-attention inference optimizer (~$300K of value) and pivot the substrate-mechanism thesis to an M2 + M3 hybrid for R9.
- If M5 works for high-affinity domain pairs (legal sub-domains) but not cross-vertical (legal × Crafter): scope Modulum-7B-Native to within-vertical specialization first; defer cross-domain to the A1 (CSA) work.
- If routes are domain-specific: M5 is on solid ground. Proceed with kernel work; the full 12-week build is justified.
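A sketch of the week-1 test itself, assuming per-domain head-importance scores already exist (for example from brute-force ablation); the pairwise report mirrors the <20% Hamming decision gate in the plan below:

```python
import numpy as np

def top_quarter(importance: np.ndarray) -> np.ndarray:
    """Boolean mask of the top-25% most important heads for one domain."""
    k = max(1, importance.size // 4)
    threshold = np.sort(importance, axis=None)[-k]
    return importance >= threshold

def cross_domain_hamming(importance_by_domain: dict) -> dict:
    """Pairwise Hamming distance between top-25% head sets. Low distance everywhere
    means signal heads are domain-invariant: halt and pivot per the day-3 gate."""
    masks = {d: top_quarter(imp) for d, imp in importance_by_domain.items()}
    domains = sorted(masks)
    return {
        (a, b): float(np.mean(masks[a] != masks[b]))
        for i, a in enumerate(domains)
        for b in domains[i + 1:]
    }
```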
Codex's plan, with the R8 modifications.
Inside R7.7 Option B's $550K, 12-week envelope. Inference-first artifact for the week-2 gate; route-aware adapter and continued pretrain follow.
- Days 1–3: head-importance stability test across 5–10 PDS pairs. Decision gate. If Hamming <20%, halt and pivot.
- Days 4–10: PyTorch/Triton reference attention wrapper for a Llama-compatible 7B. Runtime head + block masks.
- Days 11–14: PDS training corpus from Crafter + one code-domain substrate task. Omnifact produces facts. HyperRemember retrieves top-k. First Head Affinity Table from brute-force ablation.
R7.7 GATE: ≥50% fewer substrate tokens than RAG at equal accuracy · MMLU within 2 pts · SRD baseline measured.
Systematic head-importance sweeps across Crafter and repo tasks. Learn A[layer, head, route_feature].
4-arm Crafter ablation (A: base, B: M1 RAG, C: M5, D: shuffled-mask negative control).
Kernel work moves from reference masking to skip-aware paged attention. Profile active head ratio, active KV block ratio, per-token latency, attention FLOP reduction.
STABILITY GATE: paraphrased queries must show <15% Hamming distance.
Train a route-aware adapter or continue pretraining a 7B base with substrate-conditioned examples. Loss = LM loss + α · consistency loss + β · substrate task loss.
Success criteria:
- Crafter score: ≥+8 absolute over RAG baseline OR 1.5× sample efficiency.
- SRD: ≥+5 absolute (route stability under paraphrase).
- Decode speed: ≥2.0× over serialized-PDS RAG at 4K context; stretch 3.0×.
- MMLU: within 2 points of base when no PDS mounted.
- Cross-domain generalization: train Crafter, test routing stability on repo with affinity-table adaptation only.
Four artifacts:
- OSS MaskGate runtime package for Llama-7B-compatible models.
- HuggingFace checkpoint or adapter trained for route-aware inference.
- Eval suite with the 4-arm Crafter ablation (RAG, static checklist, MaskGate, shuffled MaskGate).
- Paper-style technical report. Public claim narrowed: "PDS-conditioned attention routing reduces substrate token dependence and decode attention cost while preserving task performance."
How M5 composes with the R7-family unfinished list.
Eight unfinished threads from the R7-family final deck. For each, M5 either composes natively, composes partially, or blocks it.
| Thread | Composes | How |
|---|---|---|
| PDS-as-Bayesian-Prior (M2) | YES · downstream | M2 is a layer above M5, applied to the LM head of a model whose attention is already PCHR-routed. They compose; M5 ships first. (Claude framing.) |
| KV-as-Substrate granularity | YES · fundamental | M5 implies per-fact granularity for high-confidence facts; per-section for low-confidence. Cache partitioning becomes mask partitioning. |
| In-Place TTT × PDS (M4) | YES · complementary | TTT = "model has learned domain" (slow path). M5 = "model is configured for domain" (fast path). Modulum's "no weights modified" is incompatible with M4 as primary but compatible as supplemental. |
| Multi-PDS arbitration (Hypernym Court) | YES · native | Mounting two PDSes = composing two route plans. Conflicts surface as conflicting head activations on the same token; runtime applies precedence (higher confidence > older > deeper-nested). Foundation for A1 (CSA). |
| Substrate diffing as scientific method | YES · native | Route plans are diffable bitsets. ‖M_run1 XOR M_run2‖ is a mechanical disagreement metric. Forge's CXDB disputes edge becomes a route-Hamming filter. |
| Living Corpus / EchoStream | PARTIAL | M5 supports between-action streaming (route updates between dispatches). Mid-token route updates are possible but risky — destabilizes generation. First implementation: session-level only. |
| Counterfactual Substrate | YES · cheap | "What if F were different" = fork activation set, recompile route plan, compare diffs. M5 counterfactuals are route-plan diffs (compact) vs M1 prompt branches (expensive) vs M4 weight-update branches (very expensive). |
| Vocab-window discipline at scale | M5 makes it MORE important | Codex's expected curve: 0–200 tokens (precision high), 200–1K (usable with strong typing), 1K–5K (lexical triggers fail, embeddings dominate), >5K (route entropy rises without ontology). Knee around 1K–2K for a 7B model. SectorPack legal needs two-stage routing. |
M5 partially blocks pure M3 economics and pure M4 persistence. It can use cache infrastructure and feed TTT, but it rejects either as the primary substrate mechanism.
The next round.
R8 did not resolve these threads. Each is a candidate seed for R9. The natural R9 picks one of the five adjacent possibles and develops it concretely; A2, Causal Route Replay, is the prerequisite for A1, A3, and A4.