R8 · Substrate Internals
The Mechanism Round
Headline

One mechanism. Five names. Five models in agreement.

R7 established that attention-through-PDS works. R7.5 produced products. R7.6 picked the world-model MVP. R7.7 picked the train-a-model option. R8 answers what none of those rounds asked: mechanically, how does the substrate actually condition inference?

Five reasoning models, asked to pick exactly one mechanism from seven candidates. Five picked the same one — Attention-Mask Conditioning — and gave it five different names. The strongest panel convergence in any round of the R7-R8 family, at the architectural layer.

5 / 5
Models picked M5 — Attention-Mask Conditioning. Strongest architectural convergence in the family.
7 → 1
Candidate mechanisms (M1–M7) reduced to a single architecturally over-determined choice.
3.04×
Decode speedup, no weights modified. M5 is the only mechanism that explains it arithmetically.
75%
Of attention is noise. M5 is the only mechanism that requires this observation to make economic sense.
Panel Convergence
The 3.04× decode speedup is a tell that Modulum's existing software-only attention modification is already implicitly doing M5. R8 didn't choose M5; it formalized what was already there. PDS is not memory injected into context. PDS is a routing program over attention.
01 · Convergence

One mechanism, five names.

When five reasoning systems with different priors converge on the same architecture and give it five different names, that is the structure pushing through. The names diverge because each model carries the convergence into a different adjacent possible — see §06.

Model | Name given | Distinctive framing | Pick
Claude | PCHR | PDS-Conditioned Head Routing. Discrete per-head, gated by PDS schema fingerprint, computed once per dispatch, held constant across the dispatch window. | M5
Codex | MaskGate | Pre-attention compiler emitting head-mask + KV-block-mask + optional sparse bias. Treats route plans as compact, diffable, replayable bitsets. | M5
Gemini | Modulum-SparseGate | "Dynamic sparsity engine." PDS as the runtime conductor of attention. Frames sparsity as the central exploitable resource, not a side effect. | M5
Gemma | SAS · DHTS | Sparse-Attention-Substrate via Dynamic Head-Token Sparsification. Block-sparse Triton kernel with head-group + token-importance two-stage masking. | M5
Grok | Domain-Specific Head Pruning | Foreshadows M5 as a deployable artifact per domain (Modulum-7B-Legal with 25% of heads). Treats M5 as the validation step before static pruning. | M5
02 · Mechanism

What the panel agreed on, mechanically.

Five Section-2 implementation sketches converged on a single picture, with kernel-level details. Pre-attention, per-layer, per-head. Inside the self-attention module of each transformer block. Between Q/K/V projection and softmax. Inside a custom Triton kernel forking FlashAttention-3/4 or SGLang/vLLM paged attention.

Prompt + state
→ HyperRemember query (top-k facts + entities)
→ PDS activation set
→ MaskGate compiler · maps PDS schema → route plan
→ Route plan · head_mask + kv_block_mask + optional sparse bias
→ Attention kernel · block-sparse FlashAttention; skip inactive heads, skip masked KV blocks
→ LLM decode
→ Output + audit trace (route IDs, activated heads per fact)
What MaskGate emits:

  • head_mask[layer, head] → {0, 1, soft}
  • kv_block_mask[layer, head, block] → {0, 1}
  • fact_window_map[fact_id] → token_span | cached_span | virtual_span
  • head_fact_affinity[layer, head, fact_type] → float
  • attention_bias[layer, head, query_pos, kv_block] (optional)

For a 7B Llama-style model (32 layers × 32 heads), the head mask is 1024 bits. With KV cache paged into 16-token blocks at 4K context, each head needs 256 block decisions. Compact. Bitset-encodable. Cheap to pass into the attention kernel.
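To make those shapes concrete: a minimal PyTorch sketch of applying a route plan inside one layer's attention, in the spirit of the Phase-1 reference wrapper in §08. The function name, tensor layout, and NaN guard are illustrative assumptions; this is a correctness reference, not the production Triton kernel.

    import torch
    import torch.nn.functional as F

    BLOCK = 16  # KV cache paged into 16-token blocks, per the sizing above

    def routed_attention(q, k, v, head_mask, kv_block_mask):
        # q, k, v: [batch, heads, seq, dim]
        # head_mask: [heads] in {0, 1}; kv_block_mask: [heads, n_blocks] in {0, 1}
        b, h, s, d = q.shape
        scores = (q @ k.transpose(-2, -1)) / d ** 0.5               # [b, h, s, s]
        # Expand per-head block decisions to per-token KV columns.
        tok_mask = kv_block_mask.repeat_interleave(BLOCK, dim=-1)[:, :s]
        scores = scores.masked_fill(tok_mask[None, :, None, :] == 0, float("-inf"))
        # Rows whose KV blocks are all masked would softmax to NaN; zero them.
        attn = F.softmax(scores, dim=-1).nan_to_num(0.0)
        out = attn @ v                                              # [b, h, s, d]
        # Zero inactive heads; a real kernel skips their compute entirely.
        return out * head_mask.view(1, h, 1, 1)

A Triton fork of FlashAttention realizes the speedup by never visiting masked blocks; the dense reference above merely fills them with −inf, so it validates correctness without delivering the speedup.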

Three implementation phases: runtime mask (inference-first), route-aware adapter, continued pretrain (see §08).

03 · The Argument

Why M5 is over-determined.

All five models gave the same three-fact case against M1, M2, M3, M4, M6, and M7. M5 is the unique mechanism that explains the empirical anchors Hypernym already owns AND the unique mechanism that promotes (rather than replaces) the existing public API surface.

75% noise · requires it

M1 ignores it. M2 uses it indirectly. M3 doesn't address it (caching, not architecture). M4 absorbs it eventually but adds work. M6 bypasses tokens but doesn't explain why attention had so much removable waste. M7 eventually integrates it but requires from-scratch training.

"Under PCHR it is the prerequisite. If attention were dense (10% noise), there would be no head budget to reallocate, the routing mask would be near-identity, and the mechanism would degenerate to baseline." — Claude

3.04× decode · predicts it

M1 cannot produce decode speedup (adds tokens). M2 adds latency. M3 speeds prefill, not decode. M4 adds work. M6/M7 may add work. M5 is the only mechanism that arithmetically predicts the observed speedup.

A mask zeroing 75% of head-token interactions yields a ~4× theoretical attention speedup, partially offset by kernel overhead and uneven head cost, landing at the observed 3.04×. The 3.04× is a tell that Modulum's existing software-only attention modification is already implicitly doing M5. R8 formalized what was already there.
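A back-of-envelope restatement of that arithmetic (the ~0.76 efficiency factor is implied by the two numbers, not independently measured):

    theoretical = 1 / (1 - 0.75)          # 4.0x: only 25% of head-token work remains
    observed = 3.04
    efficiency = observed / theoretical   # ~0.76: kernel overhead + uneven head cost
    print(f"{theoretical:.1f}x ideal x {efficiency:.2f} realized = {observed}x")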

HyperRemember · promoted

Under M1, HyperRemember serializes facts into the prompt — a content provider. Under M5, HyperRemember stays the substrate query layer but its output becomes the input to a router, not text spliced into the prompt. The vocab-window and entity slots become a fingerprint vector; the router maps fingerprint → mask.

M5 promotes HyperRemember up the stack. The alternatives either leave it where it is (M1, M3) or replace it (M6, M7).
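A hypothetical sketch of that promotion, fingerprint in, head mask out. The fingerprint dimension, the MLP shape, and the 0.5 threshold are illustrative assumptions; the panel fixed the interface, not the router internals.

    import torch
    import torch.nn as nn

    N_LAYERS, N_HEADS, FP_DIM = 32, 32, 256  # 7B Llama-style sizing from §02

    class MaskRouter(nn.Module):
        """Maps a PDS schema fingerprint to a per-layer, per-head mask,
        computed once per dispatch and held constant across the window."""
        def __init__(self):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(FP_DIM, 512), nn.ReLU(),
                nn.Linear(512, N_LAYERS * N_HEADS),
            )

        def forward(self, fingerprint: torch.Tensor) -> torch.Tensor:
            logits = self.mlp(fingerprint).view(N_LAYERS, N_HEADS)
            return torch.sigmoid(logits) > 0.5  # hard {0,1} mask; soft masks also legal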

04 · Crafter Fix

Crafter v1 as planned tests the wrong mechanism.

All five models flagged this in Section 1. The R7.6 architecture is HyperRemember pre-action query plus in-context substrate injection — that is the M1 baseline. Crafter v1 results, even at the planned 1.5× sample efficiency / +8 absolute points target, cannot distinguish M5 from M1. They will tell us "substrate facts help action selection." They will not tell us "PDS routes through attention."

The fix the panel converged on: Crafter v1 needs a 4-arm ablation, not a 2-arm.

Arm | Description | What it tests
A | Frozen base, no PDS | Floor: pure base capability. The control.
B | Frozen base + serialized PDS in-context | M1 (RAG-Injection). The R7.6-as-planned arm. The thing M5 must beat.
C | Frozen base + PDS-conditioned head mask, no in-context PDS | M5 (the bet). Must reach C ≥ B at lower token cost to validate the mechanism.
D | Frozen base + corrupted/shuffled PDS-head mapping | Negative control. Must collapse on substrate-dependent decisions while preserving generic language. If D ≈ C, mask geometry is decorative and M5 is just sparse attention, not a substrate mechanism.

Codex's hardest falsifier still applies: if a baseline given a hand-written 20-fact static checklist reproduces the gain, the substrate is doing no runtime work and the entire architecture is wrong. Pre-register this baseline. It costs nothing.
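One cheap way to pre-register all five arms (the four above plus the static-checklist falsifier) is as plain data the eval harness reads; the field names here are assumptions:

    ARMS = {
        "A": dict(pds=None,               mask=None,       tests="floor: pure base capability"),
        "B": dict(pds="in_context",       mask=None,       tests="M1 RAG-injection, the arm M5 must beat"),
        "C": dict(pds="routed",           mask="maskgate", tests="M5: reach B at lower token cost"),
        "D": dict(pds="routed",           mask="shuffled", tests="negative control: must collapse"),
        "E": dict(pds="static_checklist", mask=None,       tests="Codex falsifier: hand-written 20 facts"),
    }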

05 · New Metric

Substrate Route Stability.

Four of five models proposed a metric that does not exist in the R7-family corpus and that becomes meaningful only under M5. The convergent question: how stable are route plans under semantic equivalence? A high-quality M5 implementation produces near-identical masks for paraphrased queries and dramatically different masks for cross-domain queries.

Model | Metric | Construct
Codex | SRD | Substrate-Route Divergence: Hamming-style distance between route plans across paraphrased queries or corrupted PDSes.
Claude | Paraphrase stability | Same construct, framed as the falsifier of the head-stability doubt. Mask drift under paraphrase is the diagnostic.
Gemini | Head-importance map stability | Cross-domain consistency of which heads matter. Tested by running the same PDS through paraphrased queries and measuring head overlap.
Gemma | Route entropy | Entropy of the mask distribution per query. Low entropy = routes are stably committed; high entropy = the compiler is hedging.

This metric should be primary in the R8 eval suite. Crafter v1's planned reward and sample-efficiency targets remain. SRD adds: do paraphrased Crafter queries produce stable masks? Does mask Hamming distance grow with domain distance? Pre-register thresholds before week 1 kernel work begins.
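A minimal sketch of the metric family, assuming route plans are serialized as flat boolean arrays (the 1024-bit head mask from §02). The normalization and the entropy variant are illustrative choices; thresholds still need pre-registration.

    import numpy as np

    def srd(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
        """Substrate-Route Divergence: normalized Hamming distance (0 = identical routes)."""
        return float(np.mean(mask_a != mask_b))

    def route_entropy(masks: np.ndarray) -> float:
        """Gemma's variant. masks: [n_queries, n_bits] bool, one row per paraphrase.
        Low entropy = routes stably committed; high = the compiler is hedging."""
        p = masks.mean(axis=0).clip(1e-9, 1 - 1e-9)
        return float(np.mean(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))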

06 · Adjacent Possibles

Five futures from one root.

This is where the panel diverged. Each model surfaced a different new architectural primitive that becomes possible only if M5 works. None survive a "pick one" framing — they are five different futures from the same root. Together they constitute the architectural surface that becomes investible on top of Modulum-Native.

A1 · Compositional Substrate Algebra · Claude

If a head mask is a substrate, substrates become first-class algebraic objects. You can OR, AND, XOR, subtract them.

What it unlocks

  • Cross-domain reasoning natively. A clinical-trials query is M_legal AND M_medical — heads activated by both. One boolean op replaces dual-pass adjudication.
  • Privacy by subtraction. M_corpus − M_PII runs the model with PII-relevant heads explicitly disabled. A non-trivial provable-safety property.
  • Substrate distance metric. ‖M_A XOR M_B‖ is a domain-distance score. Drives "shared continued pretrain vs split."
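A sketch of those three operations on numpy bitsets standing in for head masks; the composability check mirrors the ≥30% signature in the falsifier below. All names are illustrative.

    import numpy as np

    def m_and(a, b):  return a & b               # cross-domain: heads active in both
    def m_sub(a, b):  return a & ~b              # privacy by subtraction
    def m_dist(a, b): return int((a ^ b).sum())  # ||A XOR B||: domain-distance score

    def composable(m_legal, m_medical, floor=0.30):
        """CSA working-signature: AND-mask keeps >= floor of the smaller mask."""
        keep = m_and(m_legal, m_medical).sum()
        return keep >= floor * min(m_legal.sum(), m_medical.sum())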

Falsifier

If M_legal AND M_medical produces gibberish on cross-domain queries (intersection empty), masks are too domain-specific to compose. Empirical signature for CSA working: legal+medical AND-mask retains ≥30% of min(|M_legal|, |M_medical|) heads and produces coherent clinical-trials responses.

Cost

~50 lines of mask algebra. The work is the cross-domain Crafter analog. Buildable on the week-12 deliverable.

A2 · Causal Route Replay · Codex

Every substrate-conditioned generation produces a route trace. Store it. Replay it.

What it unlocks

  • A new eval methodology. Not "did the model answer correctly" but "did the model use the same substrate mechanism to answer."
  • Route-grounded claim accountability. A claim is grounded only if the provenance fact activated a route that causally affected the output. Tested by replay.
  • Cross-model substrate science. Two models on the same PDS compared on route geometry, not just answer quality.

Falsifier

If removing the claimed route does not change behavior, the route explanation is decorative. Target: ≥80% route-specific sensitivity and ≤10% irrelevant-route sensitivity on a controlled substrate task. Failure means MaskGate is only an optimization, not an accountability primitive.

Cost

Route plans are already compact (1024-bit head mask + KV-block bitsets). Replay infrastructure is logging + ablation harness. ~500 LOC.
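A harness-level sketch of the replay test. `model.generate` taking a route plan is a hypothetical API; route plans are numpy bitsets as above, and the pass criterion mirrors the ≥80% / ≤10% targets.

    def route_sensitivity(model, queries, route_plans, claimed_bits):
        """Fraction of runs whose output changes when the claimed route is ablated.
        Targets: >= 0.8 on claimed routes, <= 0.1 on irrelevant routes."""
        changed = 0
        for query, plan, bits in zip(queries, route_plans, claimed_bits):
            base = model.generate(query, route_plan=plan)
            ablated = plan.copy()
            ablated[bits] = 0                    # remove the claimed route
            if model.generate(query, route_plan=ablated) != base:
                changed += 1
        return changed / len(queries)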

A3 · Cognitive Gearing · Gemini

Mask density becomes a quality/speed slider that requires no change in weights.

The three gears

  • Gear 1 (10% heads) — chatbot greetings, classification. "Reflex" mode.
  • Gear 2 (25% heads) — standard requests at the baseline 3× speedup.
  • Gear 3 (50–70% heads) — complex multi-hop reasoning at full computational capacity.

What it unlocks

A single Modulum-Native weight file serves a continuous spectrum of cost/performance points. Load balancers can dynamically shift gears based on query complexity and system load. Inference becomes a dynamic, resource-aware computation rather than a static process.

Falsifier

If Gear 1 quality drops below the natural floor of base-model + retrieval, the mechanism is just lossy compression. If Gear 3 doesn't exceed Gear 2 by more than noise, the slider has no upper range.

A4 · Provable-Non-Hallucination · Gemma

If the token-importance mask provably contains all required facts AND attention is structurally incapable of attending outside the mask, the model is physically prevented from hallucinating about facts outside the PDS.

What it unlocks

A shift from probabilistic truth to structural truth. The model literally cannot mention X if the mask doesn't permit X. Compliance, regulated industries, court-defensible AI outputs. The strongest property of the five adjacents.

Open question

What guarantee class can be formalized? "No tokens outside vocab-window" is provable. "No false claims" is not. The buildable form is the strongest provable subset. Likely: structural-non-fabrication-of-tokens.

Falsifier

If the model can still hallucinate within the mask (recombining permitted facts incorrectly), the property is structural-non-fabrication, not structural-non-hallucination. Useful but weaker than the strong form.

A5 · Domain-Specific Head Pruning · Grok

Post-M5 validation, permanently prune the noise heads for specific domains. Modulum-7B-Legal with 25% of heads as a deployable artifact.

What it unlocks

Ultra-efficient model variants per domain. 3× inference speedup baked into the weights. <1% MMLU drop target. Static artifact, no runtime mask compilation overhead. Hyperscaler-grade efficiency for niche deployments.

Falsifier

If pruned models lose >5% accuracy on domain tasks or fail cross-domain generalization (MMLU drop >3%), the primitive fails. Tests whether dynamic masking (M5) can scale to static architectures.

Cost

4-week pruning experiment after M5 validation. Use head-importance maps to remove 75% of heads per domain, fine-tune on 2B tokens, deploy as a static model.

Composition

These five primitives are not mutually exclusive. They compose. A2 (Causal Route Replay) is a prerequisite for A1, A3, and A4; all three need the route trace as an auditable object. A4 (Provable-Non-Hallucination) is sharpened by A1 (CSA): M_corpus − M_PII is a constructive privacy proof. A3 and A5 are dual: runtime slider on the same weights vs static commitment per domain. The natural R9 round picks A2 first; it is the prerequisite primitive for three of the others.

07 · Convergent Doubt

The single thing that could kill M5.

Five of five models named the same critical doubt. The 75%-noise observation says ~75% of heads are removable on average. It does not say that the identity of the signal 25% varies stably with domain.

The Doubt — 5 / 5 Convergent
Head importance may not stably decompose by domain. If it doesn't, the mask compiler degenerates to a trivial map, M5 becomes a sparse-attention optimizer (still valuable), and the substrate-mechanism question reopens.
Claude · Failure A
Domain-invariant signal heads

The same 25% are signal heads regardless of domain. PDS schema does not change which heads matter. Compiler learns a near-constant mask.

Claude · Failure B
Non-linear PDS→head map

Head importance varies with domain but in a way not capturable by a small MLP from a fingerprint. Mask quality collapses below baseline.

Gemini
Holistic stabilizer heads

The "noise" 75% is actually distributed stabilizer heads maintaining global coherence and OOD generalization. Removing them is brittleness, not pruning.

Gemma
Catastrophic specialization

Trained-on-mask makes the model a "slave to the mask." Loses general reasoning when un-masked. Becomes a specialized retrieval engine.

Grok
Head entanglement

Heads are too context-dependent for stable mapping. Masking discards critical signal; speedup negates accuracy.

The mitigation — Claude's threshold, 4 / 5 endorsement

Run head-importance stability tests in week 1, BEFORE kernel work. For 5–10 PDS pairs, spanning low-affinity (legal vs Crafter) to high-affinity (legal-US vs legal-EU), compute Hamming distance on top-25% head sets across domains. Cost: <1 week. Information value: enormous.
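A sketch of that test, assuming per-domain head-importance scores (e.g., from the brute-force ablation planned for days 11–14) are in hand:

    import numpy as np

    def top_quartile(importance: np.ndarray) -> np.ndarray:
        """importance: [layers, heads] scores -> bitset of the top-25% heads."""
        flat = importance.ravel()
        k = max(1, flat.size // 4)
        mask = np.zeros(flat.size, dtype=bool)
        mask[np.argsort(flat)[-k:]] = True
        return mask

    def pair_hamming(imp_a: np.ndarray, imp_b: np.ndarray) -> float:
        """Hamming distance between two domains' signal-head sets.
        <0.20 pivot to static mask · 0.20-0.40 within-vertical · >0.40 ship."""
        a, b = top_quartile(imp_a), top_quartile(imp_b)
        return float(np.mean(a != b))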

Pivot · Hamming < 20% · Static mask

Signal heads are domain-invariant. Ship M5 as a sparse-attention inference optimizer (~$300K of value). Pivot the substrate-mechanism thesis to M2 + M3 hybrid for R9.

The hard middle · 20–40% · Within-vertical

M5 may work for high-affinity domain pairs (legal sub-domains) but not cross-vertical (legal × Crafter). Scope Modulum-7B-Native to within-vertical specialization first; defer cross-domain to A1 (CSA) work.

Ship · Hamming > 40% · Stably routable

Routes are domain-specific. M5 is on solid ground. Proceed with kernel work. The full 12-week build is justified.

08 · 12-Week Plan

Codex's plan, with the R8 modifications.

Inside R7.7 Option B's $550K, 12-week envelope. Inference-first artifact for the week-2 gate; route-aware adapter and continued pretrain follow.

Weeks 1–2 · Foundation

Days 1–3: head-importance stability test across 5–10 PDS pairs. Decision gate. If Hamming <20%, halt and pivot.

Days 4–10: PyTorch/Triton reference attention wrapper for Llama-compatible 7B. Runtime head + block masks.

Days 11–14: PDS training corpus from Crafter + one code-domain substrate task. Omnifact produces facts. HyperRemember retrieves top-k. First Head Affinity Table from brute-force ablation.

R7.7 GATE — ≥50% fewer substrate tokens than RAG at equal accuracy · MMLU within 2pts · SRD baseline measured
Weeks 3–6 · Attribution

Systematic head-importance sweeps across Crafter and repo tasks. Learn A[layer, head, route_feature].

4-arm Crafter ablation (A: base, B: M1 RAG, C: M5, D: shuffled-mask negative control).

Kernel work moves from reference masking to skip-aware paged attention. Profile active head ratio, active KV block ratio, per-token latency, attention FLOP reduction.

STABILITY GATE — paraphrased queries must show <15% Hamming distance
Weeks 7–10 · Modulum-Native run

Train a route-aware adapter, or continue pretraining a 7B base, on substrate-conditioned examples. Loss = LM loss + α · consistency loss + β · substrate task loss (composition sketched after the target list below).

  • Crafter score: ≥+8 absolute over RAG baseline OR 1.5× sample efficiency.
  • SRD: ≥+5 absolute (route stability under paraphrase).
  • Decode speed: ≥2.0× over serialized-PDS RAG at 4K context; stretch 3.0×.
  • MMLU: within 2 points of base when no PDS mounted.
  • Cross-domain generalization: train Crafter, test routing stability on repo with affinity-table adaptation only.
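The Phase-2 objective restated as code; the α/β weights and the inner definitions of the consistency and substrate terms are open choices, so this fixes only the composition:

    def modulum_native_loss(lm_loss, consistency_loss, substrate_loss,
                            alpha=0.1, beta=0.5):
        # alpha/beta are placeholder weights, not values fixed by the plan
        return lm_loss + alpha * consistency_loss + beta * substrate_loss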
Weeks 11–12 · Ship

Four artifacts:

  • OSS MaskGate runtime package for Llama-7B-compatible models.
  • HuggingFace checkpoint or adapter trained for route-aware inference.
  • Eval suite with the 4-arm Crafter ablation (RAG, static checklist, MaskGate, shuffled MaskGate).
  • Paper-style technical report. Public claim narrowed: "PDS-conditioned attention routing reduces substrate token dependence and decode attention cost while preserving task performance."
09 · Threads Resolved

How M5 composes with the R7-family unfinished list.

Eight unfinished threads from the R7-family final deck. M5 either composes natively, partially, or blocks each.

Thread | Composes | How
PDS-as-Bayesian-Prior (M2) | YES · downstream | M2 is a layer above M5, applied to the LM head of a model whose attention is already PCHR-routed. They compose; M5 ships first. (Claude framing.)
KV-as-Substrate granularity | YES · fundamental | M5 implies per-fact granularity for high-confidence facts; per-section for low-confidence. Cache partitioning becomes mask partitioning.
In-Place TTT × PDS (M4) | YES · complementary | TTT = "model has learned domain" (slow path). M5 = "model is configured for domain" (fast path). Modulum's "no weights modified" is incompatible with M4 as primary but compatible as supplemental.
Multi-PDS arbitration (Hypernym Court) | YES · native | Mounting two PDSes = composing two route plans. Conflicts surface as conflicting head activations on the same token; runtime applies precedence (higher confidence > older > deeper-nested). Foundation for A1 (CSA).
Substrate diffing as scientific method | YES · native | Route plans are diffable bitsets. ‖M_run1 XOR M_run2‖ is a mechanical disagreement metric. Forge's CXDB disputes edge becomes a route-Hamming filter.
Living Corpus / EchoStream | PARTIAL | M5 supports between-action streaming (route updates between dispatches). Mid-token route updates are possible but risky; they destabilize generation. First implementation: session-level only.
Counterfactual Substrate | YES · cheap | "What if F were different" = fork activation set, recompile route plan, compare diffs. M5 counterfactuals are route-plan diffs (compact) vs M1 prompt branches (expensive) vs M4 weight-update branches (very expensive).
Vocab-window discipline at scale | M5 makes it MORE important | Codex's expected curve: 0–200 tokens (precision high), 200–1K (usable with strong typing), 1K–5K (lexical triggers fail, embeddings dominate), >5K (route entropy rises without ontology). Knee around 1K–2K for a 7B model. SectorPack legal needs two-stage routing.

M5 partially blocks pure M3 economics and pure M4 persistence. It can use cache infrastructure and feed TTT, but it rejects either as the primary substrate mechanism.

10 · Open R9

The next round.

R8 did not resolve these. Each is a candidate seed for R9. The natural R9 picks one of the five adjacent possibles (A2, Causal Route Replay, is the prerequisite for A1, A3, A4) and develops it concretely.

  • Which adjacent possible (A1–A5) does R9 develop? A2 (Causal Route Replay) is the prerequisite primitive for three of the five, the natural pick. A4 (Provable-Non-Hallucination) is the highest-leverage outlier: it could reshape how regulated buyers see the entire memory-vendor category.
  • What is the exact threshold formula on head Hamming distance? Claude's <20% / >40% split is reasonable but not derived. Should domains be clustered hierarchically, with per-tier thresholds? The domain-pair distribution may not be uniform.
  • Can the Head Affinity Table generalize zero-shot to a new domain, or does each domain require its own attribution sweep? This determines whether Modulum-7B-Native is "a model" or "a model + per-domain calibration."
  • What is the latency budget for mask compilation? All five models put it pre-attention; none gave hard numbers. If mask compilation costs >10 ms, it eats the decode speedup. An SRAM-resident mask compiler is required.
  • Is the route-aware adapter (Phase 2) actually necessary for the first artifact, or can frozen-base + runtime mask work alone? Codex's plan says Phase 1 is enough for the week-2 gate; Phase 2 is for production quality. Open question.
  • What is the convergent failure mode if the head-stability test fails? Three of five models pivot to an M2 + M3 hybrid. Two (Codex, Gemma) pivot to a different framing of M5 (request-local attention pruning vs substrate routing). Resolve before week 1 starts.
  • Does mask compilation need to be learned end-to-end, or is the two-stage HyperRemember-then-router pipeline sufficient? End-to-end may produce better masks but loses staged routing's auditability.
  • Cross-architecture transferability. The 75%-noise study covered four architectures (Llama, MiniMax, +2). Do head-affinity tables transfer across architectures, or does each architecture need its own bootstrap?
  • Can M5 be empirically distinguished from "really good prompt engineering" at scale? Codex's hardest falsifier (the hand-written 20-fact checklist) handles small tasks. What is the equivalent for SectorPack legal at 5K vocab tokens?
  • What is the publication strategy? A narrow empirical paper ("PDS-conditioned attention routing") suffices for the week-12 ship. A broader architectural claim ("Reality Substrate") requires CSA + Causal Route Replay evidence in hand.