mapv10 Performance and Streaming Hardening Spec

This spec is the charter for the mapv10 performance/streaming hardening sequence. PR 1 through PR 4 have landed, R-24 removed the R-23 terrain create/dispose warning owners, and R-25 removed the remaining route upload/build warning owners. Remaining work is now budget ratcheting, scenario/process hardening, and fresh evidence-driven renderer fixes if new owners appear. It supersedes the earlier "broaden performance budgets" framing in next-work.md's prior T2-7 wording.

The thesis is harden and extend, not "build the strategic-map foundation." mapv10 already ships province-ID rasters, square-packed 2D LUTs, the border-SDF terrain shader, the worker pool, geometric-error terrain SSE selection with strict resolver fallback and predicted preload, the frame budget for auxiliary route batching, memoized label textures, the zoom-trace recorder, and enforced performanceBudget.zoomTrace gates. As of R-25, the historical canonical suite was green with zero hard failures and zero warnings. PROC later recorded that the current committed scenario artifact is incomplete and that the throttled slow-network omitted-slot metric is non-blocking stress evidence for a local-only product.

1. Scope and non-goals

In scope:

Add the lifecycle and hitch metrics that are not yet in Mapv10MetricBudget so subsequent optimization waves have owner frames.
Migrate the performanceBudget.zoomTrace schema to support distinct warn and fail groups without breaking existing scenario JSONs.
Frame-budget label and vector creation/disposal using a candidate -> resident -> visible -> fading_out -> retiring -> retired state machine.
Formalize parent fallback as a strict contract with eight invariants and resolver tests.
Keep terrain LOD on true terrain geometric-error screen-space error and ratchet lifecycle budgets from measured scenario evidence.

Out of scope:

Backend-neutral runtime abstraction. mapv10 is a generator and viewer; portability is data-shaped, not runtime-shaped.
Rebuilding the semantic overlay pipeline. Province-ID + LUT + SDF borders are already runtime.
Replacing label collision, priority sorting, or zoom-band logic. Those work; the gap is lifecycle churn.
mapv7 import. mapv7 is not in mapv10's lineage; mapv8/mapv9 are the documented references and the relevant pieces are already in place.
The Valenar gameplay layer. T1-5 Valenar WorldData export is a deferred product lane, not part of this hardening spec, and Valenar keeps its dummy fixture until an explicit import/package step chooses otherwise.

2. Code-state baseline (audited 2026-05-06)

Already shipped — DO NOT reimplement:

Mapv10MetricBudget carries frameP95Ms, frameP99Ms, frameMaxMs, slowFrameCount33Ms, slowFrameCount50Ms, longTaskCount, maxSchedulerPending, maxFrameBudgetPending, cacheEvictionsDelta, cacheHitRateMin. See Mapv10MetricBudget in viewer/src/scenarios/scenarioTypes.ts.
Province ID raster + square-packed 2D LUT + MAX_TEXTURE_SIZE validation (R-6 in next-work.md).
Border-SDF terrain shader path with uBorderOpacity + setBorderSdfEnabled (R-4).
Worker pool: HTTP fetch, byte-length checks, typed-array decode, mesh world-coordinate transform, per-vertex normal computation, vector-tile JSON parse all off the main thread (see the mapv10 README quick-start fixture notes).
Geometric-error LOD selection with resolver fallback and predicted preload in ssePixelsByKey (viewer/src/renderer/lod/VisibilitySet.ts). The old hysteresis pair is gone; RenderResolver absorbs visual transitions.
Frame budget for auxiliary route batching (R-19) and per-frame route fan-out cap.
Memoized label canvas/texture pairs by text; footprint material cache by kind (see the mapv10 README quick-start fixture notes).
Strategic route graph generated from Location adjacency, height/slope cost, lake/river penalties (R-12).
Per-LOD route widths and LodVisualConfig for route opacity, border SDF opacity, label budgets per camera band (R-15).
Mapv10ZoomTraceReport recorder + performanceBudget.zoomTrace enforced gate; route auxiliary streaming capped to 64 missing assets per frame (R-18, R-19).
Stable-id labels with min/max zoom bands and priorities (referenced in T2-2).

Covered by this spec, with remaining open work called out:

Per-frame route batch count, draw-call proxy, upload-byte proxy, build-time observability. Landed in PR 1; route upload/build warnings were closed by R-25.
Per-frame mesh, material, texture create and dispose counters. Landed in PR 1.
Per-frame label create, remove, churn counters; resident, visible, fading counts. Landed in PR 1 and reduced by PR 2.
WebGLRenderer.info.render.calls and .triangles per zoomTrace sample. Landed in PR 1.
Fallback slot count, omitted slot count, fallback-duration frames per slot. Landed in PR 3; keep the gates active.
Versioned warn/fail budget groups under zoomTrace. Landed in PR 1; keep backward compatibility for existing flat budgets.
Per-tile geometricError, parentId, childIds, cost metadata, and a true geometric-error SSE selector. Landed in PR 4/R-23.
Strict-contract parent fallback resolver with explicit failed/canceled/partial/predicted-prefetch tests. Landed in PR 3.
Frame-budgeted label creation + disposal using a runtime state machine. Landed in PR 2; vector-specific follow-up remains metric-driven.

3. PR sequence and dependency chain

Historical ordering. PRs 1-4 landed in this order; future wave briefs should use the landed state plus the R-24/R-25 follow-ups rather than treating any step below as still open.

PR 1  Lifecycle + hitch metrics + warn/fail schema migration   (T2-7, revised)
PR 2  Label/vector frame-budgeted lifecycle                    (T2-7b, new)
PR 3  Strict Parent Fallback Contract                          (T1-2, landed as R-22)
PR 4  Terrain geometric-error SSE upgrade                      (T1-3, landed as R-23)

Deferred product lane: T1-5 Valenar WorldData export. Different code area, different acceptance gate, and no renderer lifecycle merge contention after R-25, but it is not the next renderer-system task.

Dependency reasons:

PR 1 unblocks PR 2: label-churn and per-frame label create/remove counts must exist before label budgeting can be measured.
PR 1 unblocks PR 3: fallbackDurationFrames must exist in Mapv10MetricBudget before the parent-fallback contract can be verified by scenarios.
PR 1 unblocks PR 4: route batch + draw-call proxy must exist before SSE upgrade can be verified to not regress route draw overhead.
PR 3 unblocks PR 4: SSE selection's "render parent and request children" path is the parent-fallback contract; the contract must be strict before it is exercised more aggressively by SSE.

4. PR 1 — Lifecycle + hitch metrics + warn/fail schema migration

4.1 New keys in `Mapv10MetricBudget`

routeBatchCountMax
routeDrawCallProxyMax
routeUploadBytesProxyMax
routeBuildMsMax

meshCreateCountPerFrameMax
meshDisposeCountPerFrameMax
materialCreateCountPerFrameMax
materialDisposeCountPerFrameMax
textureCreateCountPerFrameMax
textureDisposeCountPerFrameMax

labelCreateCountPerFrameMax
labelRemoveCountPerFrameMax
labelChurnPerFrameMax
labelResidentCountMax
labelVisibleCountMax
labelFadingCountMax

renderCallsMax
trianglesMax

fallbackSlotCountMax
omittedSlotCountMax
fallbackDurationFramesMax

consecutiveFramesOver33Ms
consecutiveFramesOver50Ms

consecutiveFramesOver{33,50}Ms is run-length, not count. It is a separate key from the existing slowFrameCount{33,50}Ms, which is total count. Both stay; they measure different failure modes.

4.2 Schema migration for warn vs fail

Pre-PR 1: Mapv10ScenarioPerformanceBudget extends Mapv10MetricBudget with zoomTrace?: Mapv10MetricBudget. Every nested key is hard-fail.

PR 1 target: zoomTrace accepts either shape:

type ZoomTraceBudget =
  | Mapv10MetricBudget
  | { fail: Mapv10MetricBudget; warn?: Mapv10MetricBudget };

Discriminator: presence of a fail key on the object. If absent, the object is treated as the legacy flat-fail shape. If present, fail keys hard-fail and warn keys report warnings without flipping the CI exit code.

Parser tests must cover four shapes:

Legacy flat zoomTrace with mixed keys.
New shape with fail only.
New shape with fail and warn.
Absent zoomTrace.

The two existing R-19 scenario IDs continue to pass hard gates. They may use the grouped shape so stale hard gates can move to warn-only while current hitch owners remain visible in performanceBudgetWarnings.

4.3 Renderer instrumentation

WebGLRenderer.info snapshot at frame end, written into the active zoomTrace sample.
Mesh, material, texture create/dispose counters incremented at all call sites (or behind a single creation/disposal helper) and sampled per frame.
Label create/remove/churn counters owned by the label system, emitted per zoomTrace sample.
Route batch/draw/upload counters owned by the route batcher, emitted per zoomTrace sample.
fallbackSlotCount, omittedSlotCount, and per-slot fallbackDurationFrames recorded by the planner/resolver and surfaced to zoomTrace.

Counters must be cheap. No per-frame allocations beyond a fixed-size sample object. Detailed traces enabled only when the scenario carries a zoomTrace block.

4.4 Acceptance for PR 1

The two existing R-19 scenarios (mapv10_zoom_continent_to_location_clean_lowland, mapv10_zoom_realm_to_location_overlays_z5) PASS hard-fail gates while recording warn-only lifecycle/hitch breaches.
Close z5 clean/overlay traces record the new lifecycle and hitch metrics.
Schema parser tests cover the four shapes in 4.2.
npm run scenarios reports warn vs fail separately in JSON and inline CLI status.
Real-browser pass via Chrome MCP at http://100.97.188.110:5443/?manifest=/mapv10/runs/continent-lod6/manifest.json: load, screenshot every map mode, no regression vs R-19 baseline. __mapv10Ready.errors empty.

5. PR 2 — Label/vector frame-budgeted lifecycle

Status after R-21: the label half of this PR has landed. Labels now allocate only after candidate selection, and label creation/disposal is capped by the label frame-budget lane. The remaining vector work is route/marker-specific and is tracked separately when metrics show it matters.

5.1 Runtime state machine

candidate -> resident -> visible -> fading_out -> retiring -> retired

candidate: passed priority, zoom-band, collision, residency checks; no GPU resources allocated yet.
resident: GPU resources (canvas, texture) allocated; not yet drawn.
visible: drawn this frame.
fading_out: visibility revoked; alpha is animating to zero.
retiring: alpha at zero, awaiting disposal slot.
retired: disposed.

Vectors (route ribbon overlays, footprint shapes) follow the same machine for create/dispose budgeting; they do not have visible/fading_out states in the same sense — for vectors, visible and fading_out collapse into a binary commit/decommit step that respects the per-frame budget.

5.2 Rules

Texture/material creation only after a candidate survives priority, zoom-band, collision, residency.
Label identity stable across nearby frames. Threshold-crossing toggles forbidden.
Hysteresis on visibility transitions (already partially present; extend to lifecycle transitions).
Caps in LodVisualConfig remain UX controls. They do not double as performance crutches.
Per-frame creation budget. Per-frame disposal budget. Both consumed by mesh budget for vectors and label-texture allocation budget for labels.
All counters from PR 1 emit per-frame: labelCreateCountPerFrameMax, labelRemoveCountPerFrameMax, labelChurnPerFrameMax, labelResidentCountMax, labelVisibleCountMax, labelFadingCountMax.

5.3 Acceptance for PR 2

Label churn drops measurably in the close z5 overlay scenario versus PR 1 baseline.
frameMaxMs improves versus PR 1 baseline (not just frameP95Ms).
No readability regression. Real-browser pass via Chrome MCP at every map mode and at the canonical zoom (continent -> realm -> province -> location).
LodVisualConfig caps untouched. UX surface unchanged.
Canonical zoom scenarios record lifecycle counters and assert churn-per-frame via labelChurnPerFrameMax.

6. PR 3 — Strict Parent Fallback Contract

Status after R-22: landed. RenderResolver now owns the per-primary fallback ledger and scenario zoomTrace consumes those metrics directly. Historical R-22 evidence recorded non-zero fallback duration with zero omitted slots; later remote-throttle evidence is tracked as non-blocking stress evidence under the local-only product decision.

6.1 Tile runtime state model

unseen -> requested -> decoding -> resident -> rendered -> stale -> retiring -> failed

Failed and canceled are siblings of requested / decoding; they re-enter unseen after counting toward fallback metrics.

6.2 Eight invariants

A parent tile remains renderable until ALL required children for the current view are resident AND committed.
A failed child request does not unmount the parent.
A canceled child request does not unmount the parent.
If no direct child is resident, the resolver uses a resident ancestor or the previous rendered tile when available.
The "no candidate resident: omit slot" path is explicitly counted (omittedSlotCount) and budgeted.
Disposal is frame-budgeted and never co-occurs with a large child commit when that demonstrably causes a hitch.
Fallback duration is recorded in zoomTrace as fallbackDurationFrames per slot.
Predicted-snapshot prefetches respect the same contract — predicted children that fail or get canceled do not destabilize the current render.

6.3 Resolver tests

child pending -> parent remains visible
child failed -> parent remains visible
child canceled -> parent remains visible
partial children ready -> parent fills only the missing slots
previous rendered tile fallback works during planner churn
omittedSlotCount increments when no fallback exists
fallbackDurationFrames appears in scenario output for a constructed slow-stream scenario
predicted-prefetch failure does not unmount the current render

These attach to RenderResolver and TileStreamingPlanner (and any new fallback resolver introduced by the wave).

6.4 Acceptance for PR 3

All resolver tests above pass.
Scenario mapv10_continent_to_location_slow_network (using throttled fetch) records fallback metrics as future-resilience evidence. Because mapv10 is a local-only product, remote-throttle omitted slots are warning evidence rather than a blocking product gate.
Existing R-19 scenarios show fallbackDurationFramesMax ≤ baseline value chosen during the wave.
Real-browser pass: no visible terrain holes during forced slow zoom; oblique close shot during slow stream shows parent terrain, not blank ground.

7. PR 4 — Terrain geometric-error SSE upgrade

Status after R-25: landed, paced, and green in the historical canonical suite. VisibilitySet consumes TileCoordinate.geometricError through the standard terrain SSE formula and reports true geometric-error pixels in ssePixelsByKey. Tile manifests/types/schemas carry tileId, parentId, childIds, 3D bounds, measured generator geometricError, and cost hints. __mapv10TerrainSelectionProbe exposes the runtime selector state. R-24 then staged terrain render commits, capped terrain sidecar texture creation, capped retired-terrain disposal, and ratcheted the clean fallback-slot warn baseline to 32 because the strict resolver intentionally records fallbackSlotCountMax=28 with omittedSlotCountMax=0. R-25 removed unused route normals, bypassed merge copies for one-member route batches, normalized oversized intermediate route aggregates to the coarse z1 fallback where available, and canceled never-drawn stale route batch queue work. The 2026-05-06 canonical scenario proof passed all 23 scenarios with 0 hard failures and 0 warnings; newer 27-definition scenario evidence must be regenerated because the committed latest artifact stores only 25 result entries.

7.1 Why this is an upgrade, not a rewrite

Pre-R-23, VisibilitySet selected on ssePixelsByKey = max(widthPx, heightPx) — the projected screen size of the tile bounds. That was not geometric error; the variable name was historically loose. R-23 replaced that with true geometric-error SSE while keeping the existing planner, strict resolver fallback, and predicted-snapshot machinery.

7.2 New tile metadata

Schema additions in schema/ and the tile manifests:

parentId: string | null
childIds: string[]
bounds: { minX, minY, minZ, maxX, maxY, maxZ } (verify already-present; codify if partial)
geometricError: number — max world-space deviation between this tile and the next finer level or the source heightfield, in metres.
estimatedDecodeMs?, estimatedUploadBytes?, estimatedTriangles? — optional cost hints for prioritization.

7.3 Selection rule

geometricSSE = (geometricError * viewportHeightPx)
             / (2 * tan(verticalFovRadians / 2) * max(distanceToCamera, eps))

rasterSSE    = (sampleFootprintMetres * viewportHeightPx)
             / (2 * tan(verticalFovRadians / 2) * max(distanceToCamera, eps))

if frustum-culled: skip
else if geometricSSE <= 1.5 px AND rasterSSE <= 6 px: render this tile
else if children resident: render children
else: render parent (R-22 contract) and request children with priority
       weighted by SSE excess + camera-center distance + predicted camera velocity

The selector enforces two SSE targets, not one. The 1.5 px target governs height-displacement (geometric) error per the standard Cesium / 3D Tiles formula; that is the AAA SSE target this spec was originally written against. The 6 px target governs the per-tile raster sidecar sample footprint — water masks, splat weights, biome ids, border SDFs, normal/roughness. Geometry SSE alone misses chunky-coastline / chunky-water-mask / chunky-splat crawl, because the visible "terrain" is a coupled product: mesh displacement plus per-tile raster sidecars, and the coarse-LOD raster sample projects to many more pixels than the coarse-LOD height-error metric implies.

Both targets are locked. TERRAIN_GEOMETRIC_ERROR_TARGET_PX = 1.5 and TERRAIN_RASTER_SAMPLE_ERROR_TARGET_PX = 6.0 live in viewer/src/renderer/lod/VisibilitySet.ts. Future changes require visual and scenario evidence that a target itself is wrong, not "raise the target so scenarios pass." See wave-3b-lod-ladder-camera-decision.md for the ladder-vs-camera architectural decision this rule currently surfaces at close-zoom camera bands.

7.3a Wave 3a: post-Wave-2 ladder vs target gap

After Wave 2 of the LOD/Data-Coherence Foundation work landed (resolver redesign, protect-set, ancestor walk, sidecar contract), the explorer measured the deepest LOD in the canonical ladder (z=5) against the camera envelope. Result: the current z=0…z=5 ladder cannot satisfy the 6 px raster-sample target at altitudes below approximately 226 km — the location-zoom camera band (60–200 km clearance) operates above the documented target by design. The geometric (1.5 px) target is satisfied across the same band because tile.geometricError scales much more aggressively than the raster sample footprint.

The Wave 3a implementation track:

did NOT raise either target. TERRAIN_GEOMETRIC_ERROR_TARGET_PX stays at 1.5 px; TERRAIN_RASTER_SAMPLE_ERROR_TARGET_PX stays at 6.0 px.
did NOT raise the camera floor. OrbitControls.minDistance stays at the existing envelope.
did NOT extend the ladder. The generator remains the source of authority.
DID split the telemetry by role and metric so the gap is visible. terrainSelectionProbe.primaryRasterSsePxMax reports the unmasked raster SSE on primary tiles only; the structural-parent's mechanically-2× number lives in underlayRasterSsePxMax.
DID calibrate per-scenario gates to the post-Wave-2 measured ladder. Close-zoom oblique scenarios at <100 km clearance carry primaryRasterSsePx ≤ 20–30 px bounds with explicit note fields documenting that the bound is above the 6 px target by design.

The accepted ladder decision lives at wave-3b-lod-ladder-camera-decision.md: future work extends the physical LOD ladder with z6/z7. Do not raise the camera floor or relax the raster SSE target.

7.4 Hybrid rule (codified in code, not just docs)

Terrain geometry: SSE.
Terrain detail textures: SSE-permissible (decided per-asset family).
Labels, borders, claims, map modes, tooltip behavior: discrete continent/realm/province/location tiers. Unchanged.
Routes: tier-driven visibility and width (R-15) plus SSE-permissible far-zoom simplification only if measurement justifies it.

7.5 Acceptance for PR 4

New __mapv10TerrainSelectionProbe exposes per-tile sse, geometricError, and selection decision.
Resolver tests cover: low SSE keeps parent; high SSE requests children; missing children invoke parent fallback (re-using R-22 tests); out-of-frustum tiles are not requested; canceled children do not remove parent.
Existing R-19 scenarios PASS without frameP95Ms or frameMaxMs regression.
Close zoom (z5 lowland and z5 highland) shows visibly more triangulated foreground vs R-22 baseline at equal frame budget.
Real-browser pass: smooth zoom, no new holes, no flicker.

8. Open decisions captured by this spec

Warn/fail ratchet policy. Nested zoomTrace.fail / zoomTrace.warn groups are locked. Decide when current zero-warning baselines should graduate selected warning metrics to hard-fail.
SSE target pixel thresholds. Both TERRAIN_GEOMETRIC_ERROR_TARGET_PX = 1.5 and TERRAIN_RASTER_SAMPLE_ERROR_TARGET_PX = 6.0 remain locked. The Wave 3a measurement showed the current ladder cannot meet the 6 px raster target at close-zoom camera bands; the accepted architectural response is z6/z7 physical LOD extension in wave-3b-lod-ladder-camera-decision.md. Raising either target to mask the gap is forbidden.
Fallback-duration budget ratchet. omittedSlotCountMax=0 is hard-fail; fallback duration is measured and should become a tighter fail gate after repeated stable baselines.
Whether vectors share label state machine or get their own. Recommend shared mechanic with vector-specific collapsed states (see 5.1). Lock only if vector churn reappears as an owner metric.

9. Real-browser verification gate

Per Step 3 of wave-protocol.md and the scenario workflow in scenarios.md, every PR's Step 3 Validate MUST include a Chrome MCP browser session that:

Navigates to http://100.97.188.110:5443/?manifest=/mapv10/runs/continent-lod6/manifest.json.
Reads window.__mapv10Ready.errors and asserts empty.
Screenshots every current MapModeId value (geography, political, routes, height, slope, normal) plus any new modes the wave introduces.
Performs the canonical zoom (continent -> realm -> province -> location) once per scenario in the wave's scope.
Reports observed frameMaxMs and pendingRequestsMax against the scenario budget.

Code-grep PASS without browser PASS is a false PASS for any PR in this spec.

10. How `next-work.md` references this spec

T2-7 (revised) -> PR 1.
T2-7b (new, Tier 2) -> PR 2.
R-22 resolved T1-2 / PR 3.
R-23 resolved T1-3 / PR 4.
R-24 resolved T2-9 terrain lifecycle pacing/baseline ratchet from the R-23 warnings.
R-25 resolved T2-8 route adaptive batching/upload pacing from the R-24 warnings.
T1-5 (existing, Tier 1 wave) -> unchanged; runs in parallel.

When a wave brief quotes this spec, it should cite section number (e.g. "scope per docs/performance-and-streaming-hardening-spec.md §6") so Step 3 validators check against this charter, not against the implement agent's narrower self-definition.

1. Scope and non-goals​

2. Code-state baseline (audited 2026-05-06)​

3. PR sequence and dependency chain​

4. PR 1 — Lifecycle + hitch metrics + warn/fail schema migration​

4.1 New keys in Mapv10MetricBudget​

4.2 Schema migration for warn vs fail​

4.3 Renderer instrumentation​

4.4 Acceptance for PR 1​

5. PR 2 — Label/vector frame-budgeted lifecycle​

5.1 Runtime state machine​

5.2 Rules​

5.3 Acceptance for PR 2​

6. PR 3 — Strict Parent Fallback Contract​

6.1 Tile runtime state model​

6.2 Eight invariants​

6.3 Resolver tests​

6.4 Acceptance for PR 3​

7. PR 4 — Terrain geometric-error SSE upgrade​

7.1 Why this is an upgrade, not a rewrite​

7.2 New tile metadata​

7.3 Selection rule​

7.3a Wave 3a: post-Wave-2 ladder vs target gap​

7.4 Hybrid rule (codified in code, not just docs)​

7.5 Acceptance for PR 4​

8. Open decisions captured by this spec​

9. Real-browser verification gate​

10. How next-work.md references this spec​