2026 OPENCLAW FALLBACK CONFIG DRIFT


Configuring primary and fallback providers in OpenClaw protects uptime, until a successful fallback silently persists into openclaw.json and your declared primary never returns. This runbook separates three failure modes (disk drift, session drift, and provider parsing forks), walks through a five-step repair (stop the Gateway, back up the trio, restore intent JSON, prune stale session keys, run layered restart probes), and adds launchd-specific checks for remote Mac tenants. Cross-read the related notes on 429 failover logging, session store recovery, and LaunchAgent token drift.

1. Pain decomposition

Four problems overlap here. First, runtime convenience logic writes session-selected models back into the static agents config unless something guards against it. Second, provider-qualified IDs can diverge from bare aliases. Third, sessions.json can cache obsolete routing while channels still look healthy. Fourth, remote launchd environments diverge from edits made over interactive SSH.

2. Symptom matrix

Signal | Likely cause | Avoid
Disk primary renamed to backup | fallback write-back | git revert without session prune
Single channel wrong model | session carries runtime provider | blind reinstall
doctor allowlist mismatch | JSON vs session fork | endless npm bumps
Remote-only repro | plist env stale | half-copied plist

3. Five-step repair

Step 1 Stop Gateway

Stop the Gateway through its service manager (systemd or launchctl) and wait for open WebSocket connections to finish before touching any files.

Step 2 Backup trio

Back up openclaw.json and sessions.json, and record fingerprints of the relevant jsonl files.
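The backup step can be sketched as a small script that copies the two JSON files and records a SHA-256 fingerprint for every file involved. This is a minimal sketch, not an OpenClaw tool: the function names, the `manifest.json` file name, and the idea of passing explicit path lists are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_trio(json_files, jsonl_files, backup_dir) -> dict:
    """Copy the JSON files into backup_dir and fingerprint everything.

    json_files:  paths to copy (for example openclaw.json and sessions.json).
    jsonl_files: paths that are only fingerprinted, not copied.
    Returns a manifest mapping each source path to its SHA-256 digest.
    The manifest is also written to backup_dir/manifest.json (hypothetical name).
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for source in map(Path, json_files):
        shutil.copy2(source, backup_dir / source.name)
        manifest[str(source)] = sha256_of(source)
    for source in map(Path, jsonl_files):
        manifest[str(source)] = sha256_of(source)
    manifest_path = backup_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Comparing a later manifest against this one shows which files changed during the repair, which is what the probe steps need as a baseline.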

Step 3 Restore intent fields

Align agents.list[].model, defaults.primary, and fallbacks with the intended values; never change a model field without also setting its provider.
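To see which intent fields have drifted before editing anything, one option is to compare the live config against a known-good copy (for example the version in Git). The sketch below assumes a JSON shape with `agents.defaults.model` and `agents.list[]` entries carrying `id` and `model`; that nesting and the helper names are assumptions, so adjust them to your actual file.

```python
def intent_fields(config: dict) -> dict:
    """Collect the model-related fields that express routing intent.

    Assumed shape (hypothetical): config["agents"]["defaults"]["model"] holds
    the primary/fallbacks block, and config["agents"]["list"] is a list of
    agent entries with "id" and "model" keys.
    """
    agents = config.get("agents", {})
    fields = {"agents.defaults.model": agents.get("defaults", {}).get("model")}
    for index, agent in enumerate(agents.get("list", [])):
        key = agent.get("id", index)
        fields[f"agents.list[{key}].model"] = agent.get("model")
    return fields


def drifted_fields(intent: dict, live: dict) -> dict:
    """Return every intent field whose live value differs from the intended one."""
    want, have = intent_fields(intent), intent_fields(live)
    return {
        path: {"intent": want.get(path), "live": have.get(path)}
        for path in sorted(set(want) | set(have))
        if want.get(path) != have.get(path)
    }
```

An empty result means the disk config already matches intent, which points the investigation toward session drift instead.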

Step 4 Prune stale mappings

Remove session keys that point at unapproved runtime selections, and log every deletion.
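A hedged sketch of the prune step follows. It assumes sessions.json is a JSON object that maps session keys to entries, and that each entry may carry a `model` field naming the model it is pinned to; both the layout and the field name are assumptions, not a documented schema. Entries without a pinned model are kept.

```python
import logging

logger = logging.getLogger("session_prune")


def prune_stale_sessions(sessions: dict, approved_models: set,
                         model_field: str = "model"):
    """Split sessions into (kept, removed) without mutating the input.

    A session is removed when it pins a model that is not in approved_models.
    Each removal is logged so the deletion leaves an audit trail.
    """
    kept, removed = {}, {}
    for key, entry in sessions.items():
        model = entry.get(model_field) if isinstance(entry, dict) else None
        if model is None or model in approved_models:
            kept[key] = entry
        else:
            removed[key] = entry
            logger.warning("pruned session %s (model=%s)", key, model)
    return kept, removed
```

Returning the removed entries separately makes it easy to write them to a side file next to the backup, so a prune can be reversed if it turns out to be too aggressive.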

Step 5 Layered restart

Run openclaw doctor, restart the Gateway, then send three probe turns per channel and capture the logs for each.

jq '.agents.defaults.model' ~/.openclaw/openclaw.json > /tmp/model.before.json
# repair...
jq '.agents.defaults.model' ~/.openclaw/openclaw.json > /tmp/model.after.json
diff -u /tmp/model.before.json /tmp/model.after.json || true

4. Decision matrix

Trigger | Preferred | Fallback
Channels live but wrong model | tighten allowlist + JSON repair | nuclear session wipe
Shared gateway personas | freeze per persona | global single model override
launchd resurrects old env | validate WorkingDirectory vs JSON | adhoc shell exports

5. Case note

“Fallback saved the night — two weeks later audit found disk primaries stuck on the tiny backup.”

Multi-provider failover masked the persistence bug until compliance labels drifted. The backup, repair, mapping prune, and restart sequence restored intent, and MR templates now forbid silent fallback persistence. Governance classified the incident as a configuration integrity issue rather than an uptime issue, which was cheaper than repeating the audit drama.

6. Governance & observability

Treat JSON diffs like Terraform reviews. Emit structured logs that record the qualified model for each successful turn. A SIEM can then alert when the hash of agents.list changes without an accompanying change request (CR) ID.
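The hash-based alert can be sketched as follows. The canonical form (sorted keys, compact separators) is a design choice so that cosmetic reformatting of the file does not trigger an alert. The function names and the way a CR ID is passed in are hypothetical.

```python
import hashlib
import json


def agents_list_hash(config: dict) -> str:
    """Hash agents.list in a canonical form so key order does not matter."""
    agents_list = config.get("agents", {}).get("list", [])
    canonical = json.dumps(agents_list, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_alert(previous_hash: str, config: dict, change_request_id) -> bool:
    """True when agents.list changed and no change request ID accompanies it."""
    changed = agents_list_hash(config) != previous_hash
    return changed and not change_request_id
```

A scheduled job could store the last approved hash and forward a `needs_alert` result to the SIEM as a structured event.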

Remote Mac Gateways benefit from thermal stability yet tempt SSH-only edits — store launchd triple-check steps beside this runbook. Teams juggling Windows/Linux gateways may still prefer leasing MACGPU Apple Silicon nodes to reuse macOS scripts and Metal-adjacent tooling with clearer SLA ownership.

7. Provider qualification gates

Require explicit provider+model pairs in templates; verify channel overrides do not inherit stale persona pointers. Before restarting Gateway on remote hosts, diff printed launchctl env against doctor paths.

Bare aliases like gpt-5-mini are ambiguous once your registry hosts multiple vendors behind compatibility shims. Bake a lint step that rejects commits introducing model strings without a namespace, and mirror the same rule inside Helm/Kustomize overlays so Docker and bare-metal fleets cannot drift silently.
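A lint step of this kind might look like the sketch below. It assumes the namespace is written as `provider/model` and that model strings live under keys named `model`, `primary`, or `fallbacks`; both conventions are assumptions for illustration. Any string nested under one of those keys is checked, so unrelated string settings stored inside a model block would need an exclusion list.

```python
import re

# Assumed convention: "<provider>/<model>", for example "vendor-a/big".
QUALIFIED = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._:/-]+$")
# Hypothetical set of keys whose values are treated as model references.
MODEL_KEYS = {"model", "primary", "fallbacks"}


def find_bare_models(node, path="", under_model_key=False):
    """Walk parsed JSON and return (path, value) for unqualified model strings."""
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            problems += find_bare_models(
                value, child, under_model_key or key in MODEL_KEYS
            )
    elif isinstance(node, list):
        for index, value in enumerate(node):
            problems += find_bare_models(value, f"{path}[{index}]", under_model_key)
    elif isinstance(node, str) and under_model_key and not QUALIFIED.match(node):
        problems.append((path, node))
    return problems
```

In a pre-commit hook or CI job, a non-empty result would fail the commit and print the offending paths.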

For Telegram/Web UI/Cron triple stacks, draw a channel→persona→qualified model matrix in your wiki. When one row still references the fallback persona while others show primary, your incident is almost certainly session skew—not provider outage.
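The matrix check described above can be expressed in a few lines. This is a sketch under assumptions: each row is a dict with hypothetical `channel`, `persona`, and `model` keys, and the labels returned by `diagnose` are made up for illustration.

```python
def skewed_rows(matrix, declared_primary):
    """Return the rows whose qualified model differs from the declared primary."""
    return [row for row in matrix if row["model"] != declared_primary]


def diagnose(matrix, declared_primary):
    """Classify the matrix using the rule of thumb from the text.

    Some rows off primary while others are on it suggests session skew.
    Every row off primary suggests a wider cause (disk drift or an outage).
    """
    off = skewed_rows(matrix, declared_primary)
    if not off:
        return "ok"
    if len(off) == len(matrix):
        return "all-rows-off-primary"
    return "session-skew"
```

For example, a matrix where only the Telegram row still shows the fallback model would be classified as session skew, which matches the pruning path in the five-step repair.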

8. Numeric gates

Run three probe rounds before and after the repair and keep log fingerprints for each. Record timestamps and HTTP statuses whenever the provider switches. Freeze fallback topology changes after two weekly drift incidents, pending a code audit. Keep sha256 checksums of session directory backups.

Add two spreadsheet-ready metrics for finance partners: hourly burn before/after repair (token × price) and mis-tagged traffic percentage pulled from structured logs. When fallback persistence hides inside green dashboards, cost anomalies often surface first.
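Both metrics reduce to simple arithmetic over structured log records. The sketch below assumes each record carries hypothetical `tokens`, `declared_model`, and `served_model` fields and that prices are expressed per token; none of these names come from OpenClaw itself.

```python
def hourly_burn(records, price_per_token, hours):
    """Average spend per hour: sum of (tokens x price of the served model) / hours."""
    total = sum(
        record["tokens"] * price_per_token[record["served_model"]]
        for record in records
    )
    return total / hours


def mistagged_percentage(records):
    """Share of turns (in percent) served by a model other than the declared one."""
    if not records:
        return 0.0
    wrong = sum(
        1 for record in records
        if record["served_model"] != record["declared_model"]
    )
    return 100.0 * wrong / len(records)
```

Running both functions on the log window before the repair and again after it gives the two before/after columns for the spreadsheet.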

9. FAQ

Delete sessions only? Insufficient if the JSON is corrupted.

Docker? Align volume HOME paths.

Upgrade now? Repair first.

Collaboration? Use lockfiles or change windows; Git intent wins conflicts.

Disable fallback? Narrow it instead of deleting lifelines.

Gradual rollout? Split personas and never share one agents.list entry between canary and prod chains; duplicate the block with distinct CR IDs.