2026 OPENCLAW
FALLBACK_
CONFIG_
DRIFT.
Configuring primary + fallback providers in OpenClaw protects uptime — until a successful fallback silently persists into openclaw.json and your declared primary never returns. This runbook separates disk drift vs session drift vs provider parsing forks, walks a five-step repair (stop Gateway → backup trio → restore intent JSON → prune stale session keys → layered restart probes), and adds launchd-specific checks for remote Mac tenants. Cross-read 429 failover logging, session store recovery, and LaunchAgent token drift.
1. Pain decomposition
Runtime convenience writes session-selected models back to static agents config unless guarded. Provider-qualified IDs may diverge from bare aliases. sessions.json may cache obsolete routing while channels look healthy. Remote launchd environments diverge from interactive SSH edits.
2. Symptom matrix
| Signal | Likely cause | Avoid |
|---|---|---|
| Disk primary renamed to backup | fallback write-back | git revert without session prune |
| Single channel wrong model | session carries runtime provider | blind reinstall |
| doctor allowlist mismatch | JSON vs session fork | endless npm bumps |
| Remote-only repro | plist env stale | half-copied plist |
3. Five-step repair
Step 1 Stop Gateway
systemd or launchctl — ensure WebSockets finish.
Step 2 Backup trio
openclaw.json, sessions.json, relevant jsonl fingerprints.
Step 3 Restore intent fields
Align agents.list[].model, defaults.primary, fallbacks; never change model without provider.
Step 4 Prune stale mappings
Remove keys pointing at unapproved runtime selections; log deletions.
Step 5 Layered restart
openclaw doctor, restart Gateway, three probe turns per channel with logs.
4. Decision matrix
| Trigger | Preferred | Fallback |
|---|---|---|
| Channels live but wrong model | tighten allowlist + JSON repair | nuclear session wipe |
| Shared gateway personas | freeze per persona | global single model override |
| launchd resurrects old env | validate WorkingDirectory vs JSON | adhoc shell exports |
5. Case note
“Fallback saved the night — two weeks later audit found disk primaries stuck on the tiny backup.”
Multi-provider failover masked the persistence bug until compliance labels drifted. Backup→repair→mapping prune→restart restored intent; MR templates now forbid silent fallback persistence. Governance classified it as configuration integrity, not uptime — cheaper than repeating audit drama.
6. Governance & observability
Treat JSON diffs like Terraform reviews. Emit structured logs with qualified models per successful turn. SIEM can alert when agents.list hashes move without CR IDs.
Remote Mac Gateways benefit from thermal stability yet tempt SSH-only edits — store launchd triple-check steps beside this runbook. Teams juggling Windows/Linux gateways may still prefer leasing MACGPU Apple Silicon nodes to reuse macOS scripts and Metal-adjacent tooling with clearer SLA ownership.
7. Provider qualification gates
Require explicit provider+model pairs in templates; verify channel overrides do not inherit stale persona pointers. Before restarting Gateway on remote hosts, diff printed launchctl env against doctor paths.
Bare aliases like gpt-5-mini are ambiguous once your registry hosts multiple vendors behind compatibility shims. Bake a lint step that rejects commits introducing model strings without a namespace, and mirror the same rule inside Helm/Kustomize overlays so Docker and bare-metal fleets cannot drift silently.
For Telegram/Web UI/Cron triple stacks, draw a channel→persona→qualified model matrix in your wiki. When one row still references the fallback persona while others show primary, your incident is almost certainly session skew—not provider outage.
8. Numeric gates
Three probe rounds pre/post repair with log fingerprints; record timestamps + HTTP statuses on provider switches; freeze fallback topology changes after two weekly drift incidents pending code audit; sha256 backups for session dirs.
Add two spreadsheet-ready metrics for finance partners: hourly burn before/after repair (token × price) and mis-tagged traffic percentage pulled from structured logs. When fallback persistence hides inside green dashboards, cost anomalies often surface first.
9. FAQ
Delete sessions only? Insufficient if JSON corrupted. Docker? Align volume HOME paths. Upgrade now? Repair first. Collaboration? Lockfiles / change windows; Git intent wins conflicts. Disable fallback? Narrow instead of deleting lifelines. Gradual rollout? Split personas and never share one agents.list entry between canary and prod chains—duplicate the block with distinct CR IDs.