1. Why upgrades are more than swapping binaries
(1) State root drift: older drops used ~/.moltbot or legacy CLAWDBOT_*/MOLTBOT_* variables; current lines expect ~/.openclaw plus OPENCLAW_*. Upgrading the package without migrating data yields a split-brain config.
(2) doctor --fix limits: it repairs common drift but not expired channel tokens, reverse-proxy path changes, or bespoke skill mounts — review diffs before applying.
(3) Device auth v2: nonce signing raises the client floor; stale CLIs look “connected” yet never enter the workflow. launchd without the same env as an interactive shell amplifies “works locally, fails on server.”
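The "works locally, fails on server" pattern usually reduces to environment drift you can make visible with a plain diff. A minimal sketch, assuming you have captured the daemon's environment into a file (for example via launchctl print on the Mac); the paths and the placeholder daemon dump are assumptions:

```shell
# Sketch: compare OPENCLAW_* keys between the interactive shell and a daemon
# environment dump. daemon_env.txt stands in for output captured from the
# launchd side on the remote Mac.
set -eu
workdir=$(mktemp -d)

# Interactive shell side: filter the current env down to the relevant prefix.
env | grep '^OPENCLAW_' | sort > "$workdir/shell_env.txt" || true

# Daemon side (placeholder content for illustration).
cat > "$workdir/daemon_env.txt" <<'EOF'
OPENCLAW_CONFIG_ROOT=/Users/svc/.openclaw
EOF

# Any diff output here is drift that a green local doctor cannot see.
diff -u "$workdir/daemon_env.txt" "$workdir/shell_env.txt" || echo "env drift detected"
```

Run this once per upgrade window and attach the diff to the change ticket; an empty diff is the evidence, not the absence of complaints.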
2. Migration matrix
| Axis | Before | Target |
|---|---|---|
| Secrets root | ~/.moltbot, scattered legacy env | ~/.openclaw; canonical OPENCLAW_* |
| Daemons | plist/systemd with stale paths | Explicit EnvironmentVariables blocks |
| Auth generation | v1 signatures lingering | v2 everywhere; upgrade clients in lockstep |
3. Five-step runbook
- Snapshot: tarball the state dir, workspace, plist/unit, `openclaw --version` output, and webhook screenshots.
- Read-only diagnosis: run `openclaw doctor` without `--fix`; capture `status`/`gateway`/logs.
- Controlled fix: trial `doctor --fix` on a copy; apply to prod only after reviewing diffs.
- Env sweep: `grep -R` for legacy prefixes in shell, CI, and plist files.
- Smoke + rollback: channel probe, minimal skill, metrics; on failure follow the Gateway rollback article.
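The snapshot step can be one small ritual: state tree plus a version record in a single timestamped tarball. A sketch where every path is a temp stand-in for your real state dir, and the version line is placeholder text for real `openclaw --version` output:

```shell
# Sketch of the snapshot step. STATE_ROOT stands in for ~/.openclaw.
set -eu
STATE_ROOT=$(mktemp -d)
echo 'demo-secret' > "$STATE_ROOT/credentials"

SNAP_DIR=$(mktemp -d)
STAMP=$(date +%Y%m%d-%H%M%S)
echo 'openclaw 1.2.3 (placeholder)' > "$SNAP_DIR/version.txt"

# One archive per change ticket: state tree and version record together.
tar -czf "$SNAP_DIR/openclaw-pre-upgrade-$STAMP.tgz" \
  -C "$(dirname "$STATE_ROOT")" "$(basename "$STATE_ROOT")" \
  -C "$SNAP_DIR" version.txt

tar -tzf "$SNAP_DIR/openclaw-pre-upgrade-$STAMP.tgz"
```

Keep the tarball for at least one release cycle; the rollback path in step five assumes it exists.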
4. Citeable thresholds
Discussion numbers — replace with your release notes:
- If three consecutive `doctor` runs report the same drift blocking channels, pause minor upgrades until directory/env migration completes.
- When auth errors exceed 20%/hour tied to one client build, upgrade that client to v2 before restarting the Gateway again.
- If plist env differs from the interactive shell on any secret-bearing `OPENCLAW_*` key, the change is not gate-ready.
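The 20%/hour gate is easy to script against whatever counter source you have. A sketch assuming a simple counts file (total vs auth_failed per hour); the file format is an assumption, so adapt the parsing to your real metrics source:

```shell
# Sketch: compute the auth-failure share for the last hour and compare it
# against the 20% gate above. The counts file format is an assumption.
set -eu
counts=$(mktemp)
cat > "$counts" <<'EOF'
total 500
auth_failed 130
EOF

pct=$(awk '/^total/ {t=$2} /^auth_failed/ {f=$2} END {printf "%d", f*100/t}' "$counts")
echo "auth failure share: ${pct}%"

if [ "$pct" -gt 20 ]; then
  echo "gate: upgrade the offending client build before restarting Gateway"
fi
```

Wire the same check into the change window dashboard so the threshold is evaluated by a script, not by memory.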
5. Remote Mac launchd matrix
| Symptom | Likely cause | Action |
|---|---|---|
| Manual start OK, launchd fails | Missing PATH/API keys in plist | Inject explicit EnvironmentVariables |
| One channel mute post-upgrade | Webhook/signature defaults shifted | Re-pair per upgrade notes; see silent Gateway guide |
| Skill installs but execution hits EACCES | Tighter sandbox or workspace path | Re-check permissions; see attack-surface hardening |
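The "inject explicit EnvironmentVariables" action from the first row looks roughly like this inside the plist; the key names beyond PATH and all paths are placeholders for your install:

```xml
<!-- Hypothetical launchd plist excerpt; values are placeholders. -->
<key>EnvironmentVariables</key>
<dict>
    <key>PATH</key>
    <string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin</string>
    <key>OPENCLAW_CONFIG_ROOT</key>
    <string>/Users/svc/.openclaw</string>
</dict>
```

The point is that the daemon inherits nothing from your login shell; every secret-bearing key must appear here explicitly.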
6. FAQ
Does --fix touch workspace? It may rewrite config pointers; snapshot first.
Docker installs? Pick a single source of truth between host bind-mounts and container paths — never both casually.
Downgrade auth to v1? Only in isolated labs with a sunset date; production should upgrade clients.
Should I delete ~/.moltbot immediately? Only after verifying the new tree contains equivalent secrets and channels validate; keep the tarball for at least one release cycle.
What if doctor reports green but channels are still mute? Escalate to webhook delivery logs and signature headers — doctor validates local files, not upstream SaaS configuration drift.
How often to rerun the ladder after minor bumps? After every minor that mentions config or auth in release notes; for pure doc-only releases, a reduced smoke (one channel + one skill) still pays off.
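Before deleting ~/.moltbot per the FAQ above, a file-inventory comparison catches secrets that never made it across. A sketch using temp directories as stand-ins for the old and new roots:

```shell
# Sketch: every file under the legacy root must have a counterpart under
# the new one before the legacy tree is archived or deleted.
set -eu
old=$(mktemp -d)   # stand-in for ~/.moltbot
new=$(mktemp -d)   # stand-in for ~/.openclaw
mkdir -p "$old/credentials" "$new/credentials"
echo tok > "$old/credentials/slack"
echo tok > "$new/credentials/slack"

if ( cd "$old" && find . -type f ) | while read -r f; do
     [ -f "$new/$f" ] || exit 1
   done
then
  status=mirrored
else
  status=missing
fi
echo "legacy tree: $status"
```

This checks presence, not content; for secret values, spot-check a handful manually rather than diffing plaintext into logs.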
7. Case study: silent Slack after two minors
Stacked issues: stale binary path in plist, token only in .zshrc, Slack signature header mismatch. Fixing any single layer produced flaky recovery; full env diff zeroing plus channel re-pair stabilized the system.
Archiving doctor output and plist diffs per upgrade cut repeat incidents from hours to minutes and satisfied audit asks.
The team added a lightweight cron that posts a synthetic “ping” message through each configured channel after every deploy; failures page on-call within five minutes instead of waiting for a human to notice silence during a holiday.
Unlike a big-bang 2.0 story, this runbook targets frequent minor bumps; if you still run MoltBot-era trees, finish the migration article first.
Post-upgrade, revalidate bind addresses and volume mounts — new defaults may tighten exposure.
Tokenizer and proxy-timeout style regressions also appear as “random” channel loss; version those alongside OpenClaw.
Archive shasum (or equivalent) of the resolved openclaw binary for each change ticket. Rollbacks must restore both package version and plist binary path — otherwise you downgrade the npm tree while launchd still points at an orphaned binary. On multi-user hosts, mismatches between the daemon UserName and the account that holds API keys produce intermittent auth failures that look like flaky networks.
If Homebrew and npm-global copies coexist, force a single PATH winner before running doctor; otherwise fixes apply to one copy while launchd launches the other, yielding “already latest” messages alongside stale errors.
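Resolving the single PATH winner and hashing it can be one small step per change ticket. A sketch with a stub binary in a temp dir standing in for the real install:

```shell
# Sketch: resolve which binary the PATH actually wins, hash it, and record
# both into ticket metadata. The stub stands in for the real openclaw binary.
set -eu
bindir=$(mktemp -d)
printf '#!/bin/sh\necho openclaw-stub\n' > "$bindir/openclaw"
chmod +x "$bindir/openclaw"
PATH="$bindir:$PATH"

resolved=$(command -v openclaw)   # the one copy launchd must also point at
digest=$( (shasum -a 256 "$resolved" 2>/dev/null || sha256sum "$resolved") | awk '{print $1}')

ticket=$(mktemp)
{
  echo "binary_path=$resolved"
  echo "sha256=$digest"
} > "$ticket"
cat "$ticket"
```

Rollback then means restoring both lines: reinstall the recorded digest and point the plist back at the recorded path.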
Run a dry-run upgrade on a staging Mac that mirrors production plist keys before touching the remote host. Diff the generated doctor output and ensure no unexpected file moves touch your workspace bind mounts. When multiple channels are enabled, replay webhooks from a recorded fixture rather than spamming real users.
Document the exact package manager channel (npm vs pnpm global vs container digest) in the change ticket. Mixed channels are the fastest way to “fix” the wrong state directory and double downtime.
8. Observability during the change window
Track Gateway exit codes, channel probe success rate, skill cold-start median, auth-failure logs/hour. All four bad → env/root; auth-only spike → client version matrix.
Correlate auth failures with client user-agents or device IDs when logs expose them; a single outdated mobile build can dominate error share while desktop users stay healthy, misleading triage toward “server regression.”
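Clustering failures by client build is a one-liner once logs are structured. A sketch assuming pipe-delimited lines of event|user_agent; the field layout is an assumption, so point the awk at your real Gateway log shape:

```shell
# Sketch: count auth failures per client build and surface the dominant one.
set -eu
log=$(mktemp)
cat > "$log" <<'EOF'
auth_failed|openclaw-mobile/3.1.0
auth_failed|openclaw-mobile/3.1.0
auth_ok|openclaw-desktop/4.2.0
auth_failed|openclaw-mobile/3.1.0
EOF

top=$(awk -F'|' '$1=="auth_failed" {n[$2]++}
  END {for (ua in n) if (n[ua] > m) {m = n[ua]; best = ua} print best, m}' "$log")
echo "dominant failing build: $top"
```

If one build dominates, the next action is a client rollout, not a server restart.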
| Signal | Meaning | Next |
|---|---|---|
| Repeating missing-key hints | State not unified under ~/.openclaw | Stop upgrading; migrate dirs |
| Interactive OK / daemon broken | launchd drift | Diff plist vs shell env |
| Night-only failures | Different user context | Align user and key visibility |
8.1 CI, staging pins, and why “latest” is not a release artifact
Plenty of teams teach CI to run npm i -g openclaw@latest every night and then treat a green job as proof the fleet is safe. That ignores three realities: the runner may ship a different Node major than production, the default shell may populate a different PATH, and artifact metadata rarely records the resolved semver or digest. The fix is boring but decisive: pin an exact version or digest in CI, write the resolved absolute binary path plus shasum into build metadata, and make production change tickets cite that metadata instead of the word “latest.”
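In a GitHub-Actions-style pipeline, the pin plus metadata capture might look like the following; the step name, file paths, and version are all hypothetical:

```yaml
# Hypothetical CI step: exact pin, resolved path and digest written to metadata.
- name: Install pinned OpenClaw
  run: |
    npm i -g openclaw@1.2.3            # exact semver, never @latest
    mkdir -p build-meta
    BIN=$(command -v openclaw)
    echo "$BIN" > build-meta/openclaw.path
    shasum -a 256 "$BIN" > build-meta/openclaw.sha256
```

Production change tickets should cite the contents of build-meta, not the tag the job happened to resolve that night.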
Staging Gateways that share the same Slack workspace as production need separate bots or channels. A mis-aimed webhook replay during rehearsal can pollute production audit trails and create false incidents. When you clone a production launchd plist for staging, change only secrets and listen ports, then run the full smoke suite. If staging’s plist suddenly contains an extra EnvironmentVariables block that production lacks, you do not “fix staging”; you merge that structure back to production because drift already happened.
During a version freeze, ban parallel interpretations of OPENCLAW_CONFIG_ROOT. One operator drops values in /etc/environment, another edits files under the user home, and whichever wins after the next daemon restart feels random. Freeze global injections for the week and pick one authoritative surface—single plist or single systemd drop-in—then attach a unified diff -u to the ticket.
Configuration management tools (Ansible, Chef, Puppet) often render templates twice: once for login sessions and once for daemons. If those renders are minutes apart, you get “works at lunch, breaks overnight” mysticism. Stamp the template revision into a plist comment line so on-call engineers can line it up with the CM commit without grep archaeology.
Upgrades are not only for humans. Any cron or watchdog that shells out to openclaw for health checks must advance with the major CLI; older scripts may misread new exit codes and mark a healthy Gateway as critical, triggering restart storms. Add those scripts to the same change calendar as the server binary.
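A watchdog that pins its exit-code contract looks like this; the code meanings (0 healthy, 2 degraded) are assumptions standing in for whatever your CLI major actually documents, and a stub function simulates the check:

```shell
# Sketch: a health-check wrapper that makes its exit-code contract explicit,
# so a new CLI major cannot silently turn "degraded" into "critical".
set -eu
health_check() { return 2; }   # stub for the real status command

set +e
health_check
code=$?
set -e

case "$code" in
  0) verdict=healthy ;;
  2) verdict=degraded ;;   # assumed new in this major: degraded is NOT critical
  *) verdict=critical ;;
esac
echo "gateway: $verdict (exit $code)"
```

Version this wrapper alongside the server binary so the case table and the CLI major move together.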
Finally, rehearse rollback with the same data you used going forward: restore the snapshot, reinstall the prior digest, replay channel probes, and only then declare the exercise complete. A rollback that skips doctor output is as incomplete as an upgrade that skipped it.
9. Closing: Windows/Linux paths diverge harder
(1) Non-Mac service managers multiply failure modes without a frozen state contract.
(2) Remote Apple Silicon Macs align better with documented launchd flows and creative automation stacks.
(3) MACGPU offers low-friction remote Mac rentals for always-on Gateway validation — CTA below, no login wall.
(4) Upgrades without snapshots + doctor logs are incomplete changes.
Windows services and Linux systemd units often hide environment inheritance behind layered wrappers — WSL bind mounts, snap confinement, or distro-specific Node paths. The same OPENCLAW_* key may need to appear in three places before the daemon, the CLI, and the skill subprocess all agree. Document each layer explicitly instead of assuming “export in .bashrc” propagates.
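On Linux, one way to make the layering explicit is to pick a single systemd drop-in as the authoritative surface; the unit name, key, and paths below are placeholders:

```ini
# Hypothetical drop-in: /etc/systemd/system/openclaw-gateway.service.d/env.conf
# Mirror the same values in the CLI user's shell profile and any skill
# sandbox profile, then document all three locations explicitly.
[Service]
Environment=OPENCLAW_CONFIG_ROOT=/home/svc/.openclaw
Environment=PATH=/usr/local/bin:/usr/bin:/bin
```

After editing, `systemctl daemon-reload` and a restart are required before the daemon sees the new values.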
Capacity planning matters too: after auth v2, short-lived reconnect storms can spike CPU on small VPS instances during mass client upgrades. Rate-limit client rollouts or temporarily scale the Gateway host while the fleet catches up.
Security reviewers increasingly ask for proof that doctor --fix did not broaden file permissions; attach before/after ls -le snippets for sensitive directories when submitting change records.
10. Tie-in: install matrix & Docker prod
npm global, Compose, and pnpm source installs use different data roots — know which chain you repair before copying files across paths.
When Compose pins an image digest, treat that digest as part of the upgrade contract: bumping only the npm global on the host does not move the container’s baked-in CLI. Conversely, bind-mounting ~/.openclaw into a container while also editing the same tree from the host invites race conditions during doctor --fix; serialize writers or use a staging copy.
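A Compose excerpt that treats the digest as the contract and avoids a shared live bind mount might look like this; the image name and digest are placeholders:

```yaml
services:
  gateway:
    # Pin by digest: bumping the host's npm global does not move this.
    image: ghcr.io/example/openclaw@sha256:<digest>
    volumes:
      # Named volume instead of bind-mounting the host's live ~/.openclaw,
      # so doctor --fix never races two writers on one tree.
      - openclaw-state:/data/.openclaw
volumes:
  openclaw-state:
```

If you must share state with the host, copy it into the volume during a maintenance window rather than mounting the live tree.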
For teams running CI that shells out to openclaw, pin the CI image or runner AMI alongside the server version. CI drift is a frequent source of “works in prod Gateway but fails in nightly automation” because the job picks up a different minor from cache.
Finally, keep a one-page “upgrade card” per environment listing binary path, plist label, data root, and channel list. During incidents, that card should be the first attachment in the war room — not a live grep session.
11. Week-zero checklist after the green doctor
A green doctor screen is necessary but not sufficient. The week after a production upgrade should follow a deliberate checklist so regressions surface while context is still fresh. Day zero, capture a second doctor tarball after real traffic returns; diff it against the pre-upgrade capture and reject any surprise file moves. Day one, run channel probes from both the daemon account and an admin account to confirm permission symmetry. Day two, exercise the slowest skill you own — cold start plus retries — because auth v2 and sandbox tightening often appear first as sporadic skill failures rather than outright Gateway crashes.
Day three, review structured logs for auth error clustering by client build. If one cohort dominates, schedule a forced upgrade window instead of chasing server knobs. Day four, validate backups and snapshots: confirm the restore path still matches the plist you shipped, not an older label you forgot to retire. Day five, hold a fifteen-minute retrospective with whoever executed the change and whoever watched the metrics; write down one concrete improvement to the upgrade card.
Throughout the week, watch for “silent success” traps: channels that accept messages but drop rich payloads, or webhooks that succeed with HTTP 200 yet fail signature validation on threaded replies. Add synthetic checks that mirror real message shapes, not only ping messages.
If you operate multiple environments, stagger rollouts by risk: internal sandbox first, partner-facing staging second, customer-facing production last. Keep the time gap large enough to observe at least one full daily traffic cycle between stages. Racing all three in one evening guarantees you will learn nothing except how fast Slack notifications arrive.
Document dependency edges explicitly: reverse proxies, TLS terminators, corporate MITM appliances, and DNS TTLs all participate in auth and webhook stories. When any of those layers change independently of OpenClaw, rerun a reduced smoke even if the semver stayed flat.
Finally, connect upgrades to business calendars. Major marketing pushes, quarter-end freezes, and on-call holidays should influence timing more than npm’s publication timestamp. A boring delay beats a heroic rollback while half the company is offline.
When the week closes clean, archive the checklist results next to the doctor tarballs. Auditors and future you will care far more about that bundle than about the pull-request title. If anything flaked, keep the flake in the open until it has an owner and a due date; “intermittent” without an assignee is how Gateways return to silence two minors later.
As a closing habit, rotate the engineer who performs the checklist each release so playbook knowledge spreads beyond a single specialist. Rotation also catches assumptions baked into personal shell aliases that never appear in documentation but still shape how smoke tests are run.
Keep the checklist itself under version control beside infrastructure code so diffs tell the story of how your Gateway maturity actually evolved quarter to quarter.