OPENCLAW 2026 GATEWAY SYSTEMD/LAUNCHD RUNBOOK

Pain: channels work in a demo shell, yet production dies the moment the laptop sleeps or the VPS reboots, because Linux systemd and macOS launchd have different contracts for restarts, logging, and identity. Outcome: a comparison matrix, a five-step hardening path, a rollback-pack recipe, and a fixed diagnostic ladder from openclaw status to openclaw channels status --probe. Shape: pain split, matrix, steps, metrics, decision table, case study, FAQ. Related: onboarding and daemon logs, Docker production, common errors, remote Mac deploy, plans.


1. Pain split: install works, supervise fails

(1) Confusing demo success with production topology: Running openclaw gateway in an interactive terminal proves binaries and tokens, not restart policy. Without a single supervisor, every reboot or OOM returns you to pager duty. (2) OS semantics differ: VPS fleets care about journald, cgroup limits, and unattended security updates; Mac fleets care about lid-close sleep, App Nap, and whether your daemon is a LaunchDaemon versus a LaunchAgent tied to a GUI session. (3) Upgrades without rollback: Fast-moving OpenClaw releases plus Node/Python drift can desynchronize Gateway and channel adapters; logs then look like "WhatsApp is broken" when the real fault is a mismatched binary pair.

Teams that skip a written supervisor contract usually accumulate "zombie" gateways: an old foreground process still bound to a port while systemd tries to spawn a new one, yielding flaky TLS or random 502s from an edge proxy. The fix is operational hygiene: exactly one owner process, explicit bind addresses, and a runbook line that says "stop supervisor before manual debug."
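The "stop supervisor before manual debug" line can be enforced mechanically rather than by memory. A minimal sketch, assuming a Linux host with `ss` available; the unit name openclaw-gateway and port 8443 are assumptions to adapt:

```shell
# Sketch: before foreground debugging, stop the supervisor, then confirm the
# port is actually free (unit name and port are assumptions).
port_is_free() {
  # succeeds when no TCP listener is bound to the given port
  ! ss -lnt 2>/dev/null | awk -v p=":$1$" '$4 ~ p { found = 1 } END { exit !found }'
}

# sudo systemctl stop openclaw-gateway
# port_is_free 8443 && echo "safe to run openclaw gateway in the foreground"
```

A zombie gateway shows up here as a bound port with the supervisor stopped, which is exactly the single-owner violation described above.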

Another subtle class of failures is configuration dual-sourcing: one engineer edits openclaw.json on the server while another re-runs onboard locally and syncs a different path. Channels appear healthy when the last writer happens to be correct. Git with secret filtering, plus immutable deploy artifacts, eliminates that roulette.

Finally, observability debt shows up as emotional debugging: people restart random services instead of reading ordered logs. A fixed ladder keeps mean time to innocence low and prevents model API keys from being rotated unnecessarily.

Operational maturity also means defining SLOs that matter to users, not engineers. "Gateway process up" is insufficient if half your channels are red because OAuth tokens expired. Pair process health with synthetic probes that send a harmless test message through each critical channel nightly. Record failures in a spreadsheet; after two weeks the patterns become obvious where gut feeling alone stays blind.
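The nightly probe-and-record loop can be a few lines of cron-driven shell. A sketch: the CSV path and cron schedule are assumptions, and the probe command is the one this runbook already uses.

```shell
# Append one timestamped pass/fail row per probe run to a CSV.
# First argument is the CSV path; the rest is the probe command to run.
record_probe() {
  csv="$1"; shift
  if "$@" >/dev/null 2>&1; then status=pass; else status=fail; fi
  printf '%s,"%s",%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" "$status" >> "$csv"
}

# nightly cron entry (sketch; paths are assumptions):
#   15 3 * * * . /opt/openclaw/probe.sh; record_probe /var/log/openclaw/probes.csv openclaw channels status --probe
```

The CSV is the "spreadsheet" mentioned above; import it weekly and look for day-of-week or post-upgrade clusters.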

On Linux, unattended upgrades can restart services in the middle of long-running jobs. Pin automatic security patches behind maintenance windows or use transactional updates if your distro supports them. On Mac, App Store and Xcode updates can reorder frameworks; snapshot before major OS bumps and rehearse Gateway start on a clone if possible.

2. Matrix: Linux systemd vs macOS launchd

Dimension     | Linux VPS + systemd                                      | Remote Mac + launchd
Primary win   | Cheap static egress, no sleep, familiar cloud automation | Apple toolchains, Shortcuts, media workflows, unified memory for sidecar LLMs
Supervision   | Unit files, Restart=always, backoff, journalctl          | LaunchDaemon vs LaunchAgent; watch user vs system domain
Power story   | CPU credits, noisy neighbors                             | Lid sleep, Wi-Fi power save, need WOL or AC policy
Rollback      | Package pin + tarball of config dir                      | APFS snapshot or parallel install trees + plist pin
Debug anchors | systemctl status, ss -lntp                               | launchctl print, log show
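The two debug-anchor columns can be folded into one entry point so responders on either OS type the same thing. A sketch; the unit name and launchd label are assumptions:

```shell
# Print the supervisor-status command appropriate for the current OS.
# openclaw-gateway / com.openclaw.gateway are assumed names.
gateway_status_cmd() {
  case "$(uname -s)" in
    Linux)  echo "systemctl status openclaw-gateway --no-pager" ;;
    Darwin) echo "launchctl print system/com.openclaw.gateway" ;;
    *)      echo "unsupported OS" >&2; return 1 ;;
  esac
}

# eval "$(gateway_status_cmd)"   # run it once you trust the mapping
```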

2b. Single supervisor rule

Production should enforce: either systemd/launchd owns Gateway, or a human owns it—never both simultaneously. Foreground debugging is valid; just stop the unit first and document the port release. This single rule removes a surprising share of heisenbugs in multi-channel setups.

Document the stop/start sequence in your incident template: which unit name, which launchctl bootout label, which port should be free, and how long to wait before declaring success. Seconds of ambiguity become hours when multiple responders act concurrently.
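"How long to wait" deserves a concrete answer in the template. A sketch of the wait step, assuming Linux `ss`; the port, timeout, unit name, and launchd label are placeholders:

```shell
# Wait up to $2 seconds (default 30) for TCP port $1 to be released.
wait_port_free() {
  port="$1"; timeout="${2:-30}"
  while [ "$timeout" -gt 0 ]; do
    ss -lnt 2>/dev/null | grep -q ":$port " || return 0   # nothing listening: done
    sleep 1; timeout=$((timeout - 1))
  done
  echo "port $port still bound after wait" >&2
  return 1
}

# sudo systemctl stop openclaw-gateway                  # Linux, or:
# sudo launchctl bootout system/com.openclaw.gateway    # macOS (label is an assumption)
# wait_port_free 8443 30 && echo "stop confirmed"
```

Putting the timeout in the script means two responders acting concurrently at least disagree about seconds, not minutes.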

3. Five-step hardening path

  1. Freeze source of truth: version string, config path, and change log entry.
  2. Baseline health: openclaw status, openclaw gateway status, openclaw channels status --probe.
  3. Author supervisor: systemd unit with EnvironmentFile and ulimits; plist with WorkingDirectory and stdout/stderr paths.
  4. Add guardrails: rate-limit restarts; alert if >3 crashes in 5 minutes.
  5. Ship rollback pack: tarball of config + pinned binary checksum before upgrades.
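Step five can be a two-command script run before every upgrade. A sketch; all paths are assumptions, and `sha256sum` is the Linux spelling (`shasum -a 256` on macOS):

```shell
# Build a rollback pack: config tarball plus a pinned checksum of the current binary.
make_rollback_pack() {
  dest="$1"; cfg_dir="$2"; bin="$3"
  stamp=$(date -u +%Y%m%d%H%M%S)
  mkdir -p "$dest"
  tar -czf "$dest/config-$stamp.tgz" -C "$(dirname "$cfg_dir")" "$(basename "$cfg_dir")"
  sha256sum "$bin" > "$dest/binary-$stamp.sha256"
}

# make_rollback_pack /var/backups/openclaw /etc/openclaw /usr/local/bin/openclaw
```

Rolling back is then untar plus a checksum comparison, which is fast enough to do under pager pressure.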

Step two is non-negotiable: if probes fail while you are "fixing" launchd, you are tuning the wrong layer. Channels must be green under manual invocation before automation wraps them.

Step three should include explicit User= and Group= on Linux and a dedicated service account on Mac when possible, so file permissions on skills directories do not flap between admin and login user.
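A minimal systemd unit covering steps three and four might look like the sketch below; every name and path is an assumption to adapt to your install.

```ini
# /etc/systemd/system/openclaw-gateway.service (sketch; names and paths are assumptions)
[Unit]
Description=OpenClaw Gateway
After=network-online.target
Wants=network-online.target
# guardrail from step four: stop respawning after a crash loop
StartLimitIntervalSec=300
StartLimitBurst=3

[Service]
User=openclaw
Group=openclaw
EnvironmentFile=/etc/openclaw/gateway.env
WorkingDirectory=/var/lib/openclaw
ExecStart=/usr/local/bin/openclaw gateway
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```

Reload with systemctl daemon-reload, then systemctl enable --now openclaw-gateway. The launchd counterpart expresses the same intent with WorkingDirectory, StandardOutPath/StandardErrorPath, and KeepAlive keys in the plist.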

Step four protects upstream APIs from crash loops that burn quota and trigger vendor-side throttles—often misread as "model outage."
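The ">3 crashes in 5 minutes" rule from step four is easy to check against supervisor logs. A sketch that reads crash epoch timestamps on stdin; how you extract them (for example from journalctl) is up to you:

```shell
# Print ALERT when more than 3 crash timestamps fall inside any 300-second window.
crash_loop_alert() {
  sort -n | awk '{ t[NR] = $1 } END {
    for (i = 4; i <= NR; i++)           # any 4 consecutive crashes...
      if (t[i] - t[i-3] <= 300) {       # ...within 5 minutes
        print "ALERT: crash loop"; exit
      }
  }'
}

# usage (sketch): extract crash epochs from your supervisor log, pipe them in.
```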

Between steps three and four, schedule a game day: intentionally kill the Gateway process and verify the supervisor restarts it, logs the event, and channels recover without manual touch. If recovery requires human typing, your automation is incomplete. Repeat the drill after every major OpenClaw upgrade; regressions hide in minor flag changes.
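The drill itself can be scripted so it runs identically every game day. A sketch for the Linux side; the unit name and the 60-second recovery budget are assumptions:

```shell
# Kill the gateway hard, then verify the supervisor restarts it within a deadline.
drill_restart() {
  unit="$1"; deadline="${2:-60}"
  sudo systemctl kill --signal=SIGKILL "$unit"
  sleep 2   # let the supervisor notice the death before polling
  while [ "$deadline" -gt 0 ]; do
    if systemctl is-active --quiet "$unit"; then
      echo "recovered with ${deadline}s to spare"
      return 0
    fi
    sleep 1; deadline=$((deadline - 1))
  done
  echo "NOT recovered: automation incomplete" >&2
  return 1
}

# drill_restart openclaw-gateway 60
```

A nonzero exit is the "requires human typing" signal described above; treat it as a failed drill, not a flaky script.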

# Layered diagnostics (adjust paths to your install)
# openclaw status
# openclaw gateway status
# openclaw logs --follow
# openclaw doctor
# openclaw channels status --probe

When doctor is clean yet replies stall, escalate to channel policy: allowlists, mention rules, pending pairings, and rate limits. Doctor validates environment; probes validate product behavior.

Correlate application logs with kernel timestamps: on Mac, power assertions and Wi-Fi roam events often precede silent channel drops that look like application bugs. On Linux, OOM killer messages in dmesg explain sudden child process loss better than application stack traces alone.
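A sketch of the Linux half of that correlation; the macOS `log show` predicate in the comment is an assumption to adapt:

```shell
# Filter kernel-log text on stdin down to OOM-killer evidence.
oom_events() {
  grep -Ei 'out of memory|oom-killer|killed process'
}

# Linux usage:  sudo dmesg -T | oom_events
# macOS analog (sketch): log show --last 1h --predicate 'eventMessage CONTAINS[c] "wake"'
```

Paste the surviving lines next to the application stack trace in the incident ticket; the kernel timestamp usually settles the "app bug or host event" argument immediately.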

4. Citeable SRE thresholds

Numbers for on-call docs:

  • Configure restart backoff (e.g. start at 5s, cap at 60s) to avoid thundering herds against providers.
  • If unattended windows show all channels red more than twice a week, suspect sleep, DNS, or TLS renewal before blaming models.
  • Teams without dedicated ops often reduce TCO by renting a supervised remote Mac with a pinned image instead of self-managing multiple Linux boxes.
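The 5s-to-60s schedule from the first bullet, as a pure function you can drop into a restart wrapper; the doubling policy is an assumption, and any monotone schedule with the same start and cap serves the same purpose:

```shell
# Delay in seconds before restart attempt N: start at 5s, double per attempt, cap at 60s.
backoff_delay() {
  attempt="$1"; delay=5; i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    [ "$delay" -gt 60 ] && delay=60
    i=$((i + 1))
  done
  echo "$delay"
}
```

The cap is what protects providers from a thundering herd: after a few attempts, every crashing node settles into the same slow, bounded retry rate.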

5. Decision: Linux edge vs remote Mac brain

Signal                                             | Move
Only need public ingress and webhooks, headless OK | Linux + systemd + WAF/TLS edge
Workflow mixes Apple media, files, or automation   | Remote Mac + launchd; Linux only as optional reverse proxy
No launchd/systemd literacy                        | Buy managed remote Mac hours with monitoring instead of DIY fleet
Compliance wants read-only rootfs                  | Container path (see Docker guide) but still one Gateway supervisor

Use the table as a pre-mortem: if you check multiple rows at once, split roles across machines explicitly and document which host owns the token vault.

Cost analysis must include engineer hours: a slightly pricier Mac hour that removes cross-OS file sync scripts often wins on calendar time.

MACGPU fits teams that want Apple-native behavior without capital expense: you rent a supervised remote Mac, keep Gateway and creative tooling colocated, and offload rack noise. It is not magic—just operational isolation with a predictable BOM—but that predictability is what small agencies trade for when hiring a full-time SRE is unrealistic.

If you must straddle both worlds, document the data plane explicitly: which host owns long-lived WebSocket sessions, which host signs Apple-specific payloads, and how TLS client certificates are renewed. Ambiguity there creates the worst outages because neither team believes it owns the fix.

Keep a changelog entry for every plist or unit tweak—even one line. Future you will not remember why ThrottleInterval changed, but git blame will thank you when regressions appear months later.

When in doubt, measure twice: capture before/after openclaw status --all output in tickets so reviewers see the delta without SSH access.
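A sketch of that capture step; the file naming is an assumption, and the health command is the one named above:

```shell
# Save labeled command output as ticket evidence; diff the before/after files.
snapshot() {
  label="$1"; shift
  "$@" > "ticket-$label.txt" 2>&1
}

# snapshot before openclaw status --all
# ...apply the change...
# snapshot after openclaw status --all
# diff ticket-before.txt ticket-after.txt
```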

That discipline turns noisy chat threads into searchable knowledge—cheap insurance against repeating the same outage story.

6. FAQ

This section condenses questions collected from production OpenClaw operators in early 2026. Treat answers as templates: adapt file paths, user names, and security policies to your org.

Q: Doctor green, no replies? Probe channels; verify mention policies and allowlists.
Q: Two users on one Mac? Separate HOME paths; never share writable config.
Q: Upgrade killed gateway only? Compare listening ports, stale PIDs, and env files package managers overwrite.

Q: Should I run Gateway in Docker on Mac? Possible, but volume mapping for credentials and IPC adds complexity; prefer native supervision unless you already standardized on containers.
Q: IPv6-only VPS? Ensure DNS resolution and certificate SANs match; half-open tunnels often trace to AAAA misconfig.
Q: Logging volume too high? Ship structured JSON to a collector; avoid tailing raw logs in production shells.

Q: How do I test failover between two hosts? Use a staging hostname and practice rollback twice before touching production tokens.
Q: Secrets in Git? Use sops or sealed secrets; never commit raw API keys even in private repos: forks and CI logs leak them.

Q: journald rotated away my evidence? Forward structured logs to a cheap object store or Loki; on-call should not rely on ephemeral buffers.
Q: launchd says loaded but process absent? Check exit codes in log show; plist typos often fail silently until you inspect exit status.
Q: Can I run Gateway as root? Avoid it; use least privilege with CapabilityBoundingSet on Linux or dedicated service users on macOS.

Q: Multi-region? Pin DNS TTL low during migration, validate TLS chains per region, and avoid shared state on NFS without file locking you understand.
Q: Canary upgrades? Run dual versions on different ports with separate tokens in staging before flipping production DNS.

7. Case study: layered diagnostics save sleep

Incident timelines in 2026 repeat: lid close or provider maintenance → Gateway down → users tweak models for hours. A frozen five-step ladder cuts mean time to recovery because it stops random config churn. The expensive mistakes are almost always operational, not algorithmic.

Dual-source configs create ghost successes: whichever machine wrote last wins until it does not. GitOps-style deploys with checksum verification on boot make drift visible immediately.

For Apple-heavy workflows, pushing Gateway to a remote Mac aligns permissions with how OpenClaw tutorials assume the filesystem behaves, reducing sandbox surprises. Pair that with an edge Linux node only if you must terminate TLS at scale; otherwise keep topology boring.

Long term, teams that invest in runbooks beat teams that chase five percent faster tokens. A one-page doc listing ports, health commands, and rollback tar paths pays rent every outage.

Security note: assume scanners hit any accidentally exposed management port. Bind management to loopback, terminate TLS at a controlled proxy, and rotate credentials on the same cadence as channel upgrades.

Capacity planning: Gateway CPU is rarely the long pole; network stalls and provider rate limits are. Measure p95 of channel probe latency weekly; a creeping trend often precedes hard failures. Pair that with disk usage on model caches—runaway skill downloads fill small VPS disks faster than operators expect.
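Computing that weekly p95 needs nothing beyond sort and awk. A sketch that reads one probe latency (in milliseconds) per line on stdin:

```shell
# Print the 95th-percentile value of the numbers on stdin (nearest-rank method).
p95() {
  sort -n | awk '{ v[NR] = $1 } END {
    if (NR == 0) exit 1
    idx = int(NR * 0.95); if (idx < 1) idx = 1
    print v[idx]
  }'
}

# usage (sketch): p95 < probe_latencies_week.txt
```

Plot the weekly number by hand if you must; the trend line, not the absolute value, is the early warning.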

Change management: treat plist and unit edits like code review items. A missing WorkingDirectory line can change relative paths for skills and break imports that worked in an interactive shell. Diff plists before reload; keep previous copies with .bak timestamps.
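A minimal LaunchDaemon sketch showing the keys this section calls out; the label and every path are assumptions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- /Library/LaunchDaemons/com.openclaw.gateway.plist (sketch; names/paths are assumptions) -->
  <key>Label</key><string>com.openclaw.gateway</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/openclaw</string>
    <string>gateway</string>
  </array>
  <key>WorkingDirectory</key><string>/var/lib/openclaw</string>
  <key>StandardOutPath</key><string>/var/log/openclaw/gateway.out.log</string>
  <key>StandardErrorPath</key><string>/var/log/openclaw/gateway.err.log</string>
  <key>KeepAlive</key><true/>
  <key>RunAtLoad</key><true/>
  <key>ThrottleInterval</key><integer>5</integer>
</dict>
</plist>
```

Keep the previous copy as a timestamped .bak, run plutil -lint on the new file, and only then launchctl bootstrap; a lint pass catches most of the silent typo failures.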

Vendor interactions: when opening tickets with model providers, attach Gateway version, probe output, and redacted config snippets—support teams close "no repro" tickets faster when logs are ordered and minimal.

People process: rotate on-call weekly but keep a single "gateway owner" architect who approves structural changes. Distributed ownership without a tie-breaker yields conflicting restart policies.

8. Closing: Linux for ingress, Mac for Apple-native glue

(1) Limits: Linux is weak at Apple-only glue; Mac is weaker at the cheapest raw egress story. (2) Remote Mac value: unified supervision story with creative stacks and fewer cross-OS file hops. (3) MACGPU: rent a predictable Apple Silicon node instead of building a closet datacenter—use the CTA for public plans and help.

Hybrid stacks are fine: generate tokens on Linux, execute Apple workflows on Mac, but automate the boundary with rsync or object storage—not drag-and-drop over flaky tunnels.

If you are experimenting, start with manual health checks, then systemd or launchd, then alerting. Skipping the middle step guarantees midnight pages.

Budgeting reality: engineering time dominates small-team TCO. A rented remote Mac with a pinned image and monitoring often beats a "free" VPS that consumes nights tuning cgroups. Conversely, if your team is all Linux-native and never touches Apple APIs, do not force Mac just because blogs say so—match OS to workflow.

Training: record a five-minute screen capture of your healthy probe sequence and store it next to the runbook. New hires replay it before touching production. The video costs less than one outage hour.

Finally, celebrate boring weeks. The best OpenClaw fleets are uneventful: graphs flat, probes green, upgrades rehearsed. Excitement belongs in demos, not pager duty.