1. Pain decomposition: curl from the wrong namespace lies to you
(1) WS loopback confusion: ws://127.0.0.1:18789 is correct on the host when the port is published, but inside another container it loops back to that container itself. If the gateway listens under the service name openclaw-gateway, a hard-coded 127.0.0.1 never reaches it.
(2) Pairing vs token: "pairing required" is an authorization session problem; "token mismatch" is a secret mismatch; treating them as one problem leads to random edits.
(3) Environment overrides files: a single OPENCLAW_GATEWAY_TOKEN line in Compose can override the mounted gateway.auth.token, producing "I changed JSON but nothing moved".
(4) Remote Mac dual stacks: launchd plus compose sharing the same port declaration or the same state directory yields intermittent silent failures.
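Point (3) can be made concrete with a small sketch of token precedence. It assumes the gateway prefers OPENCLAW_GATEWAY_TOKEN over the file value, as described above, and that gateway.auth.token maps to nested JSON keys; the nesting and both function names are illustrative assumptions, not the OpenClaw implementation.

```python
def effective_token(env, config):
    """Return (token, source) assuming env wins over the mounted JSON file.

    `config` is the parsed JSON; the gateway -> auth -> token nesting is an
    assumption inferred from the dotted name gateway.auth.token.
    """
    env_token = env.get("OPENCLAW_GATEWAY_TOKEN")
    file_token = config.get("gateway", {}).get("auth", {}).get("token")
    if env_token:
        return env_token, "env"
    if file_token:
        return file_token, "file"
    return None, "unset"


def has_dual_sources(env, config):
    """True when both sources set a token and the values disagree."""
    env_token = env.get("OPENCLAW_GATEWAY_TOKEN")
    file_token = config.get("gateway", {}).get("auth", {}).get("token")
    return bool(env_token) and bool(file_token) and env_token != file_token
```

Running has_dual_sources against the environment and the parsed JSON of the failing container is a quick way to confirm or rule out the "I changed JSON but nothing moved" case.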
2. Symptom matrix: triage transport before auth before channels
| Signal | Hypothesis | First action |
|---|---|---|
| 1006 / reset | Wrong host or port conflict | From the same network namespace as the CLI, run nc -vz host port |
| 1008 pairing required | Pairing incomplete | Open control UI, issue pairing code, run CLI pair command |
| token mismatch | Dual sources of truth | Print env inside the CLI container; reconcile with volume JSON |
| Flaky success | Two instances racing on state | docker ps plus volume table; shrink to one gateway |
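The matrix can be encoded as a lookup so that tickets start from the same first action every time. This is a minimal sketch: the substrings are taken from the Signal column, while the exact wording of real error messages is an assumption, and the rule order puts transport before auth.

```python
# Each rule: (substrings to look for, hypothesis, first action).
# Order matters: transport is checked before auth, as in the matrix.
TRIAGE_RULES = [
    (("1006", "reset"),
     "Wrong host or port conflict",
     "From the same network namespace as the CLI, run nc -vz host port"),
    (("1008", "pairing required"),
     "Pairing incomplete",
     "Open control UI, issue pairing code, run CLI pair command"),
    (("token mismatch",),
     "Dual sources of truth",
     "Print env inside the CLI container; reconcile with volume JSON"),
]


def triage(message):
    """Map an error message to (hypothesis, first_action), or None."""
    lowered = message.lower()
    for needles, hypothesis, action in TRIAGE_RULES:
        if any(needle in lowered for needle in needles):
            return hypothesis, action
    return None
```

The "Flaky success" row is left out on purpose: it is a pattern over time rather than a single message, so it needs the docker ps and volume comparison instead of string matching.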
3. Five-step runbook: from ping to a successful channels probe
- Freeze namespaces: Document whether CLI runs on host, sidecar, or CI runner; draw one arrow from client to listener.
- Explicit gateway base: Set OPENCLAW_GATEWAY_URL (or your documented equivalent) to openclaw-gateway:18789 or the published host port; never assume 127.0.0.1 across namespaces.
- Pairing gate: Finish pairing before testing business channels; keep two ticket templates for "unpaired" vs "wrong token".
- Single token source of truth: Pick env or file as canonical; make the other match or remove it.
- Accept with channels probe: Run one successful probe from the same environment and attach logs plus compose snippet to the change request.
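Where nc is not installed in the CLI image, the transport check from the runbook can be approximated with the Python standard library. A minimal sketch, to be run from the same namespace as the CLI; the function name tcp_probe is illustrative.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Roughly what `nc -vz host port` checks: name resolution plus the
    TCP handshake. It says nothing about WebSocket upgrade or auth.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, tcp_probe("openclaw-gateway", 18789) inside a sidecar and tcp_probe("127.0.0.1", 18789) on the host test the two arrows that the first runbook step asks you to draw.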
4. When network_mode: "service:openclaw-gateway" is justified
Use it when the CLI must share the gateway network namespace to avoid brittle DNS under the default bridge. The cost is confusing localhost semantics—diagram it. Alternative: keep CLI on the host and point WS to the published host port, aligning TLS termination at the reverse proxy.
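The localhost semantics can be written down as a small decision function. This is a sketch under stated assumptions: the three location labels are invented for illustration, the port 18789 and service name openclaw-gateway come from this article, and the plain ws:// scheme assumes no TLS on the path.

```python
def gateway_url(cli_location, service="openclaw-gateway", port=18789,
                published_port=18789):
    """Pick the WS URL for a given CLI location (labels are hypothetical).

    host            - CLI runs on the Docker host; uses the published port.
    shared-netns    - CLI uses network_mode: "service:openclaw-gateway", so
                      127.0.0.1 is the gateway's own loopback interface.
    compose-network - CLI is a separate container on the same compose
                      network; 127.0.0.1 would point at the CLI itself.
    """
    if cli_location == "host":
        return f"ws://127.0.0.1:{published_port}"
    if cli_location == "shared-netns":
        return f"ws://127.0.0.1:{port}"
    if cli_location == "compose-network":
        return f"ws://{service}:{port}"
    raise ValueError(f"unknown CLI location: {cli_location!r}")
```

The same 127.0.0.1 string is right in two cases and wrong in the third, which is why the topology diagram matters more than the URL itself.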
5. Citeable thresholds (replace with your measurements)
Discussion-grade numbers:
- If TCP connect success from the CLI namespace falls below 97% over 100 probes, stop feature work and fix DNS, service names, or published ports first.
- If you see more than three token mismatch events inside a single pairing window, assume dual sources until proven otherwise.
- If lock wait on a shared state directory exceeds 2.5s at p95 on a remote Mac, split volumes or move CLI into a dedicated sidecar.
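The three thresholds above can be checked mechanically once the measurements exist. A minimal sketch: the finding labels and function names are illustrative, p95 uses the nearest-rank method (one of several common definitions), and the numbers are the discussion-grade values from this section, to be replaced with your own.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def evaluate_thresholds(connect_results, mismatch_events, lock_waits_s):
    """Return the list of thresholds that were crossed.

    connect_results: booleans, one per TCP probe from the CLI namespace.
    mismatch_events: count of token mismatch events in one pairing window.
    lock_waits_s:    lock wait samples on the state directory, in seconds.
    """
    findings = []
    ok = sum(1 for result in connect_results if result)
    if ok * 100 < 97 * len(connect_results):  # success rate below 97%
        findings.append("transport")
    if mismatch_events > 3:
        findings.append("dual-token-sources")
    if lock_waits_s and p95(lock_waits_s) > 2.5:
        findings.append("state-lock")
    return findings
```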
6. Reading openclaw logs in three layers
Filter transport lines first, then auth and pairing, then channel-specific lines. Attach three screenshots to the ticket instead of narrative only. When read together with the official install.sh article, treat install ordering and network alignment as two separate acceptance gates.
| Layer | Question | Pass example |
|---|---|---|
| L1 transport | Did the WS handshake succeed? | No unexplained 1006/1008 |
| L2 auth | Are token and pairing consistent? | No unauthorized loop after pairing |
| L3 channels | Is the channel online? | channels probe returns an interpretable state |
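The three-layer reading order can be sketched as a keyword filter. The keywords below are assumptions for illustration only, since the actual OpenClaw log format is not documented here; adjust them to the lines your gateway really emits.

```python
# Keyword lists are hypothetical; checked in order L1, L2, L3.
LAYERS = [
    ("L1 transport", ("1006", "handshake", "reset", "refused")),
    ("L2 auth", ("pairing", "token mismatch", "unauthorized")),
    ("L3 channels", ("channel",)),
]


def layer_of(line):
    """Return the first layer whose keywords appear in the line, else None."""
    lowered = line.lower()
    for name, keywords in LAYERS:
        if any(keyword in lowered for keyword in keywords):
            return name
    return None


def split_by_layer(lines):
    """Group log lines by layer, keeping unmatched lines separately."""
    groups = {name: [] for name, _ in LAYERS}
    groups["unclassified"] = []
    for line in lines:
        groups[layer_of(line) or "unclassified"].append(line)
    return groups
```

Reading the groups in L1, L2, L3 order keeps a channel error from being investigated while the handshake is still failing.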
7. Remote Mac: launchd vs Docker for ports and volumes
Document who owns port 18789, whether state is a bind mount or named volume, and which unit stops first during upgrades. Pair with the SSH vs VNC guide to separate interactive VNC edits from unattended compose rolls.
8. FAQ
Will extra_hosts fix everything? Sometimes, but if the root cause is dual token sources, extra_hosts only reshuffles failure modes.
Bind gateway to 0.0.0.0? Read the attack-surface article first; public exposure is not the default answer.
Conflict with install.sh flow? No—install.sh establishes baseline order; this article focuses on Docker networking and auth alignment.
9. Case study: logs say listen, CLI still 1008
A frequent 2026 failure mode is the gateway listening in container A while developers run the CLI in container B with ws://127.0.0.1. It is obvious on paper yet expensive in multi-repo setups. Put the "namespace that runs openclaw" sentence at the top of the README and add a CI smoke test that runs the probe from the runner container.
A second class is forgotten OPENCLAW_GATEWAY_TOKEN lines in override files. Pick env-or-file discipline and add a PR checklist item for token env diffs.
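The PR checklist item can be backed by a small scanner that lists every place a token is set. This is a sketch assuming the usual .env and compose environment forms (KEY=value, - KEY=value, KEY: value); it works on text rather than parsed YAML, so unusual layouts may be missed.

```python
import re

# Matches KEY=value, "- KEY=value", and "KEY: value"; commented lines
# starting with '#' do not match because the key must lead the line.
TOKEN_LINE = re.compile(r"""^\s*-?\s*["']?OPENCLAW_GATEWAY_TOKEN\s*[:=]""")


def find_token_lines(text, source_name):
    """Return (source_name, line_number) for each line that sets the token."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if TOKEN_LINE.match(line):
            hits.append((source_name, number))
    return hits
```

Running it over the base compose file, each override file, and any .env file gives a short list to review; more than one hit is a prompt to decide which source is canonical.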
A third class hits remote Macs when Time Machine or sync software mirrors state directories, creating ghost second gateways. Mark state directories as non-synced or consolidate on a dedicated remote node.
Read this alongside the migration article to separate "machine migration" drills from "compose rebuild" drills; mixing them muddles rollback stories.
Docker is repeatable delivery under stronger constraints, not automatic simplicity. Attach compose, env prints, nc output, and three-layer log snippets before claiming production readiness.
10. n8n and webhook ingress
When n8n calls back into your stack, the HTTP ingress owner must match the WS token owner. Split HTTP 502 investigations from WS 1008 tickets using distinct titles.
11. Upgrade discipline: single gateway first
During breaking releases, shrink to one gateway and one CLI before re-enabling HA. Half-upgraded volumes produce half-consistent schemas; attach doctor output to the change record.
12. Closure: Docker proves repeatability, remote Mac carries the SLO
(1) Limits: Without documenting CLI namespaces, Docker troubleshooting becomes exponential; dual token sources make config edits feel random.
(2) Why remote Mac pools help: Dedicated hosts pin compose versions, volumes, and health checks away from developer laptops.
(3) MACGPU fit: If you want a low-friction trial of aligned remote Mac gateways instead of laptop servers, MACGPU offers rentable nodes and public help entry points; the CTA below links to plans without login.
(4) Final gate: Do not claim production readiness without a successful channels probe plus compose/env/token alignment evidence.
13. Links to install matrix and install.sh
If install.sh is not green yet, read that article first. If green but Docker fails, restart from the L1 matrix here. The three-stack matrix helps decide whether CLI belongs on host or in a sidecar.
14. Observability: annotate blast radius per compose project
When multiple compose projects share a host, label which project owns port 18789 and which owns webhook ingress. On-call should open a single page with that map instead of grepping Slack history.
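The ownership map can live in the repo as data and be checked for collisions. A minimal sketch; the project names in the usage note are hypothetical and the input is a hand-maintained mapping, not something read from Docker.

```python
from collections import defaultdict


def port_owners(projects):
    """Invert {project: [host ports]} into {host port: [projects]}."""
    owners = defaultdict(list)
    for project, ports in projects.items():
        for port in ports:
            owners[port].append(project)
    return dict(owners)


def port_conflicts(projects):
    """Return only the host ports claimed by more than one project."""
    return {
        port: sorted(names)
        for port, names in port_owners(projects).items()
        if len(names) > 1
    }
```

For instance, port_conflicts({"agents": [18789], "staging": [18789, 8080]}) reports that both projects claim 18789, which is the case the on-call page should make impossible to miss.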
15. Security note: token in env vs secret files
Environment variables appear in process listings and crash dumps. If your threat model forbids that, prefer secret files with correct volume permissions and remove env duplicates. Align with the Gateway attack-surface guide for bind addresses and reverse proxies.
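If you move to secret files, "correct volume permissions" can be checked with a few lines. This sketch treats a file as private when group and others have no permission bits at all; that policy is an assumption, and a setup where the container reads the file through a group will need a looser rule.

```python
import os
import stat


def mode_is_private(mode):
    """True when neither group nor others have any permission bits set."""
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def secret_file_is_private(path):
    """Check the permission bits of the file at `path` (POSIX semantics)."""
    return mode_is_private(os.stat(path).st_mode)
```

A mode of 0600 passes and 0644 fails. Run the check against the path as mounted inside the container too, since bind mounts carry host ownership and modes with them.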
16. CI parity: reproduce the failing namespace in GitHub Actions
Many teams only reproduce OpenClaw issues on laptops. Add a job that builds the same compose project and runs docker compose run --rm openclaw-cli openclaw channels probe so regressions in WS URLs are caught before merge. Cache images sensibly but still refresh tags when security patches land. If your CI runners cannot resolve Docker service names, document that limitation explicitly instead of silently falling back to 127.0.0.1, which hides the production failure mode.
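To keep CI from silently falling back to loopback, the job can refuse to run the probe when the configured URL points at 127.0.0.1 or localhost. A minimal sketch: the function names are illustrative, and it assumes the URL includes a scheme such as ws://.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url):
    """True when the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, e.g. a compose service name.
        return False


def check_ci_gateway_url(env):
    """Raise if OPENCLAW_GATEWAY_URL is missing or points at loopback."""
    url = env.get("OPENCLAW_GATEWAY_URL")
    if not url:
        raise RuntimeError("OPENCLAW_GATEWAY_URL is not set")
    if is_loopback_url(url):
        raise RuntimeError(f"{url} is loopback; use the service name in CI")
    return url
```

Calling check_ci_gateway_url(os.environ) as the first CI step turns the hidden fallback into an explicit failure; skip the guard for jobs that deliberately share the gateway's network namespace.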
17. Reverse proxies and WebSocket upgrade headers
When TLS terminates at nginx or Caddy, verify upgrade headers, idle timeouts, and buffering settings. A proxy that buffers WebSocket frames can look like random 1006 closes under load. Capture proxy access logs alongside OpenClaw logs and correlate timestamps. If you terminate TLS on the host while Gateway stays plain HTTP inside Docker, keep that diagram in the repo so new hires do not reintroduce double encryption or wrong scheme in OPENCLAW_GATEWAY_URL.
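When checking upgrade headers, it helps to know what a correct handshake response looks like. The sketch below validates a response against the WebSocket opening handshake rules in RFC 6455 (status 101, Upgrade and Connection headers, and the Sec-WebSocket-Accept value derived from the client key); the function names are illustrative, and capturing the response from your proxy is left to whatever HTTP client you already use.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept value for a client key."""
    raw = (sec_websocket_key + WS_GUID).encode("ascii")
    return base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")


def upgrade_problems(status, headers, sec_websocket_key):
    """Return a list of handshake problems; an empty list means it looks fine."""
    lowered = {name.lower(): value for name, value in headers.items()}
    problems = []
    if status != 101:
        problems.append(f"status {status}, expected 101")
    if lowered.get("upgrade", "").lower() != "websocket":
        problems.append("missing 'Upgrade: websocket'")
    tokens = [t.strip().lower() for t in lowered.get("connection", "").split(",")]
    if "upgrade" not in tokens:
        problems.append("Connection header lacks 'upgrade'")
    if lowered.get("sec-websocket-accept") != expected_accept(sec_websocket_key):
        problems.append("Sec-WebSocket-Accept mismatch")
    return problems
```

A proxy that is not configured to forward the Upgrade and Connection headers usually shows up as a non-101 status here, which separates proxy misconfiguration from gateway auth errors.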
18. Runbook handoff template for on-call
Paste the following into your incident template: compose project name, image tags, WS URL as seen from the failing namespace, output of env grep for OPENCLAW_ variables, output of volume path for gateway.auth.token, result of nc, and last twenty lines of gateway logs filtered by level. When legal or procurement blocks cloud GPUs, a documented remote Mac pool remains a pragmatic way to add stable Apple Silicon capacity for always-on agents without capital purchases.
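The env grep in the template should not paste raw tokens into a ticket. One option, sketched here as an assumption rather than an OpenClaw feature, is to replace token values with a short SHA-256 fingerprint so that two sources can still be compared for equality.

```python
import hashlib


def fingerprint(value):
    """Short, non-reversible fingerprint for comparing secrets."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "sha256:" + digest[:12]


def collect_openclaw_env(environ):
    """Return OPENCLAW_* variables, with token values fingerprinted."""
    report = {}
    for name in sorted(environ):
        if not name.startswith("OPENCLAW_"):
            continue
        value = environ[name]
        report[name] = fingerprint(value) if "TOKEN" in name else value
    return report
```

Applying fingerprint to the value read from the mounted JSON gives a string that can sit next to the env fingerprint in the incident doc; a difference confirms dual sources without exposing either secret.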
19. Version skew between CLI and Gateway images
Splitting CLI and Gateway across different image tags is valid, but only when your release notes explicitly support that matrix. If the CLI adds a handshake field that the gateway build does not understand, you will see opaque disconnects that resemble networking bugs. Pin both images to the same release train during incidents, then bisect. Record the digests in your incident doc, not only floating tags. For remote Mac fleets, automate a weekly digest check so drift does not accumulate silently while engineers focus on feature work.
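A weekly drift check can start by flagging image references that are not pinned by digest. A minimal sketch; the image names in the usage note are hypothetical, and only sha256 digests are recognized.

```python
import re

# An image reference pinned by digest ends with @sha256:<64 hex chars>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref):
    """True when the reference is pinned to an immutable sha256 digest."""
    return bool(DIGEST_RE.search(image_ref))


def unpinned(image_refs):
    """Return the references that rely on a floating tag."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]
```

Given a list such as ["example/openclaw-gateway:latest", "example/openclaw-cli@sha256:<digest>"], unpinned returns only the first entry, which is the one to resolve and record in the incident doc.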
20. Documentation debt is operational debt
Every hour spent rediscovering 127.0.0.1 semantics is an hour not spent improving prompts or tools. Invest once in a one-page topology and link it from README, runbooks, and onboarding. When executives ask whether OpenClaw is production ready, hand them the probe log and the compose digest, not a verbal assurance. Remote Mac rental does not remove the need for good docs, but it does give you a stable place to host the canonical stack that matches what production actually runs.
21. Post-incident review prompts
After closing a gateway incident, answer five questions in writing: which namespace initiated the CLI, which file or env variable ultimately held the token, whether pairing was attempted before token edits, whether any proxy sat on the path, and which compose digest was live during the outage. Missing any one answer usually means the incident will repeat under a different symptom string next quarter.
Finally, treat macOS minor upgrades on remote hosts as semver bumps for your stack: rerun probe and nc checks after each upgrade window before declaring the fleet healthy again.