OPENCLAW_2026
DOCKER_WS_
TOKEN_PAIRING_
COMPOSE_REMOTE.

// Pain: Compose is up, but openclaw on the host keeps returning Gateway closed (1008) / pairing required, or logs show token mismatch even after you edited openclaw.json. Outcome: A symptom-to-root-cause matrix, a five-step alignment runbook, and three citeable thresholds, so that Docker WebSocket loopbacks, pairing, and silent OPENCLAW_GATEWAY_TOKEN overrides become an auditable checklist, plus the boundary between launchd and containers on a remote Mac. Structure: pain, matrix, five steps, compose, thresholds, FAQ, case study, closure. Further reading: official install.sh, npm / Docker / pnpm matrix, Gateway attack surface, migration and re-pairing, SSH vs VNC remote Mac, plans and help.


1. Pain decomposition: curl from the wrong namespace lies to you

(1) WS loopback confusion: ws://127.0.0.1:18789 is correct on the host when the port is published, but inside another container 127.0.0.1 is that container's own loopback. If the gateway is reachable only under the service name openclaw-gateway, a hard-coded 127.0.0.1 never reaches it.

(2) Pairing vs token: pairing required is an authorization session problem; token mismatch is a secret mismatch. Treating them as one problem leads to random edits.

(3) Environment overrides files: a single OPENCLAW_GATEWAY_TOKEN line in Compose can override the gateway.auth.token in a mounted openclaw.json, producing "I changed JSON but nothing moved".

(4) Remote Mac dual stacks: launchd and Compose declaring the same port or sharing the same state directory yield intermittent silent failures.

2. Symptom matrix: triage transport before auth before channels

| Signal | Hypothesis | First action |
| --- | --- | --- |
| 1006 / reset | Wrong host or port conflict | From the same network namespace as the CLI, run nc -vz host port |
| 1008 pairing required | Pairing incomplete | Open control UI, issue pairing code, run CLI pair command |
| token mismatch | Dual sources of truth | Print env inside the CLI container; reconcile with volume JSON |
| Flaky success | Two instances racing on state | docker ps plus volume table; shrink to one gateway |
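
The matrix above can be expressed as a small triage helper. This is a minimal sketch, assuming error lines contain the symptom strings quoted in this article; the function name triage and the extra reset patterns are hypothetical, and real OpenClaw output may be worded differently.

```shell
#!/bin/sh
# triage LINE
# Maps one log or error line to a row of the symptom matrix.
# Patterns are taken from the symptoms quoted in this article (assumption:
# your logs use the same wording); adjust them to what you actually see.
triage() {
  case "$1" in
    *1006*|*"connection reset"*|*ECONNRESET*)
      echo "transport: run nc -vz from the CLI's network namespace" ;;
    *"pairing required"*)
      echo "pairing: issue a pairing code and finish pairing" ;;
    *"token mismatch"*)
      echo "token: compare env in the CLI container with the volume JSON" ;;
    *)
      echo "unclassified: check for two gateways sharing one state directory" ;;
  esac
}
```

Usage: triage "Gateway closed (1008): pairing required" prints the pairing row, which keeps ticket titles consistent with the matrix.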

3. Five-step runbook: from ping to a successful channels probe

  1. Freeze namespaces: Document whether CLI runs on host, sidecar, or CI runner; draw one arrow from client to listener.
  2. Explicit gateway base: Set OPENCLAW_GATEWAY_URL (or your documented equivalent) to ws://openclaw-gateway:18789 or to the published host port. Never assume 127.0.0.1 across namespaces.
  3. Pairing gate: Finish pairing before testing business channels; keep two ticket templates for "unpaired" vs "wrong token".
  4. Single token source of truth: Pick env or file as canonical; make the other match or remove it.
  5. Accept with channels probe: Run one successful probe from the same environment and attach logs plus compose snippet to the change request.
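# From the container that runs `openclaw`
getent hosts openclaw-gateway || true   # does the service name resolve?
nc -vz openclaw-gateway 18789 || true   # is the port reachable from here?

# From the host, in the compose project directory
docker compose ps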
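
Step 4 (single token source of truth) can be checked without printing either secret. The sketch below is one way to do it, not OpenClaw tooling: compare_token_sources is a hypothetical helper, and the file path and the availability of jq in the CLI container are assumptions.

```shell
#!/bin/sh
# compare_token_sources ENV_TOKEN FILE_TOKEN
# Reports which token sources are set and whether they agree.
# Only a status word is printed, never the token values themselves.
compare_token_sources() {
  env_token=$1
  file_token=$2
  if [ -z "$env_token" ] && [ -z "$file_token" ]; then
    echo "none"
  elif [ -z "$env_token" ]; then
    echo "file-only"
  elif [ -z "$file_token" ]; then
    echo "env-only"
  elif [ "$env_token" = "$file_token" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Example wiring inside the CLI container (path and jq are assumptions):
#   file_token=$(jq -r '.gateway.auth.token // empty' /home/node/.openclaw/openclaw.json)
#   compare_token_sources "${OPENCLAW_GATEWAY_TOKEN:-}" "$file_token"
```

A result of mismatch points at dual sources of truth; env-only or file-only tells you which side is canonical today.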

4. When network_mode: "service:openclaw-gateway" is justified

Use it when the CLI must share the gateway's network namespace to avoid brittle DNS under the default bridge. Inside the shared namespace, 127.0.0.1:18789 does reach the gateway, which is convenient but makes localhost semantics confusing, so diagram it. Alternative: keep the CLI on the host and point WS at the published host port, aligning TLS termination at the reverse proxy.

# docker-compose.override.yml sketch (adapt to your release)
services:
  openclaw-cli:
    network_mode: "service:openclaw-gateway"
    environment:
      OPENCLAW_GATEWAY_TOKEN: "${GATEWAY_TOKEN}"
    volumes:
      - openclaw_state:/home/node/.openclaw:ro
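
Because the correct URL depends on where the CLI runs, it can help to encode the three placements in one place. This is an illustrative sketch: the function gateway_url and the placement names host, sidecar, and shared are hypothetical, and ws:// assumes no TLS on this hop.

```shell
#!/bin/sh
# gateway_url PLACEMENT [PUBLISHED_PORT]
# Prints the WebSocket base URL the CLI should use for a given placement.
gateway_url() {
  case "$1" in
    host)
      # CLI on the host: loopback reaches only the published port.
      echo "ws://127.0.0.1:${2:-18789}" ;;
    sidecar)
      # CLI in its own container on the compose network: use the service name.
      echo "ws://openclaw-gateway:18789" ;;
    shared)
      # network_mode "service:openclaw-gateway": loopback is the gateway's own.
      echo "ws://127.0.0.1:18789" ;;
    *)
      echo "unknown placement: $1" >&2
      return 1 ;;
  esac
}
```

The host and shared cases both print a loopback URL but for different reasons, which is exactly the confusion the diagram should capture.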

5. Citeable thresholds (replace with your measurements)

Discussion-grade numbers:

  • If TCP connect success from the CLI namespace falls below 97% over 100 probes, stop feature work and fix DNS, service names, or published ports first.
  • If you see more than three token mismatch events inside a single pairing window, assume dual sources until proven otherwise.
  • If lock wait on a shared state directory exceeds 2.5s at p95 on a remote Mac, split volumes or move CLI into a dedicated sidecar.
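
The first threshold can be measured with a short loop. The following is a sketch under stated assumptions: success_rate and below_threshold are hypothetical helper names, and the nc flags in the example (-z, -w) should be checked against the netcat variant in your image.

```shell
#!/bin/sh
# success_rate N CMD [ARGS...]
# Runs CMD N times and prints the integer percentage of zero exit codes.
success_rate() {
  total=$1
  shift
  [ "$total" -gt 0 ] || return 1
  ok=0
  i=0
  while [ "$i" -lt "$total" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    fi
    i=$((i + 1))
  done
  echo $((ok * 100 / total))
}

# below_threshold RATE MIN
# Succeeds (exit 0) when RATE is strictly under MIN.
below_threshold() {
  [ "$1" -lt "$2" ]
}

# Example, run from the CLI's namespace:
#   rate=$(success_rate 100 nc -z -w 2 openclaw-gateway 18789)
#   below_threshold "$rate" 97 && echo "fix DNS, service names, or ports first"
```

With 100 probes the integer percentage is exact; for other probe counts the division truncates, so a borderline result rounds toward failing the gate.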

6. Reading openclaw logs in three layers

Filter transport lines first, then auth and pairing, then channel-specific lines. Attach three screenshots to the ticket instead of narrative only. When read together with the official install.sh article, treat install ordering and network alignment as two separate acceptance gates.

| Layer | Question | Pass example |
| --- | --- | --- |
| L1 transport | Did the WS handshake succeed? | No unexplained 1006/1008 |
| L2 auth | Are token and pairing consistent? | No unauthorized loop after pairing |
| L3 channels | Is the channel online? | channels probe returns an interpretable state |

7. Remote Mac: launchd vs Docker for ports and volumes

Document who owns port 18789, whether state is a bind mount or named volume, and which unit stops first during upgrades. Pair with the SSH vs VNC guide to separate interactive VNC edits from unattended compose rolls.

8. FAQ

Will extra_hosts fix everything? Sometimes, but if the root cause is dual token sources, extra_hosts only reshuffles failure modes.

Bind gateway to 0.0.0.0? Read the attack-surface article first; public exposure is not the default answer.

Conflict with install.sh flow? No—install.sh establishes baseline order; this article focuses on Docker networking and auth alignment.

9. Case study: logs say listen, CLI still 1008

A frequent 2026 failure mode is a gateway listening in container A while developers run the CLI in container B against ws://127.0.0.1. It is obvious on paper yet expensive in multi-repo setups. Put the "namespace that runs openclaw" sentence at the top of the README and add a CI smoke test that runs the probe from the runner container.

A second class is forgotten OPENCLAW_GATEWAY_TOKEN lines in override files. Pick env-or-file discipline and add a PR checklist item for token env diffs.

A third class hits remote Macs when Time Machine or sync software mirrors state directories, creating ghost second gateways. Mark state directories as non-synced or consolidate on a dedicated remote node.

Pair this with the migration article to separate "machine migration" drills from "compose rebuild" drills; mixing them muddles rollback stories.

Docker is repeatable delivery under stronger constraints, not automatic simplicity. Attach compose, env prints, nc output, and three-layer log snippets before claiming production readiness.

10. n8n and webhook ingress

When n8n calls back into your stack, the HTTP ingress owner must match the WS token owner. Split HTTP 502 investigations from WS 1008 tickets using distinct titles.

11. Upgrade discipline: single gateway first

During breaking releases, shrink to one gateway and one CLI before re-enabling HA. Half-upgraded volumes produce half-consistent schemas; attach doctor output to the change record.

12. Closure: Docker proves repeatability, remote Mac carries the SLO

(1) Limits: Without documenting CLI namespaces, Docker troubleshooting becomes exponential; dual token sources make config edits feel random.

(2) Why remote Mac pools help: Dedicated hosts pin compose versions, volumes, and health checks away from developer laptops.

(3) MACGPU fit: If you want a low-friction trial of aligned remote Mac gateways instead of laptop servers, MACGPU offers rentable nodes and public help entry points; the CTA below links to plans without login.

(4) Final gate: Do not claim production readiness without a successful channels probe plus compose/env/token alignment evidence.

13. Links to install matrix and install.sh

If install.sh is not green yet, read that article first. If it is green but Docker fails, restart from the L1 transport checks here. The npm / Docker / pnpm matrix helps decide whether the CLI belongs on the host or in a sidecar.

14. Observability: annotate blast radius per compose project

When multiple compose projects share a host, label which project owns port 18789 and which owns webhook ingress. On-call should open a single page with that map instead of grepping Slack history.

15. Security note: token in env vs secret files

Environment variables appear in process listings and crash dumps. If your threat model forbids that, prefer secret files with correct volume permissions and remove env duplicates. Align with the Gateway attack-surface guide for bind addresses and reverse proxies.
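
If you choose secret files, the permission check can be scripted. This is a minimal sketch: file_mode and is_private_file are hypothetical helpers, and treating 600 or 400 as the only acceptable modes is an assumption you should match to your own threat model.

```shell
#!/bin/sh
# file_mode PATH
# Prints the octal permission bits. GNU and BusyBox stat use -c,
# BSD/macOS stat uses -f, so try one and fall back to the other.
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# is_private_file PATH
# Succeeds when only the owner can read the file (mode 600 or 400).
is_private_file() {
  case "$(file_mode "$1")" in
    600|400) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (path is an assumption based on the mount shown earlier):
#   is_private_file /home/node/.openclaw/openclaw.json || echo "token file too open"
```

The check says nothing about who owns the file; inside a container, also confirm the owning UID matches the user the gateway runs as.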

16. CI parity: reproduce the failing namespace in GitHub Actions

Many teams only reproduce OpenClaw issues on laptops. Add a job that builds the same compose project and runs docker compose run --rm openclaw-cli openclaw channels probe so regressions in WS URLs are caught before merge. Cache images sensibly but still refresh tags when security patches land. If your CI runners cannot resolve Docker service names, document that limitation explicitly instead of silently falling back to 127.0.0.1, which hides the production failure mode.
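
One way to stop a silent 127.0.0.1 fallback is a guard that runs before the probe. The sketch below is illustrative: is_loopback_url, ci_guard, and the ALLOW_LOOPBACK switch are hypothetical names, and OPENCLAW_GATEWAY_URL is used as this article uses it (your release may name it differently).

```shell
#!/bin/sh
# is_loopback_url URL
# Succeeds when the URL's host is a loopback name or address.
is_loopback_url() {
  hostport=${1#*://}        # drop the scheme
  hostport=${hostport%%/*}  # drop any path
  case "$hostport" in
    127.*|localhost|localhost:*|"[::1]"|"[::1]":*) return 0 ;;
    *) return 1 ;;
  esac
}

# ci_guard URL
# Fails on a loopback URL unless ALLOW_LOOPBACK=1 is set, for example when
# the CLI deliberately shares the gateway's network namespace.
ci_guard() {
  if is_loopback_url "$1" && [ "${ALLOW_LOOPBACK:-0}" != "1" ]; then
    echo "refusing loopback gateway URL in CI: $1" >&2
    return 1
  fi
}

# Example CI step:
#   ci_guard "$OPENCLAW_GATEWAY_URL" &&
#     docker compose run --rm openclaw-cli openclaw channels probe
```

Making the override explicit means a loopback URL in CI is a documented decision rather than an accident.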

17. Reverse proxies and WebSocket upgrade headers

When TLS terminates at nginx or Caddy, verify upgrade headers, idle timeouts, and buffering settings. A proxy that buffers WebSocket frames can look like random 1006 closes under load. Capture proxy access logs alongside OpenClaw logs and correlate timestamps. If you terminate TLS on the host while Gateway stays plain HTTP inside Docker, keep that diagram in the repo so new hires do not reintroduce double encryption or wrong scheme in OPENCLAW_GATEWAY_URL.
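
To check whether the proxy forwards the upgrade at all, a manual handshake with curl is often enough. This is a sketch, not a full WebSocket client: handshake_status is a hypothetical helper, gateway.example.com and the path "/" are placeholders, and the key is the sample nonce from RFC 6455.

```shell
#!/bin/sh
# handshake_status
# Reads an HTTP response head on stdin and prints the status code from the
# first line. 101 means the proxy passed the upgrade through to the backend.
handshake_status() {
  awk 'NR == 1 { print $2; exit }'
}

# Example (hostname and path are placeholders):
#   curl -sS -i -N --http1.1 --max-time 5 \
#     -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
#     -H 'Sec-WebSocket-Version: 13' \
#     -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
#     https://gateway.example.com/ | handshake_status
```

After a 101, curl keeps the connection open until --max-time expires, so a timeout exit code is expected in the passing case. A 502 or 400 here belongs in the proxy ticket, not the WS 1008 ticket.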

18. Runbook handoff template for on-call

Paste the following into your incident template: compose project name, image tags, WS URL as seen from the failing namespace, the OPENCLAW_ variables from env in that namespace, the path on the volume of the file that holds gateway.auth.token, the nc result, and the last twenty lines of gateway logs filtered by level. When legal or procurement blocks cloud GPUs, a documented remote Mac pool remains a pragmatic way to add stable Apple Silicon capacity for always-on agents without capital purchases.
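
Since the template asks for OPENCLAW_ variables, it is worth masking secrets before they land in a ticket. A minimal sketch, assuming secret-bearing variable names contain TOKEN, SECRET, or KEY; redact_openclaw_env is a hypothetical helper.

```shell
#!/bin/sh
# redact_openclaw_env
# Reads `env` output on stdin, keeps only OPENCLAW_ variables, and replaces
# the value of any variable whose name contains TOKEN, SECRET, or KEY.
redact_openclaw_env() {
  grep '^OPENCLAW_' |
    sed -E 's/^([A-Z0-9_]*(TOKEN|SECRET|KEY)[A-Z0-9_]*)=.*/\1=<redacted>/'
}

# Example collection from the host (service names as used in this article):
#   docker compose ps
#   docker compose exec openclaw-cli env | redact_openclaw_env
#   docker compose logs --tail 20 openclaw-gateway
```

The redacted output still shows whether the token variable is set at all, which is the fact the dual-source diagnosis needs.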

19. Version skew between CLI and Gateway images

Splitting CLI and Gateway across different image tags is valid, but only when your release notes explicitly support that matrix. If the CLI adds a handshake field that the gateway build does not understand, you will see opaque disconnects that resemble networking bugs. Pin both images to the same release train during incidents, then bisect. Record the digests in your incident doc, not only floating tags. For remote Mac fleets, automate a weekly digest check so drift does not accumulate silently while engineers focus on feature work.
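
A quick skew check compares the tags of the two image references before anyone starts debugging the network. This is a sketch under assumptions: tag_of and same_tag are hypothetical helpers, the image names in the example are made up, and matching tags is only a proxy for "same release train".

```shell
#!/bin/sh
# tag_of IMAGE_REF
# Prints the tag of an image reference, or "latest" when none is given.
tag_of() {
  ref=${1%%@*}      # drop any @sha256:... digest suffix
  name=${ref##*/}   # last path component, so a registry port is ignored
  case "$name" in
    *:*) echo "${name##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# same_tag REF_A REF_B
# Succeeds when both references carry the same tag.
same_tag() {
  [ "$(tag_of "$1")" = "$(tag_of "$2")" ]
}

# Recording a digest for the incident doc (run on the Docker host; the
# RepoDigests list can be empty for locally built images):
#   docker image inspect --format '{{index .RepoDigests 0}}' IMAGE
```

Usage: same_tag registry.local:5000/acme/openclaw-gateway:2026.1 registry.local:5000/acme/openclaw-cli:2026.1 succeeds, while a missing tag on either side is treated as latest and flagged as skew.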

20. Documentation debt is operational debt

Every hour spent rediscovering 127.0.0.1 semantics is an hour not spent improving prompts or tools. Invest once in a one-page topology and link it from README, runbooks, and onboarding. When executives ask whether OpenClaw is production ready, hand them the probe log and the compose digest, not a verbal assurance. Remote Mac rental does not remove the need for good docs, but it does give you a stable place to host the canonical stack that matches what production actually runs.

21. Post-incident review prompts

After closing a gateway incident, answer five questions in writing: which namespace initiated the CLI, which file or env variable ultimately held the token, whether pairing was attempted before token edits, whether any proxy sat on the path, and which compose digest was live during the outage. Missing any one answer usually means the incident will repeat under a different symptom string next quarter.

Finally, treat macOS minor upgrades on remote hosts as semver bumps for your stack: rerun probe and nc checks after each upgrade window before declaring the fleet healthy again.