Multi-Agent
ORCH_
PROD_
METAL_2026.
Single-agent LLM pipeline ломается на scale: context window overflow, capability dilution, serial latency, SPOF. Benchmark data: Google distributed MAS — 6× speedup (1h→10min). AdaptOrch — topology beats model selection by 12–23% on SWE-bench. Этот hardcore guide: MAS primitives → 6 orchestration patterns (LangGraph Send API, reducers) → framework matrix → MCP+A2A wire format → PostgresSaver checkpointing → CircuitBreaker state machine → OpenTelemetry trace propagation → MAST failure taxonomy → decision tree → 2026 trends → 5-step runbook → metrics → Metal unified memory case study.
1. Single-agent bottleneck: root cause analysis
| Bottleneck | Mechanism | Production impact |
|---|---|---|
| Context window saturation | Intermediate artifacts fill KV cache | Reasoning quality cliff after step 4–5 |
| Capability dilution | One prompt handles RAG + codegen + review | Error rate +40–60% vs specialized agents |
| Serial execution | Total latency = Σ(step_latency) | No fan-out parallelism |
| Single point of failure | No circuit isolation | One hallucination cascades entire graph |
MLflow 2026 report + Google Agent Bake-Off: distributed multi-agent → 1h → 10min. AdaptOrch (2026): orchestration topology > base model, SWE-bench uplift 12–23%.
2. MAS primitives
2.1 Formal definition
MAS = N independent LLM agents + explicit communication protocol + orchestration layer, solving tasks infeasible for monolithic agent due to context/compute/specialization constraints.
| Property | Implementation requirement |
|---|---|
| Role isolation | Single responsibility: retrieve | reason | generate | validate |
| Tool scoping | Per-agent MCP server binding, least-privilege tool list |
| State isolation | Separate TypedDict/checkpoint thread per agent subgraph |
| Hot-swappability | Agent Card versioning via A2A /.well-known/agent.json |
2.2 Control topology
3. Six orchestration patterns (covers 95%+ prod workloads)
Pattern 1: Sequential pipeline (StateGraph linear DAG)
Strict edge ordering: [retriever] → [analyzer] → [writer] → [reviewer] → [output]. Zero conditional edges — maximum auditability for compliance pipelines.
Latency = Σ(T_i). Failure mode: any node exception blocks downstream — wrap with CircuitBreaker per node.
Pattern 2: Parallel fan-out / fan-in (Send API + reducer)
Supervisor emits Send() commands → N worker subgraphs execute concurrently → reducer aggregates via Annotated[list, operator.add]. Latency = max(T1…Tn) + merge overhead.
Critical: LangGraph Send API spawns parallel branches in single invoke(); reducer eliminates manual mutex/lock on shared state.
Pattern 3: Hierarchical supervisor-worker (two-tier routing)
Layer 1: keyword hash-map routing (<1ms, zero LLM tokens). Layer 2: LLM classifier fallback for ambiguous intent.
Pattern 4: Swarm / network (AutoGen GroupChat)
P2P message passing, no central coordinator. Termination via max_round hard limit — mandatory for prod to prevent infinite token burn.
⚠️ Non-deterministic. Prefer hierarchical supervisor for prod SLA-bound workloads.
Pattern 5: Blackboard architecture
Shared structured workspace (Redis/Postgres JSONB). Agents poll preconditions → read/write blackboard slots. No explicit scheduler — event-driven activation. Optimal for hour/day-scale async pipelines with heterogeneous agent types.
Pattern 6: Hybrid (production default)
Typical stack: Intent Router → Supervisor → parallel research fan-out + QA sequential pipeline. Simple queries bypass multi-agent graph (direct LLM response); complex reports traverse full orchestration chain.
4. Framework matrix: LangGraph vs CrewAI vs AutoGen
| Dimension | LangGraph | CrewAI | AutoGen |
|---|---|---|---|
| Paradigm | State machine graph (Pregel-like) | Role-based crew | Conversational multi-agent |
| Languages | Python / JS/TS | Python | Python / .NET |
| State persistence | PostgresSaver / SQLite natively | Custom impl required | Limited checkpoint support |
| HITL | interrupt() + resume API | Custom | HumanProxyAgent |
| Observability | LangSmith OTel export | Minimal | Azure Monitor integration |
| Production readiness | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Rapid prototype | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Azure stack | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
LangGraph: finance/healthcare compliance, complex conditional loops, PostgresSaver checkpointing. CrewAI: 1–2 day MVP with role YAML. AutoGen: Azure-native, multi-round debate reasoning.
5. Communication stack: MCP (vertical) + A2A (horizontal)
5.1 MCP wire protocol
JSON-RPC 2.0 over stdio or Streamable HTTP+SSE. Runtime discovery via tools/list. See MCP Server development guide.
5.2 A2A v1.0 specification
Google open-source April 2025, v1.0 GA 2026, 50+ ecosystem partners. Agent Card at /.well-known/agent.json (capabilities, input/output schemas). Task delegation: JSON-RPC 2.0 message/send with correlation_id for trace propagation.
6. Production engineering
6.1 Checkpointing & resume (PostgresSaver)
6.2 Human-in-the-Loop (interrupt/resume)
6.3 CircuitBreaker state machine
Three-state FSM: CLOSED → (failures ≥ threshold) → OPEN → (recovery_timeout elapsed) → HALF_OPEN → (probe success) → CLOSED. Config: failure_threshold=5, recovery_timeout=60s. Prevents agent cascade failure when upstream LLM API returns 429/5xx.
6.4 Token budget enforcement
7. Observability: MAST failure taxonomy
MAST team analyzed 1642 execution traces. Failure distribution:
| Failure class | Share | Root cause examples |
|---|---|---|
| System design defects | 41.77% | Duplicate steps, wrong tool selection, context overflow, missing termination |
| Inter-agent misalignment | 36.94% | Handoff context loss, hallucination propagated as ground truth |
| Task verification failure | 21.30% | Premature termination, incomplete validation gate |
57% orgs run agents in prod; only 8% implemented LLM observability — silent failures return HTTP 200 with wrong output.
SLO targets: task_success_rate >85%, e2e_latency_p95 <30s, agent_error_rate <5%, output_quality_score (LLM-as-Judge 1–5 scale). OpenTelemetry correlation_id propagated across all agent spans via W3C traceparent header.
8. Anti-patterns & guardrails
❌ Context contamination — Schema validation at every handoff; reject payloads with confidence_score <0.7. ❌ Infinite loop / token burn — MAX_ITERATIONS=10, MAX_TOOL_CALLS=20, MAX_TOTAL_TOKENS=50,000; LangGraph interrupt_before=["high_cost_tool"]. ❌ Over-engineering — Optimal prod agent count: 3–8; start with sequential pipeline. ❌ Demo→prod gap — ProductionGuardrails: input cap 10,000 chars, prompt injection regex, PII redaction filter.
9. Decision tree (topology selection)
10. Summary & 2026 trends
Core axioms: ① topology > model selection; ② start with pipeline; ③ MCP+A2A is industry standard; ④ observability is mandatory; ⑤ 3–8 agents optimal.
2026 trends: federated orchestration (multi-team sub-orchestrators sharing route policies), multimodal multi-agent pipelines, adaptive topology selection (AdaptOrch), EU AI Act mandatory decision audit chain.
11. Five-step production runbook
Step 1 — Validate core value with sequential pipeline (retrieve→analyze→output). Step 2 — Apply decision tree, model in LangGraph StateGraph. Step 3 — Wire MCP tool layer + A2A Agent Card discovery. Step 4 — PostgresSaver persistence + CircuitBreaker + token budget + OpenTelemetry tracing. Step 5 — Handoff schema validation + LLM-as-Judge sampling + HITL on high-risk nodes.
12. Reference metrics
| Metric | Value |
|---|---|
| Google multi-agent speedup | 6× (1h→10min) |
| AdaptOrch topology uplift | 12–23% |
| Optimal prod agent count | 3–8 |
| Agents in prod / observability done | 57% / 8% |
| A2A ecosystem partners | 50+ |
| E2E success rate target | >85% |
13. Case study: local Mac orchestrator + remote worker cluster
Team ran LangGraph Orchestrator + 5 Worker agents (retrieve/code/analytics/review/synthesis) on MacBook Pro 32GB unified memory, each binding 2–3 MCP servers. Parallel fan-out saturated 28GB RAM → thermal throttling → P95 latency 8s→45s (Metal GPU contention with background MLX inference). Migration: Orchestrator stays local; 5 Workers + MCP cluster on remote Mac mini 64GB unified memory; A2A over HTTP delegation; PostgresSaver checkpoints on remote node. Result: P95 12s, token cost −35% (no throttle-induced retries).
Cloud VPS runs agents, but for graphics/multimedia + AI toolchain workloads parallel with Xcode/ComfyUI/Final Cut, macOS Apple Silicon unified memory delivers superior concurrent agent throughput via zero-copy Metal buffer sharing.
Local Mac = orchestration + dev validation. 7×24 production worker cluster = remote Mac node with launchd keep-alive and A2A HTTP reverse proxy pre-configured.
For stable multi-agent worker + MCP server hosting: MACGPU remote Mac nodes — unified memory for concurrent agents, launchd process supervision, A2A HTTP reverse proxy — from demo to production SLA.