Multi-Agent AI архитектура 2026: hardcore-гид от паттернов до production

Single-agent LLM pipeline ломается на scale: context window overflow, capability dilution, serial latency, SPOF. Benchmark data: Google distributed MAS — 6× speedup (1h→10min). AdaptOrch — topology beats model selection by 12–23% on SWE-bench. Этот hardcore guide: MAS primitives → 6 orchestration patterns (LangGraph Send API, reducers) → framework matrix → MCP+A2A wire format → PostgresSaver checkpointing → CircuitBreaker state machine → OpenTelemetry trace propagation → MAST failure taxonomy → decision tree → 2026 trends → 5-step runbook → metrics → Metal unified memory case study.

1. Single-agent bottleneck: root cause analysis

Bottleneck	Mechanism	Production impact
Context window saturation	Intermediate artifacts fill KV cache	Reasoning quality cliff after step 4–5
Capability dilution	One prompt handles RAG + codegen + review	Error rate +40–60% vs specialized agents
Serial execution	Total latency = Σ(step_latency)	No fan-out parallelism
Single point of failure	No circuit isolation	One hallucination cascades entire graph

MLflow 2026 report + Google Agent Bake-Off: distributed multi-agent → 1h → 10min. AdaptOrch (2026): orchestration topology > base model, SWE-bench uplift 12–23%.

2. MAS primitives

2.1 Formal definition

MAS = N independent LLM agents + explicit communication protocol + orchestration layer, solving tasks infeasible for monolithic agent due to context/compute/specialization constraints.

Property	Implementation requirement
Role isolation	Single responsibility: retrieve \| reason \| generate \| validate
Tool scoping	Per-agent MCP server binding, least-privilege tool list
State isolation	Separate TypedDict/checkpoint thread per agent subgraph
Hot-swappability	Agent Card versioning via A2A /.well-known/agent.json

2.2 Control topology

Centralized              Decentralized             Hierarchical
   [Orchestrator]           A ←→ B ←→ C               [Root Supervisor]
  /    |    \                    ↕       ↕              /           \
[A]  [B]  [C]                 D ←→ E ←→ F        [Sub-Supervisor] [Sub-Supervisor]
Deterministic routing      P2P gossip, hard debug   O(log N) delegation depth

3. Six orchestration patterns (covers 95%+ prod workloads)

Pattern 1: Sequential pipeline (StateGraph linear DAG)

Strict edge ordering: [retriever] → [analyzer] → [writer] → [reviewer] → [output]. Zero conditional edges — maximum auditability for compliance pipelines.

from langgraph.graph import StateGraph, START, END
from typing import TypedDict

class PipelineState(TypedDict):
    query: str; retrieved_docs: str; analysis: str; final_report: str

def retrieval_agent(state):
    return {"retrieved_docs": search_knowledge_base(state["query"])}

def analysis_agent(state):
    result = llm.invoke(f"Analyze: {state['retrieved_docs']}")
    return {"analysis": result.content}

builder = StateGraph(PipelineState)
builder.add_node("retriever", retrieval_agent)
builder.add_node("analyzer", analysis_agent)
builder.add_edge(START, "retriever")
builder.add_edge("retriever", "analyzer")
builder.add_edge("analyzer", END)
pipeline = builder.compile()

Latency = Σ(T_i). Failure mode: any node exception blocks downstream — wrap with CircuitBreaker per node.

Pattern 2: Parallel fan-out / fan-in (Send API + reducer)

Supervisor emits Send() commands → N worker subgraphs execute concurrently → reducer aggregates via Annotated[list, operator.add]. Latency = max(T1…Tn) + merge overhead.

from langgraph.types import Send
from typing import Annotated
import operator

class ResearchState(TypedDict):
    query: str
    research_results: Annotated[list, operator.add]
    final_synthesis: str

def supervisor(state):
    return [Send("research_worker", {"query": state["query"], "source": s})
            for s in ["academic", "industry", "news"]]

def research_worker(state):
    return {"research_results": [search_by_source(state["query"], state["source"])]}

Critical: LangGraph Send API spawns parallel branches in single invoke(); reducer eliminates manual mutex/lock on shared state.

Pattern 3: Hierarchical supervisor-worker (two-tier routing)

Layer 1: keyword hash-map routing (<1ms, zero LLM tokens). Layer 2: LLM classifier fallback for ambiguous intent.

KEYWORD_ROUTING = {"code": "code_agent", "code": "code_agent",
                   "search": "search_agent", "data": "data_agent"}

def supervisor_with_fast_path(state):
    query = state["query"].lower()
    for kw, agent in KEYWORD_ROUTING.items():
        if kw in query:
            return {"next": agent}  # fast path: O(1) lookup
    decision = llm.invoke(f"Route to best agent: {state['query']}")
    return {"next": decision.content.strip()}  # slow path: LLM inference

Pattern 4: Swarm / network (AutoGen GroupChat)

P2P message passing, no central coordinator. Termination via max_round hard limit — mandatory for prod to prevent infinite token burn.

groupchat = autogen.GroupChat(
    agents=[human_proxy, reviewer_1, reviewer_2],
    messages=[], max_round=6  # hard stop — non-negotiable in prod
)

⚠️ Non-deterministic. Prefer hierarchical supervisor for prod SLA-bound workloads.

Pattern 5: Blackboard architecture

Shared structured workspace (Redis/Postgres JSONB). Agents poll preconditions → read/write blackboard slots. No explicit scheduler — event-driven activation. Optimal for hour/day-scale async pipelines with heterogeneous agent types.

Pattern 6: Hybrid (production default)

Typical stack: Intent Router → Supervisor → parallel research fan-out + QA sequential pipeline. Simple queries bypass multi-agent graph (direct LLM response); complex reports traverse full orchestration chain.

4. Framework matrix: LangGraph vs CrewAI vs AutoGen

Dimension	LangGraph	CrewAI	AutoGen
Paradigm	State machine graph (Pregel-like)	Role-based crew	Conversational multi-agent
Languages	Python / JS/TS	Python	Python / .NET
State persistence	PostgresSaver / SQLite natively	Custom impl required	Limited checkpoint support
HITL	interrupt() + resume API	Custom	HumanProxyAgent
Observability	LangSmith OTel export	Minimal	Azure Monitor integration
Production readiness	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
Rapid prototype	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐
Azure stack	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐

LangGraph: finance/healthcare compliance, complex conditional loops, PostgresSaver checkpointing. CrewAI: 1–2 day MVP with role YAML. AutoGen: Azure-native, multi-round debate reasoning.

5. Communication stack: MCP (vertical) + A2A (horizontal)

Agent-1 ←──── A2A (JSON-RPC 2.0) ────→ Agent-2
   │                                        │
MCP (stdio/HTTP+SSE)                   MCP (stdio/HTTP+SSE)
   ↓                                        ↓
[tools/list, tools/call]              [tools/list, tools/call]
   ↓                                        ↓
[DB / API / filesystem]               [DB / API / filesystem]

MCP: Agent ↔ external tools (vertical integration layer)
A2A: Agent ↔ Agent (horizontal delegation layer)

5.1 MCP wire protocol

JSON-RPC 2.0 over stdio or Streamable HTTP+SSE. Runtime discovery via tools/list. See MCP Server development guide.

5.2 A2A v1.0 specification

Google open-source April 2025, v1.0 GA 2026, 50+ ecosystem partners. Agent Card at /.well-known/agent.json (capabilities, input/output schemas). Task delegation: JSON-RPC 2.0 message/send with correlation_id for trace propagation.

6. Production engineering

6.1 Checkpointing & resume (PostgresSaver)

from langgraph.checkpoint.postgres import PostgresSaver
with PostgresSaver.from_conn_string("postgresql://...") as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "user-session-12345"}}
    # crash at step 3 → resume from last checkpoint, not step 1
    result = graph.invoke({"query": "Q2 earnings analysis"}, config)

6.2 Human-in-the-Loop (interrupt/resume)

from langgraph.types import interrupt
human_decision = interrupt({
    "proposed_action": proposed_action,
    "risk_level": "HIGH",
    "message": "This will mutate production DB — confirm"
})
# graph pauses, persists state, resumes on human input

6.3 CircuitBreaker state machine

Three-state FSM: CLOSED → (failures ≥ threshold) → OPEN → (recovery_timeout elapsed) → HALF_OPEN → (probe success) → CLOSED. Config: failure_threshold=5, recovery_timeout=60s. Prevents agent cascade failure when upstream LLM API returns 429/5xx.

6.4 Token budget enforcement

class TokenBudgetManager:
    def check_budget(self, agent_id: str, estimated_tokens: int):
        if self.usage[agent_id] + estimated_tokens > self.limits[agent_id]:
            raise BudgetExceededException(agent_id)
    def record_usage(self, agent_id: str, actual_tokens: int):
        self.usage[agent_id] += actual_tokens

7. Observability: MAST failure taxonomy

MAST team analyzed 1642 execution traces. Failure distribution:

Failure class	Share	Root cause examples
System design defects	41.77%	Duplicate steps, wrong tool selection, context overflow, missing termination
Inter-agent misalignment	36.94%	Handoff context loss, hallucination propagated as ground truth
Task verification failure	21.30%	Premature termination, incomplete validation gate

57% orgs run agents in prod; only 8% implemented LLM observability — silent failures return HTTP 200 with wrong output.

SLO targets: task_success_rate >85%, e2e_latency_p95 <30s, agent_error_rate <5%, output_quality_score (LLM-as-Judge 1–5 scale). OpenTelemetry correlation_id propagated across all agent spans via W3C traceparent header.

8. Anti-patterns & guardrails

❌ Context contamination — Schema validation at every handoff; reject payloads with confidence_score <0.7. ❌ Infinite loop / token burn — MAX_ITERATIONS=10, MAX_TOOL_CALLS=20, MAX_TOTAL_TOKENS=50,000; LangGraph interrupt_before=["high_cost_tool"]. ❌ Over-engineering — Optimal prod agent count: 3–8; start with sequential pipeline. ❌ Demo→prod gap — ProductionGuardrails: input cap 10,000 chars, prompt injection regex, PII redaction filter.

9. Decision tree (topology selection)

Strict linear dependency?
├─ Yes → Subtasks parallelizable?
│        ├─ No  → 【Sequential Pipeline】
│        └─ Yes → 【Fan-out + Pipeline Hybrid】
└─ No  → Central decision agent exists?
         ├─ Yes → Sub-teams required?
         │        ├─ No  → 【Supervisor-Worker】
         │        └─ Yes → 【Hierarchical Supervisors of Supervisors】
         └─ No  → Long-running async?
                  ├─ Yes → 【Blackboard Architecture】
                  └─ No  → agents ≤5 → 【Swarm + max_round hard limit】
                           else → 【Refactor to Hierarchical】

10. Summary & 2026 trends

Core axioms: ① topology > model selection; ② start with pipeline; ③ MCP+A2A is industry standard; ④ observability is mandatory; ⑤ 3–8 agents optimal.

2026 trends: federated orchestration (multi-team sub-orchestrators sharing route policies), multimodal multi-agent pipelines, adaptive topology selection (AdaptOrch), EU AI Act mandatory decision audit chain.

11. Five-step production runbook

Step 1 — Validate core value with sequential pipeline (retrieve→analyze→output). Step 2 — Apply decision tree, model in LangGraph StateGraph. Step 3 — Wire MCP tool layer + A2A Agent Card discovery. Step 4 — PostgresSaver persistence + CircuitBreaker + token budget + OpenTelemetry tracing. Step 5 — Handoff schema validation + LLM-as-Judge sampling + HITL on high-risk nodes.

12. Reference metrics

Metric	Value
Google multi-agent speedup	6× (1h→10min)
AdaptOrch topology uplift	12–23%
Optimal prod agent count	3–8
Agents in prod / observability done	57% / 8%
A2A ecosystem partners	50+
E2E success rate target	>85%

13. Case study: local Mac orchestrator + remote worker cluster

Team ran LangGraph Orchestrator + 5 Worker agents (retrieve/code/analytics/review/synthesis) on MacBook Pro 32GB unified memory, each binding 2–3 MCP servers. Parallel fan-out saturated 28GB RAM → thermal throttling → P95 latency 8s→45s (Metal GPU contention with background MLX inference). Migration: Orchestrator stays local; 5 Workers + MCP cluster on remote Mac mini 64GB unified memory; A2A over HTTP delegation; PostgresSaver checkpoints on remote node. Result: P95 12s, token cost −35% (no throttle-induced retries).

Cloud VPS runs agents, but for graphics/multimedia + AI toolchain workloads parallel with Xcode/ComfyUI/Final Cut, macOS Apple Silicon unified memory delivers superior concurrent agent throughput via zero-copy Metal buffer sharing.

Local Mac = orchestration + dev validation. 7×24 production worker cluster = remote Mac node with launchd keep-alive and A2A HTTP reverse proxy pre-configured.

For stable multi-agent worker + MCP server hosting: MACGPU remote Mac nodes — unified memory for concurrent agents, launchd process supervision, A2A HTTP reverse proxy — from demo to production SLA.