OpenRouter May 2026 Images × Context Length × Audio Charts: Gemini 3.5 Flash / Qwen3.7 Max Multimodal Traffic & Mac Three-Track Routing

According to openrouter.ai/rankings, the platform processes roughly 25T tokens per week after its Series B (May 26), split across seven parallel chart slices. The overall chart, Programming, and Tool Calls cover text and agent traffic; for images, audio, and million-token RAG, the Images, Context Length, and Audio Input charts are the ones that matter. Recent releases: Gemini 3.5 Flash (May 19, 1.05M context), Qwen3.7 Max (May 21, 1M context), Qwen3-ASR-Flash, and Gemini Embedding 2. This data-driven guide covers how to read the buckets, a three-chart snapshot, three-track routing on the Mac, six setup steps, a decision matrix, a case study, and acceptance criteria.

1. Problem statement: the overall, Programming, and Tool Calls charts do not cover multimodal traffic

Dimension mismatch: MiMo-V2-Pro at #1 overall says nothing about image or audio traffic. Programming measures code tokens, not OCR or screenshot QA. The Context Length chart is not a ranking of model windows: OpenRouter buckets requests by prompt + completion length per request (the default view is 1K–10K), not by a model card's 1M limit. Image pricing: Gemini 3 Flash is listed at about $0.0005/K images, while Recraft and xAI bill per image, so on a single key with no budget split, text is cheap and images are expensive on the same bill. Unified memory: Qwen-VL 7B at 4-bit takes about 6GB, and adding a 128K KV cache pushes an M2 with 32GB into swap. Audio: whisper.cpp runs locally for free but slowly; Qwen3-ASR-Flash is an API billed per second and is stronger on dialects, so the choice is a trade-off and not a case of "only local is possible".

2. Seven slices: Context Length buckets vs. model-card windows

Slice	Metric	Typical misreading	Mac use
Images	Image volume / model share	"best vision LLM"	Vision agent, OCR, screenshot QA
Context Length	Traffic per length bucket	"longest-context model"	Separate short completions from full RAG
Audio Input	Audio prompt volume	Read as a TTS chart	STT, meetings, podcasts
Top Models	Weekly tokens	Universal default	Text only (post of May 25)
Programming	Code traffic	"includes vision code"	IDE (post of May 26)
Tool Calls	Requests with tools	"includes vision-only tools"	Agent exec (post of May 27)

Ops rule: every Monday, align on Images + Context Length (100K+ bucket) + Audio Input; text agents keep following Tool Calls. Chinese models account for more than 60% of platform tokens (industry analysis); Qwen-VL and Qwen3-ASR are rising in Images and Audio; Gemini 3.x leads the high Context Length buckets for multimodal plus long-context workloads.

3. Images chart snapshot (week of 2026-05-28, Mac multimodal)

Tier	Model ID (example)	Use case	Mac path
T1	google/gemini-3-flash-preview, google/gemini-3.5-flash	Screenshot QA, UI review, multi-image agent	OpenRouter API; Qwen-VL 8B on MLX for drafts
T2	qwen/qwen3-vl-8b-instruct, google/gemma-4-31b	Auditable, offline prototype	MLX 4-bit @ 32K; stable at 64GB+
T3	recraft/, x-ai/grok--image	Generation	API; ComfyUI separate
T4	google/gemini-embedding-2	Cross-modal RAG	API; vector index local or remote

Images and the overall chart overlap by less than 40%: Gemini 3 Flash Preview often ranks higher in Images than in the overall text chart (screenshot feeds from Cursor and Claude Code). Dashboard: filter on modalities: image and set a separate $/day sub-budget for the vision agent vs. the coding agent.

4. Context Length buckets: 1K–10K vs. 100K–1M

Bucket	Request profile	Chart leaders	Mac
1K–10K	Chat, snippets	MiMo-V2-Pro, DeepSeek V4 Flash, Gemini 3 Flash	Local 30B or API T1
10K–100K	Mid-size RAG, PR diffs	Qwen3.6 Plus, Claude Sonnet 4.6, Kimi K2.6	API; local up to ~64K
100K–1M	Full text / codebase	Qwen3.7 Max, Gemini 3.5 Flash, GPT-5.5	API only; KV cache does not fit locally
1M+	Experimental	Llama 4 Scout (10M card)	API / remote Mac lab
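
Because the chart buckets requests by prompt + completion length rather than by model window, it can help to classify your own traffic the same way before comparing it to the chart. The following is a minimal sketch, not OpenRouter's implementation: the function name context_bucket is hypothetical, and it assumes 1K = 1000 tokens, an inclusive lower bound, an exclusive upper bound, and an extra "<1K" label that the table above does not list.

```shell
# Classify one request into a Context Length bucket by prompt + completion tokens.
# Assumptions (not from the source): 1K = 1000 tokens, lower bound inclusive,
# upper bound exclusive, and an extra "<1K" label for very short requests.
context_bucket() {
  total=$(( $1 + $2 ))
  if   [ "$total" -lt 1000 ];    then echo "<1K"
  elif [ "$total" -lt 10000 ];   then echo "1K-10K"
  elif [ "$total" -lt 100000 ];  then echo "10K-100K"
  elif [ "$total" -lt 1000000 ]; then echo "100K-1M"
  else                                echo "1M+"
  fi
}

context_bucket 6000 800       # short chat turn, even when sent to a 1M-window model
context_bucket 512000 4096    # long-document RAG request
```

The first call lands in 1K-10K and the second in 100K-1M, which illustrates why a model with a 1M card limit can still show most of its traffic in the short bucket.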

Qwen3.7 Max: released May 21, 1M context, $1.25/$3.75 per M tokens; rising in the high bucket and in agent use. Gemini 3.5 Flash: 1.05M context, $1.50/$9 per M tokens; dominates "long + multimodal". Mac RAG: embed locally (nomic/small) and generate only via API in the high bucket; do not run a 200-page PDF through a local 32B model.
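
To make the two price points comparable for a single high-bucket request, here is a small worked calculation. It is a sketch under one assumption: the price pairs above are read as input/output prices per million tokens. The helper name request_cost and the 512K-in / 4096-out request size are illustrative.

```shell
# Cost of one request in USD from per-million-token prices.
# Arguments: input tokens, output tokens, $/M input, $/M output.
# Assumption: the "$a/$b per M" pairs in the text are input/output prices.
request_cost() {
  awk -v in_tok="$1" -v out_tok="$2" -v p_in="$3" -v p_out="$4" \
    'BEGIN { printf "%.4f\n", in_tok / 1000000 * p_in + out_tok / 1000000 * p_out }'
}

request_cost 512000 4096 1.25 3.75   # Qwen3.7 Max      -> 0.6554
request_cost 512000 4096 1.50 9      # Gemini 3.5 Flash -> 0.8049
```

At this size the input side dominates both totals, so the higher output price of Gemini 3.5 Flash only starts to matter for requests with long completions.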

5. Audio Input chart: Qwen3-ASR vs. Whisper vs. GPT-4o-transcribe

Model	Strength	Billing	Mac
qwen/qwen3-asr-flash	Chinese/dialects, lyrics, far-field	Low $/s	API batch
openai/whisper-large-v3-turbo	Multilingual	$/s	API or whisper.cpp
openai/gpt-4o-transcribe	LLM pipeline	Higher	API only
MLX Whisper	$0 API cost, privacy	GPU time	M2+ 32GB

Audio volume is about one order of magnitude below Images but has the highest growth rate (podcasts, meeting agents, OpenClaw Voice). Three tracks: under 15 minutes, MLX locally; batch or dialect audio, Qwen3-ASR; audio that has to stay in the same context as an LLM pipeline, GPT-4o-transcribe.
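
The three-track rule can be written down as a small dispatcher. This is a sketch, not a finished router: the function name stt_route, the mode labels (single, batch, dialect, pipeline) and the mlx-whisper-local label are hypothetical, and sending long single files to Qwen3-ASR is an assumption, since the text only specifies the under-15-minute case for local MLX.

```shell
# Pick an STT track for one job.
# Arguments: duration in seconds, mode (single | batch | dialect | pipeline).
# Mode names are illustrative; long single files defaulting to the API is an assumption.
stt_route() {
  duration_s="$1"
  mode="${2:-single}"
  case "$mode" in
    pipeline)      echo "openai/gpt-4o-transcribe" ;;   # transcript shares LLM context
    batch|dialect) echo "qwen/qwen3-asr-flash" ;;
    *)
      if [ "$duration_s" -lt 900 ]; then                # under 15 minutes
        echo "mlx-whisper-local"
      else
        echo "qwen/qwen3-asr-flash"
      fi ;;
  esac
}

stt_route 600            # 10-minute clip
stt_route 5400 batch     # 90-minute podcast in the nightly batch
```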

6. Six steps: three charts → Mac multimodal routing

Step 1: Weekly chart and model-card pull

openrouter.ai/rankings: Images, Context Length (1K–10K and 100K+), Audio Input; API /api/v1/models → architecture.modality, pricing.

Step 2: Four load buckets

Vision only / vision + text agent / long RAG / STT: each gets a primary and a fallback; do not route everything to a single Gemini model.
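
One way to keep the four load buckets explicit is a lookup that returns a primary and a fallback per bucket. The sketch below is only an example pairing: the bucket labels and function names are hypothetical, the pairings are one possible reading of the tier tables in this guide, and the slug qwen/qwen3.7-max is assumed because the text names the model but not its ID.

```shell
# Print "primary fallback" for one of the four load buckets.
# Pairings are illustrative; "qwen/qwen3.7-max" is an assumed slug.
route_for() {
  case "$1" in
    vision)       echo "google/gemini-3-flash-preview qwen/qwen3-vl-8b-instruct" ;;
    vision-agent) echo "google/gemini-3.5-flash google/gemini-3-flash-preview" ;;
    long-rag)     echo "qwen/qwen3.7-max google/gemini-3.5-flash" ;;
    stt)          echo "qwen/qwen3-asr-flash openai/whisper-large-v3-turbo" ;;
    *)            echo "unknown load bucket: $1" >&2; return 1 ;;
  esac
}

# Convenience accessors for the two columns.
primary_for()  { route_for "$1" | cut -d' ' -f1; }
fallback_for() { route_for "$1" | cut -d' ' -f2; }

primary_for long-rag
fallback_for long-rag
```

Keeping the table in one function means a Monday chart review only has to touch a single place when a primary changes.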

Step 3: Cursor / OpenClaw vision route

Cursor screenshots → Images T1; in OpenClaw's openclaw.json, vision-primary ≠ text-primary.

Step 4: RAG split

Embed locally or via API; generate only with Qwen3.7 Max / Gemini 3.5 Flash in the high bucket.

Step 5: Audio dual track

Under 15 minutes: MLX Whisper; batch: Qwen3-ASR; queue: cron on the remote Mac.

Step 6: Sub-limits and 30-minute probe

Dashboard caps for Images and Audio; per route, run 10 samples and record latency, cost, and OOM events.

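# Modality filter for OpenRouter: list image-capable models with context
# length and pricing. index("image") returns null when "image" is absent
# from architecture.modality, so select() drops that model.
curl -s "https://openrouter.ai/api/v1/models" \
  | jq '.data[] | select(.architecture.modality | index("image"))
        | {id, context_length, pricing}' \
  > "/tmp/or-vision-$(date +%Y%m%d).json"

# Send one text + image request to a long-context multimodal model.
# The base64 payload is elided; substitute a real PNG before running.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3.5-flash",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Summarize this 80-page PDF section."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }],
    "max_tokens": 4096
  }'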

7. Three-track matrix: MLX local / OpenRouter API / remote Mac

Scenario	Path	Config	Acceptance (p95)
Screenshot QA	MLX local	Qwen-VL 8B :8082	<8s/image
Multi-image agent	API	Gemini 3.5 Flash	tool+vision >92%
200+ page RAG	API	Qwen3.7 Max 1M	TTFT <12s @ 512K in
Podcast batch STT	Remote + API	Qwen3-ASR queue	10h/night, 0 OOM
ComfyUI + vision in parallel	Remote 128GB	ComfyUI + macMLX	6h, 0 swap
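
The latency thresholds in the matrix are p95 values, so the probe samples need to be reduced to a p95 before they are compared. Below is a minimal sketch using the nearest-rank method; the function names p95 and passes_p95 are hypothetical, and the choice of nearest-rank (rather than an interpolating percentile) is an assumption. With only 10 samples per route, nearest-rank p95 is simply the slowest sample.

```shell
# p95 of newline-separated numeric samples on stdin (nearest-rank method).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # integer ceiling of 0.95 * NR
      print v[rank]
    }'
}

# Succeeds when the p95 of the samples on stdin is below the given limit.
passes_p95() {
  limit="$1"
  value=$(p95) || return 1
  awk -v v="$value" -v l="$limit" 'BEGIN { exit !(v < l) }'
}

# Ten screenshot-QA latencies in seconds, checked against the <8s/image row.
printf '%s\n' 3.1 4.0 2.2 7.9 5.5 6.1 3.3 4.4 2.9 5.0 | passes_p95 8 \
  && echo "screenshot-qa: pass" || echo "screenshot-qa: fail"
```

For rows with a larger sample count the same function applies unchanged; only the limit passed to passes_p95 differs per scenario.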

8. Case study: 4-person short-video team, multimodal spend −38%

“MacBook Pro M3 36GB: UI review and scripts via Claude, STT via GPT-4o-transcribe, OpenRouter bill $3,200/month. After reading the Images/Context/Audio charts: UI → Gemini 3 Flash (Images T1), the 200-page brief only through Qwen3.7 Max in the high bucket, STT split between Qwen3-ASR and MLX Whisper, ComfyUI thumbnails → MACGPU M4 Max 128GB night queue. After 30 days: $1,980 (−38%); no daytime swap from running Whisper and Qwen-VL in parallel.”

Key takeaway: the waste came from using expensive models for cheap modalities. The charts map platform traffic onto routing decisions; they are not benchmarks.
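
The −38% figure follows directly from the two monthly bills quoted in the case study. As a quick check (the helper name savings_pct is illustrative):

```shell
# Percentage saved when a monthly bill drops from "before" to "after".
savings_pct() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

savings_pct 3200 1980   # -> 38.1
```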

9. Outlook: input-modality charts and context buckets diverge

At 25T tokens/week, OpenRouter becomes infrastructure for vision + audio + million-token context. H2 2026: IDE and agent frameworks ship separate Images/Audio routes; Flash-class models take the short buckets, Qwen3.7 Max and Gemini 3.5 the long ones. Apple Silicon: one chip for MLX vision + VideoToolbox; the hybrid "embed locally + Whisper locally + ComfyUI remote" beats pure cloud-laptop setups.

On 32GB without a day/night split: use remote Apple Silicon (MACGPU M4 Max 128GB, macMLX + Whisper + ComfyUI) with the same OpenRouter key, so that Images/Audio peaks land on the LAN node.

10. Reference numbers & FAQ

① ~25T tokens/week (May 26). ② Chinese models >60%. ③ Gemini 3.5 Flash 1.05M. ④ Qwen3.7 Max 1M (May 21). ⑤ Gemini 3 Flash images ~$0.0005/K images. ⑥ Case study $3,200 → $1,980 (−38%).

Still read the overall chart? Yes, but for multimodal work rely primarily on Images/Context/Audio. Does the Context chart show the longest-context model? No, it shows request buckets. Run the Images #1 locally? Usually API; Qwen-VL 8B assists locally. MACGPU? Remote peak capacity: ComfyUI/Whisper queue; the laptop is for development only.