1. Pain: TTS is not “STT in reverse”
Live prompts optimize for p95 time to first audio; batch voice-over (VO) optimizes for timbre, loudness (LUFS), and sibilance. Without a frozen text contract, every regression gets blamed on “the model.” On Apple Silicon, TTS competes with VideoToolbox, WebAudio, and DAWs for unified memory. Cloud APIs change often, and Piper/ONNX needs the same execution provider (EP) and input-shape hygiene as your ONNX runbook.
2. Matrix: system, offline, neural API
| Dimension | AVSpeechSynthesizer | Offline Piper-class / ONNX | Neural API (hosted) |
|---|---|---|---|
| Latency posture | Good warm; timbre drifts with OS | Batch WAV strong; watch threads | RTT+TLS; measure streaming p95 |
| Quality levers | Tooling-grade expressiveness | Pinned weights; weaker prosody | Prosody + multi-speaker; cost/residency |
| Engineering hooks | AVAudioSession route/ducking | Reuse ONNX EP/shape gates | Idempotency, backoff, SSML caps |
2b. Five-step runbook
- Freeze the text contract (numbers, SSML subset) and hash it in CI.
- Split queues for live vs overnight batch; durable job IDs.
- Define audio egress (rate, depth, LUFS) per the FFmpeg guide.
- Track dual metrics: p95 first-audio latency and p95 real-time factor (RTF) per sentence bucket.
- Golden sentences + checksums per engine build.
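The first and last runbook steps can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the function names (`contract_hash`, `audio_checksum`, `failing_golden_ids`) and the contract fields are hypothetical, and byte-exact checksums assume the engine build produces deterministic audio for a given input. If your engine is not deterministic, a tolerance-based comparison would be needed instead.

```python
import hashlib
import json


def contract_hash(contract: dict) -> str:
    """Return a stable SHA-256 digest of the text contract.

    Keys are sorted so that reordering fields does not change the hash.
    """
    canonical = json.dumps(
        contract, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audio_checksum(audio_bytes: bytes) -> str:
    """Return the SHA-256 digest of rendered audio bytes."""
    return hashlib.sha256(audio_bytes).hexdigest()


def failing_golden_ids(rendered: dict, expected: dict) -> list:
    """Return the IDs of golden sentences that are missing or whose checksum changed.

    rendered maps text ID -> audio bytes from the current engine build.
    expected maps text ID -> checksum recorded for the pinned build.
    """
    failures = []
    for text_id, digest in sorted(expected.items()):
        audio = rendered.get(text_id)
        if audio is None or audio_checksum(audio) != digest:
            failures.append(text_id)
    return failures


if __name__ == "__main__":
    # Hypothetical contract fields, for illustration only.
    contract = {"number_style": "spoken", "ssml_subset": ["break", "say-as"]}
    print("contract hash:", contract_hash(contract))
```

In CI, the contract hash would be compared against a committed value, and any non-empty result from `failing_golden_ids` would fail the build and report the affected text IDs.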
3. Citeable thresholds
Design-review numbers (re-measure locally):
- Live: p95 first-audio under 200 ms, measured over 50 cold and 50 warm runs.
- Offline: if RTF p95 exceeds 0.35 with four lanes, add remote Mac workers.
- Ops: if more than 4 h/week is lost to queues or throttling, a remote pool beats more gadgets.
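A rough sketch of how the first two thresholds might be checked from raw measurements. It assumes the common definition of RTF as synthesis time divided by audio duration and uses the nearest-rank method for p95; both are assumptions, since neither is pinned down here. The function names are hypothetical.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor, assumed here to be synthesis time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def review_gates(first_audio_ms, rtf_values, live_limit_ms=200.0, rtf_limit=0.35):
    """Evaluate the live and offline thresholds against measured samples."""
    return {
        "live_ok": p95(first_audio_ms) < live_limit_ms,
        "add_remote_workers": p95(rtf_values) > rtf_limit,
    }


if __name__ == "__main__":
    # Made-up placeholder numbers: 50 cold + 50 warm first-audio samples in ms.
    first_audio = [180.0] * 50 + [90.0] * 50
    rtfs = [rtf(1.0, 5.0)] * 100
    print(review_gates(first_audio, rtfs))
```

Computing p95 per sentence bucket means calling `review_gates` once per bucket, so a slow bucket is not hidden by a healthy overall figure.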
4. Split matrix: remote Mac vs stay on API
When VO collides with LLM/STT peaks, isolate TTS workers on remote Apple Silicon (see the SSH/VNC guide). If audio must stay on-prem, run neural TTS on owned Macs, not on ad hoc laptops. When co-hosting with ONNX workloads, reuse the same EP and shape gates so that a CPU fallback does not stack on top of synthesis. Crackle is usually an AVAudioSession routing problem, not a weights problem.
5. FAQ
- STT and TTS in one process? Split the queues and resample once.
- Is remote faster? Only if GUI or memory contention, not text prep, was the bottleneck.
- Piper vs system voices? Pin builds; system voices drift across macOS updates.
6. Ops snapshot, observability, and closing
A marketing team saw healthy average RTF but bad p95 whenever Final Cut rendered while neural TTS streamed; moving synthesis to a headless remote Mac removed the GUI contention. Their TCO review covered data residency, SSML rework, and the risk of mis-read currency amounts, and it ended in a split across system speech, Piper batch, and API templates, with caching and idempotency. Rehearse worker kills to catch missing idempotency keys, and align LUFS and true-peak targets once, together with mastering.
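One way idempotency keys could work for synthesis jobs is sketched below. Everything here is illustrative: `idempotency_key` and `JobStore` are hypothetical names, the choice of key fields is an assumption, and the in-memory dictionary stands in for whatever durable store the queue actually uses.

```python
import hashlib


def idempotency_key(text, voice, engine_build, contract_digest):
    """Derive a deterministic key from everything that affects the rendered audio."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join([text, voice, engine_build, contract_digest])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JobStore:
    """In-memory stand-in for a durable job/result store."""

    def __init__(self):
        self._results = {}

    def run_once(self, key, synthesize):
        """Run synthesize() only if no result is stored under this key."""
        if key in self._results:
            return self._results[key]
        result = synthesize()
        self._results[key] = result
        return result


if __name__ == "__main__":
    store = JobStore()
    key = idempotency_key("Hello", "voice-a", "build-1", "contract-digest")
    # A retry after a simulated worker kill reuses the stored result.
    first = store.run_once(key, lambda: b"fake-wav")
    second = store.run_once(key, lambda: b"fake-wav")
    print(first == second)
```

A worker-kill rehearsal then amounts to retrying a job with the same key and confirming that synthesis, and any billed API call behind it, does not run twice.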
Ops charts should show first-audio p95, RTF p95 by bucket, and swap usage against NLE timelines. Reviewers need pinned builds, dictionary hashes, dual SLOs, failing text IDs with waveform checksums, and API backoff tied to invoices.
Limits → Mac → MACGPU: on shared laptops, tail latency turns into a political argument; remote Apple Silicon keeps Metal and Core Audio without GUI contention. Trial high-memory nodes via the CTA, and rerun golden sentences after OS upgrades. Keep separate normalization paths for AVSpeech, Piper, and APIs, and pair this with the STT article for single-pass resampling.