Sources & citations
Every numerical or factual claim in The Voice Agent Tradeoff Triangle series, traced to a verifiable public source with a capture date. The blog is defensible only if this document is.
Cold open + Thesis
Section 1 — The pipeline, end-to-end
Chart 1 — Latency waterfall
| Stage | Value | Source |
|---|---|---|
| Sarvam STT (end-of-speech VAD) | 50 ms | sarvam.ai/blogs/asr — Fast mode TTFT <150 ms 2026-05-24 |
| Network to LLM | 30 ms | Derived: in-region (ap-south-1) RTT optimized in production deployment |
| Gemini 2.5 Flash TTFT | 590 ms | artificialanalysis.ai/models/gemini-2-5-flash/providers — median for Google AI Studio 2026-05-24 |
| LLM to TTS handoff | 30 ms | Derived: serialization + dispatch, optimized in production |
| Sarvam Bulbul TTFB | 300 ms | sarvam.ai/blogs/sarvam-edge — vendor edge claim 260 ms; cloud value observed in production 2026-05-24 |
| Client buffer + playback | 100 ms | Derived: tight playback buffer, optimized in production |
| Total | 1,100 ms | Sum of above |
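
The total can be re-derived from the six stage values. The sketch below is a minimal check that assumes the stages run strictly in sequence, as the waterfall implies; the names `STAGES_MS` and `cumulative_waterfall` are illustrative and not part of any vendor API.

```python
# Stage latencies in milliseconds, copied from the Chart 1 table above.
STAGES_MS = {
    "Sarvam STT (end-of-speech VAD)": 50,
    "Network to LLM": 30,
    "Gemini 2.5 Flash TTFT": 590,
    "LLM to TTS handoff": 30,
    "Sarvam Bulbul TTFB": 300,
    "Client buffer + playback": 100,
}


def cumulative_waterfall(stages):
    """Return (stage, start_ms, end_ms) rows for a strictly sequential pipeline."""
    rows, elapsed = [], 0
    for name, duration in stages.items():
        rows.append((name, elapsed, elapsed + duration))
        elapsed += duration
    return rows


waterfall = cumulative_waterfall(STAGES_MS)
total_ms = waterfall[-1][2]  # end of the last stage
llm_share = STAGES_MS["Gemini 2.5 Flash TTFT"] / total_ms

for name, start, end in waterfall:
    print(f"{name:32s} {start:5d} -> {end:5d} ms")
print(f"Total: {total_ms} ms, LLM TTFT share: {llm_share:.1%}")
```

Under these numbers the LLM's time to first token accounts for just over half of the end-to-end figure.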
Section 2 — STT
Chart 2 — STT cost vs Hindi WER
| Provider | $/min | Hindi WER % | Source |
|---|---|---|---|
| Sarvam Saaras v3 | 0.0059 | 19.31 | sarvam.ai/blogs/asr (IndicVoices top-10) |
| Sarvam Saarika v2.5 | 0.0059 | 11.81 | sarvam docs (Sarvam-internal Hindi) |
| ElevenLabs Scribe v2 Realtime | 0.0065 | 3.10 (vendor unrep.) | elevenlabs.io/speech-to-text/hindi |
| Mistral Voxtral Small | 0.006 | 7.69 | Voxtral paper Table 4 |
| Mistral Voxtral Mini | 0.003 | 9.26 | Voxtral paper Table 4 |
| AI4Bharat IndicWhisper | 0 (self-host) | 15.00 | Vistaar (FLEURS Hindi) |
| Google STT | 0.016 | 14.30 | Vistaar Kathbath Hindi (2023) |
| Azure STT | 0.0167 | 13.60 | Vistaar Kathbath Hindi (2023) |
| OpenAI Whisper v3 | 0.006 | 16.90 | OpenAI eval, Whisper v3 |
| Soniox v4 Realtime | 0.002 | 7.40 | Soniox vendor benchmark |
Section 3 — LLM
Chart 4 — LLM intelligence vs cost vs speed
Bubble chart of voice-agent-relevant LLMs. Every data point sourced below.
| Model | $/1M out | AA Index | TTFT (s) | Source |
|---|---|---|---|---|
| Claude Haiku 4.5 (non-reasoning) | 5.00 | 31 | 0.89 | Anthropic pricing + AA leaderboard |
| Claude Sonnet 4.6 (non-reasoning) | 15.00 | 44 | 1.34 | Anthropic pricing + AA |
| GPT-5.4 mini (no reasoning) | 4.50 | 23 | 0.60 | OpenAI pricing + AA |
| GPT-5.4 nano (no reasoning) | 1.25 | 24 | 0.66 | OpenAI pricing + AA |
| GPT-5.4 (non-reasoning) | 15.00 | 35 | 0.83 | OpenAI pricing + AA |
| Gemini 3 Flash | 2.50 | 35 | 0.90 | Google AI pricing + AA |
| Gemini 2.5 Flash | 2.50 | 30 (rev.) | 0.59 | Google AI pricing + AA |
| DeepSeek V4 Flash (non-reasoning) | 0.28 | 36 | ~1.50 | DeepSeek docs + AA |
| Grok 4.1 Fast | 0.50 | 39 | ~1.30 | xAI docs + AA |
| Grok 3 mini (high) | 0.50 | 32 | 0.68 | xAI docs + AA |
| Llama 3.3 70B (Groq) | 0.79 | 14 | ~0.50 | Groq pricing + AA |
| GPT-OSS 120B (Groq) | 0.60 | 28 | ~0.55 | Groq pricing + AA |
Section 4 — TTS
Section 5 — Four-axis radar (Chart 7)
Chart 7 normalizes five canonical stacks on the Cost / Latency / Intelligence / Language axes (0-1 scale; higher is better on every axis, so the cheapest stack scores highest on Cost). The values are editorial syntheses of the underlying data verified above. Each stack's positioning is defended by the per-component citations in Sections 2-4 and the cost decompositions in Section 6 of this audit.
| Stack | Cost | Latency | Intel | Lang | Defense |
|---|---|---|---|---|---|
| Cheap English | 0.95 | 0.85 | 0.30 | 0.10 | $98/mo (lowest), Groq ~0.50 s TTFT (Chart 4), Llama AA 14 (lowest of usable), English-only |
| Premium English | 0.30 | 0.80 | 0.85 | 0.30 | $253/mo (expensive), GPT-5.4 AA 35, English + passable Hindi |
| Production Indic | 0.75 | 0.55 | 0.65 | 0.95 | $157/mo, ~1.1s end-to-end, Gemini 2.5 Flash AA 30 (rev.) + Sarvam Indic-native, 22 Indian languages |
| Vapi managed | 0.10 | 0.70 | 0.70 | 0.30 | $1,120/mo (most expensive), GPT-4o-mini AA 23 |
| Gemini Live e2e | 0.55 | 0.90 | 0.60 | 0.40 | $140/mo, ~700ms (vendor), 9 Indian languages but weak code-switch |
Section 6 — Costs (Chart 8)
Scenario
200 operators × 2 calls/day × 60s/call × 20 working days = 8,000 voice-minutes/month. All stacks on Indian telephony for fair comparison.
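
The scenario arithmetic can be reproduced in a few lines. This is a sketch with illustrative constant names; the values are the ones stated above.

```python
# Scenario inputs, as stated in the text.
OPERATORS = 200
CALLS_PER_DAY = 2
SECONDS_PER_CALL = 60
WORKING_DAYS = 20

calls_per_month = OPERATORS * CALLS_PER_DAY * WORKING_DAYS
voice_minutes_per_month = calls_per_month * SECONDS_PER_CALL / 60

print(calls_per_month, voice_minutes_per_month)
```

Because each call lasts exactly 60 seconds, calls per month and voice-minutes per month are the same number, which is why the tables below multiply per-minute and per-call rates by the same 8000.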
Per-stack cost decomposition
Cheap English — $98/mo
| Component | $/mo | Calc | Source |
|---|---|---|---|
| STT (Whisper.cpp self-host) | $0 | OSS, runs on VPS | whisper.cpp + AssemblyAI self-hosting guide |
| LLM (Llama 3.3 70B on Groq) | $17.95 | (3000×$0.59 + 600×$0.79)/1M × 8000 calls | groq.com/pricing |
| TTS (Piper self-host) | $0 | OSS, real-time on CPU | github.com/rhasspy/piper |
| Self-host VM | $40 | ~$40/mo VPS (DigitalOcean/Hetzner) | DigitalOcean droplet pricing |
| Telephony (Plivo India SIP) | $40 | $0.005/min × 8000 | plivo.com/sip-trunking/pricing/in |
| Total | $97.95 |
Production Indic — $157/mo
| Component | $/mo | Calc | Source |
|---|---|---|---|
| STT (Sarvam Saaras v3) | $47.20 | $0.0059/min × 8000 | sarvam.ai/api-pricing |
| LLM (Gemini 2.5 Flash) | $19.20 | (3000×$0.30 + 600×$2.50)/1M × 8000 | ai.google.dev/gemini-api/docs/pricing |
| TTS (Sarvam Bulbul v3) | $50.40 | 175 chars × 8000 × $36/M | sarvam.ai/api-pricing |
| Telephony (Plivo India) | $40 | $0.005/min × 8000 | Plivo India SIP pricing |
| Total | $156.80 |
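
As a cross-check, the four lines of the Production Indic table can be recomputed from the rates in the Calc column. The sketch assumes 3000 input and 600 output tokens per call and 175 TTS characters per call, which is how the Calc column reads; `llm_cost` is a hypothetical helper, not a vendor function.

```python
CALLS = 8000    # calls per month (scenario)
MINUTES = 8000  # voice-minutes per month (scenario)


def llm_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m, calls):
    """Monthly LLM cost in dollars, given per-call tokens and $/1M-token prices."""
    per_call = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return per_call * calls


stt = 0.0059 * MINUTES                     # Sarvam Saaras v3, $/min
llm = llm_cost(3000, 600, 0.30, 2.50, CALLS)  # Gemini 2.5 Flash
tts = 175 * CALLS * 36 / 1_000_000         # Sarvam Bulbul v3, $36 per 1M chars
telephony = 0.005 * MINUTES                # Plivo India, $/min

total = stt + llm + tts + telephony
print(f"STT ${stt:.2f}  LLM ${llm:.2f}  TTS ${tts:.2f}  "
      f"Telephony ${telephony:.2f}  Total ${total:.2f}")
```

The same function reproduces the other stacks' LLM lines when given their token prices.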
Premium English — $253/mo
| Component | $/mo | Calc | Source |
|---|---|---|---|
| STT (Deepgram Nova-3) | $38.40 | $0.0048/min × 8000 | deepgram.com/pricing |
| LLM (GPT-5.4 full flagship) | $105.00 | Input $60 at list price, $33 net after a ~60% cache hit at a 75% discount ($24 uncached + $9 cached); output $72; total $105 | openai.com/api/docs/pricing |
| TTS (ElevenLabs Flash v2.5) | $70.00 | 175 chars × 8000 × $50/M | elevenlabs.io/pricing/api |
| Telephony (Plivo India) | $40 | $0.005/min × 8000 | Plivo India SIP pricing |
| Total | $253.40 |
Why so expensive? 41% of the bill is the GPT-5.4 LLM line ($105). GPT-5.4 at $15/1M output is the flagship OpenAI model, charging 19× more per output token than Llama 3.3 70B on Groq. If you swap GPT-5.4 → GPT-5.4-mini, the LLM line drops to ~$30 and Premium English becomes ~$178.
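
The cache-discount arithmetic behind the $105 line, and the two ratios quoted in the paragraph above, can be checked as follows. This is a sketch under the table's stated assumptions (a ~60% cache hit rate and a 75% discount on cached input); `cached_input_cost` is an illustrative helper name.

```python
def cached_input_cost(list_cost, cache_hit_rate, cache_discount):
    """Input cost after prompt caching: cached tokens are billed at a discount."""
    uncached = list_cost * (1 - cache_hit_rate)
    cached = list_cost * cache_hit_rate * (1 - cache_discount)
    return uncached + cached


input_net = cached_input_cost(60.0, cache_hit_rate=0.60, cache_discount=0.75)
llm_line = input_net + 72.0            # plus output cost from the table

share_of_bill = llm_line / 253.40      # Premium English total
output_price_ratio = 15.00 / 0.79      # GPT-5.4 vs Llama 3.3 70B on Groq, $/1M out

print(f"input net ${input_net:.2f}, LLM line ${llm_line:.2f}")
print(f"share of bill {share_of_bill:.0%}, output price ratio {output_price_ratio:.0f}x")
```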
Gemini Live e2e — $140/mo
| Component | $/mo | Calc | Source |
|---|---|---|---|
| Audio-native LLM (Gemini 3.1 Flash Live) | $100 | $0.005/min audio in (8000) + $0.018/min audio out (3333) | ai.google.dev/gemini-api/docs/pricing |
| Telephony (Plivo India) | $40 | $0.005/min × 8000 | Plivo India SIP pricing |
| Total | $140 |
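
The audio-native line is billed per minute of audio rather than per token, so the check is a two-term sum. The sketch takes the 8000 input minutes and 3333 output minutes from the table as given; the constant names are illustrative.

```python
AUDIO_IN_RATE = 0.005    # $/min of audio in
AUDIO_OUT_RATE = 0.018   # $/min of audio out
AUDIO_IN_MINUTES = 8000
AUDIO_OUT_MINUTES = 3333
TELEPHONY_RATE = 0.005   # $/min, Plivo India

model_cost = AUDIO_IN_RATE * AUDIO_IN_MINUTES + AUDIO_OUT_RATE * AUDIO_OUT_MINUTES
telephony_cost = TELEPHONY_RATE * AUDIO_IN_MINUTES
total = model_cost + telephony_cost

print(f"model ${model_cost:.2f}, telephony ${telephony_cost:.2f}, total ${total:.2f}")
```

The model line comes out a fraction of a cent under $100 because 3333 minutes is itself a rounded figure.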
Vapi managed — $1,120/mo
| Component | $/mo | Calc | Source |
|---|---|---|---|
| Platform fee (Vapi orchestration) | $400 | $0.05/min × 8000 | vapi.ai/pricing |
| LLM (GPT-4o-mini at Vapi bundled rates) | $400 | ~$0.05/min × 8000 | vapi pricing breakdown |
| STT (Deepgram via Vapi) | $40 | ~$0.005/min × 8000 | Vapi-bundled Deepgram rate |
| TTS (11Labs Turbo at Vapi bundled) | $240 | ~$0.030/min × 8000 | Vapi-bundled 11Labs rate |
| Telephony (Plivo India) | $40 | $0.005/min × 8000 | Plivo India SIP pricing |
| Total (cheapest config) | $1,120 | Published $0.15/min cheapest stack × 8000 = $1,200; minor variance is telephony |
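
Summing the per-minute rates in the table makes the relationship between the decomposed total and the published figure explicit. This is a sketch only: the per-component rates are this audit's own estimates, not Vapi-published numbers, and the dictionary keys are illustrative.

```python
MINUTES = 8000

# Per-minute rates from the table above (estimates, except telephony).
RATES_PER_MIN = {
    "platform": 0.05,
    "llm": 0.05,
    "stt": 0.005,
    "tts": 0.030,
    "telephony": 0.005,
}

decomposed_per_min = sum(RATES_PER_MIN.values())
decomposed_monthly = decomposed_per_min * MINUTES
published_monthly = 0.15 * MINUTES
gap = published_monthly - decomposed_monthly

print(f"decomposed ${decomposed_per_min:.3f}/min = ${decomposed_monthly:.0f}/mo")
print(f"published  $0.150/min = ${published_monthly:.0f}/mo, gap ${gap:.0f}")
```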
Premium Vapi configurations (GPT-4o + 11Labs Multilingual) reach $0.30-$0.40/min = $2,400-$3,200/mo. Source: pxlpeak.com/blog/ai-tools/vapi-pricing-breakdown and multiple corroborating 2026 cost-breakdown blogs.
Sections 7-8 — Latency + Cost Optimization (generic)
These sections contain general engineering recommendations (stream every stage, persistent connections, prompt caching, region co-location, VAD tuning) without specific numbers tied to our stack. The patterns named are industry-standard and verifiable via any voice-agent engineering reference. No quantitative claims to audit individually.
Section 9 — Conclusion
The conclusion section makes no new quantitative claims; it summarizes the tradeoff framework established by the verified claims above.
Chart-by-chart citation summary
Chart 1 — Latency waterfall (cumulative ms)
All 6 stage values cited above in Section 1. Total 1,100 ms is the sum.
Chart 2 — STT cost vs Hindi WER (scatter)
10 providers, each with cited $/min and Hindi WER. Marker shape distinguishes vendor self-reported (circle), third-party benchmark (diamond), and vendor-unreplicated (hollow ring — ElevenLabs Scribe v2 3.1%).
Chart 3 — Language coverage matrix (heatmap)
Provider × language tier matrix. Each cell reflects vendor language support docs:
- Sarvam Saaras / Bulbul: sarvam.ai/blogs/asr, Bulbul docs — 22 official Indian languages.
- Smallest.ai Lightning V3.1: 15 languages, 7 Indian (no Bengali, Punjabi): smallest.ai/blog/introducing-lightning-v3.
- Deepgram Aura-2: 7 languages, no Indic: developers.deepgram.com/docs/tts-models.
- Parakeet TDT v3: 25 European languages, no Hindi: huggingface.co/nvidia/parakeet-tdt-0.6b-v3.
- ElevenLabs Scribe v2: 90+ languages, 11 Indian: elevenlabs.io/realtime-speech-to-text.
Chart 4 — LLM intelligence vs cost (bubble)
12 voice-agent-relevant LLM variants. All TTFT and Intelligence Index values from artificialanalysis.ai/leaderboards/models (May 2026 snapshot). Pricing from each vendor's official pricing page.
Chart 5 — Context window vs blended price
Same model set as Chart 4. Context windows from vendor docs; blended price computed as 0.75 × input + 0.25 × output (voice-agent-heavy input weighting).
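
The blended-price formula is simple enough to state as a function. The sketch below applies it to two models whose input and output prices both appear in Section 6 of this audit; `blended_price` is an illustrative name.

```python
def blended_price(input_per_m, output_per_m, input_weight=0.75):
    """Blended $/1M tokens: 0.75 x input + 0.25 x output by default."""
    return input_weight * input_per_m + (1 - input_weight) * output_per_m


gemini_25_flash = blended_price(0.30, 2.50)   # prices from the Production Indic table
llama_33_groq = blended_price(0.59, 0.79)     # prices from the Cheap English table

print(f"Gemini 2.5 Flash: ${gemini_25_flash:.2f}/1M blended")
print(f"Llama 3.3 70B (Groq): ${llama_33_groq:.2f}/1M blended")
```

Note that the Section 6 scenario (3000 input and 600 output tokens per call) is even more input-heavy than the 0.75 weighting, so passing `input_weight=3000 / 3600` would give a scenario-specific blend.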
Chart 6 — TTS price vs TTFB (scatter)
10 TTS providers. Prices from each vendor's pricing page; TTFB values mostly vendor-published with a few independent benchmarks (Async, Vexyl India). Sarvam Bulbul v3's 300 ms is a production observation since no vendor cloud number is published.
Chart 7 — Four-axis radar
5 canonical stacks normalized 0-1 on Cost / Latency / Intelligence / Language. Editorial synthesis of citations in Sections 2-6 of this audit. Specific values + defenses in the Section 5 table above.
Chart 8 — Monthly cost @ 8k voice-min
5 stacks, all on Indian telephony for fair comparison. Per-component costs all sourced in Section 6 of this audit above.
Chart 10 — Per-call cost breakdown
Derived from Chart 8's "Production Indic" stack ÷ 8,000 calls. Pure arithmetic — verifiable by summing components.
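
A minimal sketch of that division, using the Production Indic component totals from Section 6 (the dictionary keys are illustrative):

```python
CALLS = 8000

# Monthly component costs in dollars, from the Production Indic table.
MONTHLY = {"stt": 47.20, "llm": 19.20, "tts": 50.40, "telephony": 40.00}

per_call = {name: cost / CALLS for name, cost in MONTHLY.items()}
per_call_total = sum(per_call.values())

for name, cost in per_call.items():
    print(f"{name:10s} ${cost:.4f}")
print(f"{'total':10s} ${per_call_total:.4f}")
```

At these rates a one-minute call costs just under two cents, with TTS and STT as the two largest components.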
Known caveats and weaknesses
- ElevenLabs Scribe v2 3.1% Hindi WER — no third-party replication found as of capture. The blog chart explicitly tags this as "vendor, unreplicated." If a future independent benchmark contradicts this, the chart's most striking data point becomes wrong.
- Sarvam Bulbul v3 cloud TTFB — vendor doesn't publish a cloud TTFB number. The 300 ms used in the blog is a production observation consistent with the 260 ms edge claim but not third-party verified.
- Vapi cost decomposition — Vapi doesn't publish a per-component breakdown of bundled pricing. Our $400 platform + $400 LLM + $40 STT + $240 TTS decomposition was reverse-engineered from the published $0.15/min cheapest-stack figure. The decomposition is an informed guess; only the total is anchored to a published number ($0.15/min × 8,000 = $1,200/mo, against our $1,120).
- "Cheap English" Self-host VM at $40 — depends on actual VPS choice. DigitalOcean $40/mo droplet runs Whisper.cpp + Piper for our scenario, but for sub-300ms STT latency a GPU instance ($100-200/mo) may be needed. The chart number reflects CPU-only deployment.
- Premium English LLM line ($105) — defensible only if you intend "premium" = "flagship OpenAI." Many production English voice agents use GPT-5.4-mini ($0.75/$4.50), which would drop Premium English to ~$178. The blog's $253 represents the upper edge of "premium" stack cost.
- Gemini Flash TTFT 590 ms — Artificial Analysis median for Google AI Studio. Google Vertex AI shows higher (780 ms). With aggressive implicit caching, observed production TTFT can be lower. We use the AA median to avoid optimistic claims.
Source index (all URLs cited)
Vendor pricing pages
- Sarvam: sarvam.ai/api-pricing
- Google Gemini: ai.google.dev/gemini-api/docs/pricing
- OpenAI: developers.openai.com/api/docs/pricing
- Anthropic: platform.claude.com/docs/en/about-claude/pricing
- DeepSeek: api-docs.deepseek.com/quick_start/pricing
- Groq: groq.com/pricing
- Deepgram: deepgram.com/pricing
- AssemblyAI: assemblyai.com/pricing
- ElevenLabs: elevenlabs.io/pricing/api
- Cartesia: cartesia.ai/pricing
- Inworld: inworld.ai/pricing
- Plivo India SIP: plivo.com/sip-trunking/pricing/in
- Twilio India SIP: twilio.com/sip-trunking/pricing/in
- Vapi: vapi.ai/pricing
- Bolna: bolna.ai/pricing
Benchmarks and rankings
- Artificial Analysis (LLM leaderboard): artificialanalysis.ai/leaderboards/models
- Artificial Analysis (Gemini 2.5 Flash providers): artificialanalysis.ai/models/gemini-2-5-flash/providers
- Artificial Analysis (TTS leaderboard): artificialanalysis.ai/text-to-speech/leaderboard
- Async TTS latency benchmark: async.com/blog/tts-latency-vs-quality-benchmark
- Vexyl India TTS test: vexyl.ai/elevenlabs-tts-latency-test-2026-real-world-results
- AI4Bharat Vistaar (Hindi STT): github.com/AI4Bharat/vistaar
- Voxtral paper: arxiv.org/html/2507.13264v1
Vendor blogs / launches
- Sarvam Saaras V3: sarvam.ai/blogs/asr
- Sarvam Bulbul V3: sarvam.ai/blogs/bulbul-v3
- Sarvam Edge: sarvam.ai/blogs/sarvam-edge
- Deepgram engineering / Aura-2: deepgram.com/learn/engineering-real-time
- Inworld TTS-1.5: inworld.ai/blog/introducing-inworld-tts-1-5
- ElevenLabs Scribe v2 Hindi: elevenlabs.io/speech-to-text/hindi
- ElevenLabs models: elevenlabs.io/docs/overview/models
- Cartesia TTS models: docs.cartesia.ai/build-with-cartesia/tts-models/latest
- Anthropic prompt caching: platform.claude.com/docs/en/build-with-claude/prompt-caching
- NVIDIA Parakeet TDT v3: huggingface.co/nvidia/parakeet-tdt-0.6b-v3
- AI4Bharat IndicConformer: huggingface.co/ai4bharat/indic-conformer-600m-multilingual
- Vapi pricing breakdown 2026: pxlpeak.com/blog/ai-tools/vapi-pricing-breakdown
- Vapi Series B announcement: techcrunch.com — Vapi $500M valuation
- Air AI FTC settlement: FTC — Air AI settlement
- PlayAI Meta acquisition: techcrunch.com — Meta acquires PlayAI
OSS projects
- Whisper.cpp: github.com/ggml-org/whisper.cpp
- Piper TTS: github.com/rhasspy/piper
- AssemblyAI self-host Whisper guide: assemblyai.com/blog/self-hosting-whisper
- OpenAI Whisper v3 discussion: github.com/openai/whisper/discussions/1762
This audit verifies every numerical or factual claim in the series against publicly accessible sources captured 2026-05-24. If you spot a citation that no longer loads or a number that has shifted since capture, let us know — the charts regenerate from source data.