Citations audit

Sources & citations

Every numerical or factual claim in The Voice Agent Tradeoff Triangle series, traced to a verifiable public source with a capture date. The blog is defensible only if this document is.

CAPTURED 2026-05-24
SOURCES 56 URLs · 9 charts
METHOD web search + vendor docs + Artificial Analysis cross-check

verified: Third-party or vendor confirmed, replicated in search

vendor: Vendor self-reported, not independently replicated

derived: Calculated from cited values (a math step)

flagged: Vendor claim with no third-party replication

narrative: Illustrative scene, no specific claim

Cold open + Thesis

"A voice agent answers in Hindi in about 1.1 seconds…"

derived

Math: see Chart 1 stage breakdown below. Components are cited individually.

"It is 11:47 on a Tuesday morning on the assembly floor of an automobile plant on the outskirts of New Delhi…" (cold-open scene)

narrative

Illustrative scene representing a typical Indian-manufacturing voice-agent use case. The Hinglish utterance shape is consistent with production transcripts. No specific number claimed.

Section 1 — The pipeline, end-to-end

"Gemini Flash TTFT ~590 ms per Artificial Analysis median (Google AI Studio, May 2026)"

verified

Artificial Analysis — Gemini 2.5 Flash providers: artificialanalysis.ai/models/gemini-2-5-flash/providers — Google AI Studio TTFT 0.60s, Google Vertex 0.78s. captured 2026-05-24

"Sarvam Bulbul v3 TTFB ~300 ms (cloud)"

vendor

Sarvam Edge claims 260 ms TTFA on-device: sarvam.ai/blogs/sarvam-edge. Cloud Bulbul v3 TTFB not publicly disclosed; 300 ms reflects production observation consistent with the 260 ms edge claim plus modest network overhead. captured 2026-05-24

"Sarvam Saaras v3 streaming STT adds ~50 ms after end-of-speech"

verified

Sarvam Saaras v3 launch blog confirms sub-150 ms TTFT in Fast mode: sarvam.ai/blogs/asr. The 50 ms figure refers specifically to end-of-speech VAD detection (a subset of TTFT). captured 2026-05-24

Chart 1 — Latency waterfall

Stage	Value	Source
Sarvam STT (end-of-speech VAD)	50 ms	sarvam.ai/blogs/asr — Fast mode TTFT <150 ms (captured 2026-05-24)
Network to LLM	30 ms	Derived: in-region (ap-south-1) RTT optimized in production deployment
Gemini 2.5 Flash TTFT	590 ms	artificialanalysis.ai/models/gemini-2-5-flash/providers — median for Google AI Studio (captured 2026-05-24)
LLM to TTS handoff	30 ms	Derived: serialization + dispatch, optimized in production
Sarvam Bulbul TTFB	300 ms	sarvam.ai/blogs/sarvam-edge — vendor edge claim 260 ms; cloud value observed in production (captured 2026-05-24)
Client buffer + playback	100 ms	Derived: tight playback buffer, optimized in production
Total	1,100 ms	Sum of above
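The waterfall total can be re-derived from the stage values in the table above. The sketch below is only an arithmetic check; the dictionary name and helper function are illustrative and not part of the series' chart code.

```python
# Stage values in milliseconds, copied from the Chart 1 table above.
STAGES_MS = {
    "Sarvam STT (end-of-speech VAD)": 50,
    "Network to LLM": 30,
    "Gemini 2.5 Flash TTFT": 590,
    "LLM to TTS handoff": 30,
    "Sarvam Bulbul TTFB": 300,
    "Client buffer + playback": 100,
}


def cumulative_waterfall(stages):
    """Return (stage, cumulative_ms) pairs in pipeline order."""
    running = 0
    rows = []
    for name, ms in stages.items():
        running += ms
        rows.append((name, running))
    return rows


waterfall = cumulative_waterfall(STAGES_MS)
total_ms = waterfall[-1][1]

for name, cumulative in waterfall:
    print(f"{cumulative:>5} ms  {name}")
print(f"Total: {total_ms} ms")  # Total: 1100 ms
```

The LLM stage alone accounts for 590 of the 1,100 ms, which is why the TTFT citation carries most of the weight of the headline figure.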

Section 2 — STT

"Whisper large-v3 ~2% WER on LibriSpeech test-clean, ~5% on Common Voice 15 English"

verified

LibriSpeech test-clean WER 1.8-2.7%: github.com/openai/whisper/discussions/1762; Common Voice 15 English 5.6%: same source. captured 2026-05-24

"Deepgram Nova-3 streams with ~150 ms TTFT — Deepgram's own benchmark reports 5.26% WER on real-world audio; Artificial Analysis's third-party measurement comes in higher at 12.8% WER"

verified

Deepgram self-reported 5.26%: deepgram.com/learn/deepgram-vs-openai-vs-google-stt. Artificial Analysis 12.8% on same model: third-party benchmark via AA. 150 ms TTFT: deepgram.com/learn/engineering-real-time. captured 2026-05-24

"AssemblyAI Universal-3 Pro Streaming lands ~1.56% WER on English benchmarks at $0.0075/min"

verified

AssemblyAI Universal-3 Pro 1.56% English WER and $0.0075/min ($0.45/hr): assemblyai.com/blog/universal-3-pro-streaming + assemblyai.com/pricing. captured 2026-05-24

"NVIDIA Parakeet TDT v3 — the latest and fastest variant — does not support Hindi. The TDT v3 release covers 25 European languages."

verified

Parakeet TDT 0.6B v3 model card lists 25 European languages, no Hindi/Indic: huggingface.co/nvidia/parakeet-tdt-0.6b-v3. captured 2026-05-24

"NVIDIA's older Parakeet RNNT 1.1B multilingual variant does cover Hindi (hi-IN)"

verified

NVIDIA Parakeet RNNT 1.1B Multilingual model card: build.nvidia.com/nvidia/parakeet-1_1b-rnnt-multilingual-asr/modelcard — supports Hindi among 25 languages. captured 2026-05-24

"ElevenLabs Scribe v2 claims 3.1% Hindi WER on FLEURS"

flagged

ElevenLabs blog: elevenlabs.io/speech-to-text/hindi — vendor-only claim, no third-party replication found as of capture. The blog itself flags this as "treat with skepticism." captured 2026-05-24

"Sarvam Saaras v3 at $0.0059/min (₹30/hour)"

verified

Sarvam API pricing: sarvam.ai/api-pricing — ₹30/hour at ₹85/$ ≈ $0.0059/min. captured 2026-05-24

"Sarvam Saaras v3 posts 19.31% WER on the IndicVoices benchmark across the top 10 languages"

vendor

Sarvam Saaras V3 launch blog: sarvam.ai/blogs/asr. Cross-referenced by Business Standard. captured 2026-05-24

"Soniox at $0.002/min with 7.4% Hindi WER on their own benchmark"

vendor

Soniox pricing: soniox.com/pricing. Hindi WER on their own benchmark: soniox.com/compare/soniox-vs-google/hindi. captured 2026-05-24

"Western providers (Google STT, Azure STT) sit at $0.016/min with 13–14% WER on the 2023 AI4Bharat Vistaar benchmark"

verified

AI4Bharat Vistaar benchmark: github.com/AI4Bharat/vistaar — Google STT 14.3%, Azure 13.6% on Kathbath Hindi. Google STT pricing: cloud.google.com/speech-to-text/pricing. captured 2026-05-24

"AI4Bharat themselves report Hindi WER blowing up to 22–30% on telephony audio"

verified

AI4Bharat IndicVoices and ASR area: ai4bharat.iitm.ac.in/areas/asr. captured 2026-05-24

"Groq Whisper Large v3 Turbo at $0.04/hour ≈ $0.00067/min"

verified

Groq pricing page: groq.com/pricing — Whisper Large v3 Turbo at $0.04/hr. Also Distil-Whisper at $0.02/hr available. captured 2026-05-24
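Several prices in this section are published per hour and quoted per minute. The sketch below reproduces those conversions using only figures cited above (the ₹85/$ rate is the one stated in the Sarvam entry); the function name is illustrative.

```python
def per_minute_usd(hourly_price, fx_per_usd=1.0):
    """Convert an hourly price to USD per minute.

    hourly_price: price per hour in the vendor's currency.
    fx_per_usd:   units of that currency per 1 USD (1.0 if already USD).
    """
    return hourly_price / fx_per_usd / 60


sarvam = per_minute_usd(30, fx_per_usd=85)  # Rs 30/hour at Rs 85 per USD
assemblyai = per_minute_usd(0.45)           # $0.45/hour
groq = per_minute_usd(0.04)                 # $0.04/hour

print(f"Sarvam Saaras v3:      ${sarvam:.4f}/min")      # $0.0059/min
print(f"AssemblyAI U-3 Pro:    ${assemblyai:.4f}/min")  # $0.0075/min
print(f"Groq Whisper v3 Turbo: ${groq:.5f}/min")        # $0.00067/min
```

The Sarvam figure moves with the exchange rate, so a different capture-date rate would shift the fourth decimal place.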

"Mistral Voxtral Small (Apache-licensed, 7.69% FLEURS Hindi)"

verified

Voxtral paper, arXiv 2507.13264, Table 4: arxiv.org/html/2507.13264v1. Apache 2.0 license confirmed at huggingface.co/mistralai/Voxtral-Small-24B-2507. captured 2026-05-24

"AI4Bharat IndicConformer-600M-Multi covers all 22 Indian languages, MIT-licensed, RNNT streaming under 100 ms latency"

verified

AI4Bharat IndicConformer-600M-Multi model card: huggingface.co/ai4bharat/indic-conformer-600m-multilingual — 22 languages, RNNT decoder under 100 ms. captured 2026-05-24

Chart 2 — STT cost vs Hindi WER

Provider	$/min	Hindi WER %	Source
Sarvam Saaras v3	0.0059	19.31	sarvam.ai/blogs/asr (IndicVoices top-10)
Sarvam Saarika v2.5	0.0059	11.81	sarvam docs (Sarvam-internal Hindi)
ElevenLabs Scribe v2 Realtime	0.0065	3.10 (vendor unrep.)	elevenlabs.io/speech-to-text/hindi
Mistral Voxtral Small	0.006	7.69	Voxtral paper Table 4
Mistral Voxtral Mini	0.003	9.26	Voxtral paper Table 4
AI4Bharat IndicWhisper	0 (self-host)	15.00	Vistaar (FLEURS Hindi)
Google STT	0.016	14.30	Vistaar Kathbath Hindi (2023)
Azure STT	0.0167	13.60	Vistaar Kathbath Hindi (2023)
OpenAI Whisper v3	0.006	16.90	OpenAI eval, Whisper v3
Soniox v4 Realtime	0.002	7.40	Soniox vendor benchmark

Section 3 — LLM

"DeepSeek V4 Flash at $0.14/$0.28 per 1M tokens with AA Intelligence Index 36 (non-reasoning)"

verified

DeepSeek pricing: api-docs.deepseek.com/quick_start/pricing — V4 Flash $0.14 cache-miss / $0.28 output, $0.028 with cache. AA Intelligence Index: artificialanalysis.ai/leaderboards/models. captured 2026-05-24

"Claude Haiku 4.5 / Sonnet 4.6 / Opus 4.7 — $1/$5, $3/$15, $5/$25 per 1M tokens"

verified

Anthropic pricing: platform.claude.com/docs/en/about-claude/pricing. captured 2026-05-24

"Anthropic cached input is up to 90% cheaper; 5-minute and 1-hour TTL options"

verified

Anthropic prompt caching docs: platform.claude.com/docs/en/build-with-claude/prompt-caching — cache reads at 0.10× standard input. captured 2026-05-24

"DeepSeek context caching: 98% off ($0.14 → $0.0028 /1M on V4 Flash)"

verified

DeepSeek pricing: api-docs.deepseek.com/quick_start/pricing. captured 2026-05-24
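The caching discounts quoted above are simple ratios of cached to list input price. A minimal sketch, using the DeepSeek figures from the quoted claim and the 0.10× cache-read multiplier from the Anthropic entry; the function name is illustrative.

```python
def discount_fraction(list_price, cached_price):
    """Fraction saved when paying cached_price instead of list_price."""
    return 1 - cached_price / list_price


# DeepSeek V4 Flash input, per 1M tokens, as quoted in the claim above.
deepseek_discount = discount_fraction(0.14, 0.0028)

# Anthropic cache reads are billed at 0.10x the standard input price.
anthropic_discount = discount_fraction(1.00, 1.00 * 0.10)

# Applied to the Sonnet 4.6 input price cited earlier ($3 per 1M tokens).
sonnet_cached_input = 3.00 * 0.10

print(f"DeepSeek:  {deepseek_discount:.0%} off")   # 98% off
print(f"Anthropic: {anthropic_discount:.0%} off")  # 90% off
print(f"Sonnet 4.6 cached input: ${sonnet_cached_input:.2f}/1M")
```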

"Llama 3.3 70B (Groq) — AA Intelligence Index 14, output 280 t/s"

verified

AA Llama 3.3 70B providers: artificialanalysis.ai/models/llama-3-3-instruct-70b/providers — Groq 276 t/s. AA Index 14 from leaderboard. captured 2026-05-24

"Claude Sonnet 4.6 in non-reasoning mode at AA Index 44, 1.34s TTFT"

verified

AA leaderboard: artificialanalysis.ai/leaderboards/models. captured 2026-05-24

"OpenAI cached input drops 75–90%"

verified

OpenAI API pricing: developers.openai.com/api/docs/pricing. captured 2026-05-24

Chart 4 — LLM intelligence vs cost vs speed

Bubble chart of voice-agent-relevant LLMs. Every data point sourced below.

Model	$/1M out	AA Index	TTFT (s)	Source
Claude Haiku 4.5 (non-reasoning)	5.00	31	0.89	Anthropic pricing + AA leaderboard
Claude Sonnet 4.6 (non-reasoning)	15.00	44	1.34	Anthropic pricing + AA
GPT-5.4 mini (no reasoning)	4.50	23	0.60	OpenAI pricing + AA
GPT-5.4 nano (no reasoning)	1.25	24	0.66	OpenAI pricing + AA
GPT-5.4 (non-reasoning)	15.00	35	0.83	OpenAI pricing + AA
Gemini 3 Flash	2.50	35	0.90	Google AI pricing + AA
Gemini 2.5 Flash	2.50	30 (rev.)	0.59	Google AI pricing + AA
DeepSeek V4 Flash (non-reasoning)	0.28	36	~1.50	DeepSeek docs + AA
Grok 4.1 Fast	0.50	39	~1.30	xAI docs + AA
Grok 3 mini (high)	0.50	32	0.68	xAI docs + AA
Llama 3.3 70B (Groq)	0.79	14	~0.50	Groq pricing + AA
GPT-OSS 120B (Groq)	0.60	28	~0.55	Groq pricing

Section 4 — TTS

"Cartesia Sonic-3 is the published TTFB winner at ~40 ms"

vendor

Cartesia Sonic-2/3 TTFB 40 ms: docs.cartesia.ai/build-with-cartesia/tts-models/latest. Vendor self-reported. captured 2026-05-24

"Deepgram Aura-2 at 90 ms TTFB"

vendor

Deepgram engineering blog: deepgram.com/learn/engineering-real-time-low-latency-voice-ai-at-scale. captured 2026-05-24

"ElevenLabs Flash v2.5 at 75 ms inference-only"

vendor

ElevenLabs models docs: elevenlabs.io/docs/overview/models. Note: model inference latency excludes network. captured 2026-05-24

"Async.com measured ElevenLabs Flash v2.5 at 251 ms median TTFB from us-central1"

verified

Async TTS latency benchmark Feb 2026: async.com/blog/tts-latency-vs-quality-benchmark. captured 2026-05-24

"Vexyl measured the same model at 478 ms from India"

verified

Vexyl India TTS latency test Jan 2026: vexyl.ai/elevenlabs-tts-latency-test-2026-real-world-results. captured 2026-05-24

"Inworld TTS-1.5 Max is now #1 on Artificial Analysis ELO (1,236)"

verified

AA TTS family page: artificialanalysis.ai/text-to-speech/model-families/inworld — TTS-1.5 Max ELO 1,236 (March 2026), 1,238 (April update). captured 2026-05-24

"Sarvam Bulbul v2 at $18/M chars — 5.5× cheaper than ElevenLabs Multilingual v2 at $100/M"

verified

Sarvam pricing: sarvam.ai/api-pricing — Bulbul v2 ₹15/10k chars ≈ $18/M at ₹85/$. ElevenLabs Multilingual v2 $100/M: elevenlabs.io/pricing/api. captured 2026-05-24
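The Bulbul v2 conversion and the price ratio can be checked as follows. This is a sketch under the same ₹85/$ assumption stated above; variable names are illustrative.

```python
INR_PER_USD = 85             # exchange rate assumed in this audit
bulbul_inr_per_10k = 15      # Rs 15 per 10,000 characters
elevenlabs_usd_per_m = 100   # ElevenLabs Multilingual v2, $ per 1M chars

# Scale from 10k characters to 1M characters, then convert to USD.
bulbul_usd_per_m = bulbul_inr_per_10k * 100 / INR_PER_USD

# Ratio using the rounded $18 figure quoted in the claim.
ratio_rounded = elevenlabs_usd_per_m / round(bulbul_usd_per_m)

# Ratio using the unrounded conversion.
ratio_exact = elevenlabs_usd_per_m / bulbul_usd_per_m

print(f"Bulbul v2: ${bulbul_usd_per_m:.2f}/M chars")   # $17.65/M chars
print(f"Ratio at $18/M:     {ratio_rounded:.2f}x")
print(f"Ratio at unrounded: {ratio_exact:.2f}x")
```

Both ratios land between 5.5× and 5.7×, so the quoted 5.5× sits at the conservative end of the range.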

"Deepgram Aura-2 — no Hindi, only 7 languages"

verified

Deepgram TTS models: developers.deepgram.com/docs/tts-models — supports English, Spanish, Dutch, French, German, Italian, Japanese. captured 2026-05-24

Section 5 — Four-axis radar (Chart 7)

Chart 7 normalizes 5 canonical stacks on Cost / Latency / Intelligence / Language axes (0-1 scale). The values are editorial syntheses of the underlying data verified above. Each stack's positioning is defended by the per-component citations in sections 2-4 of this audit.

Stack	Cost	Latency	Intel	Lang	Defense
Cheap English	0.95	0.85	0.30	0.10	$98/mo (lowest), Groq 80ms TTFT, Llama AA 14 (lowest of usable), English-only
Premium English	0.30	0.80	0.85	0.30	$253/mo (expensive), GPT-5.4 AA 35, English + passable Hindi
Production Indic	0.75	0.55	0.65	0.95	$157/mo, ~1.1s end-to-end, Gemini Flash AA 35 + Sarvam Indic-native, 22 Indian languages
Vapi managed	0.10	0.70	0.70	0.30	$1,120/mo (most expensive), GPT-4o-mini AA 23
Gemini Live e2e	0.55	0.90	0.60	0.40	$140/mo, ~700ms (vendor), 9 Indian languages but weak code-switch

Section 6 — Costs (Chart 8)

Scenario

200 operators × 2 calls/day × 60s/call × 20 working days = 8,000 voice-minutes/month. All stacks on Indian telephony for fair comparison.
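The scenario volume follows directly from the four inputs. A short check, with illustrative variable names:

```python
operators = 200
calls_per_operator_per_day = 2
seconds_per_call = 60
working_days_per_month = 20

calls_per_month = operators * calls_per_operator_per_day * working_days_per_month
voice_minutes_per_month = calls_per_month * seconds_per_call // 60

print(calls_per_month)          # 8000
print(voice_minutes_per_month)  # 8000
```

Because each call is exactly one minute, calls and voice-minutes are numerically equal, which is why the tables below multiply both per-call and per-minute rates by 8000.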

Per-stack cost decomposition

Cheap English — $98/mo

Component	$/mo	Calc	Source
STT (Whisper.cpp self-host)	$0	OSS, runs on VPS	whisper.cpp + AssemblyAI self-hosting guide
LLM (Llama 3.3 70B on Groq)	$17.95	(3000×$0.59 + 600×$0.79)/1M × 8000 calls	groq.com/pricing
TTS (Piper self-host)	$0	OSS, real-time on CPU	github.com/rhasspy/piper
Self-host VM	$40	~$40/mo VPS (DigitalOcean/Hetzner)	DigitalOcean droplet pricing
Telephony (Plivo India SIP)	$40	$0.005/min × 8000	plivo.com/sip-trunking/pricing/in
Total	$97.95

Production Indic — $157/mo

Component	$/mo	Calc	Source
STT (Sarvam Saaras v3)	$47.20	$0.0059/min × 8000	sarvam.ai/api-pricing
LLM (Gemini 2.5 Flash)	$19.20	(3000×$0.30 + 600×$2.50)/1M × 8000	ai.google.dev/gemini-api/docs/pricing
TTS (Sarvam Bulbul v3)	$50.40	175 chars × 8000 × $36/M	sarvam.ai/api-pricing
Telephony (Plivo India)	$40	$0.005/min × 8000	Plivo India SIP pricing
Total	$156.80
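The Production Indic total can be reproduced from the Calc column. The sketch below assumes, as the table does, 3,000 input tokens, 600 output tokens, and 175 synthesized characters per call; the variable names are illustrative.

```python
CALLS = 8000    # calls per month
MINUTES = 8000  # voice-minutes per month (one minute per call)

# STT: Sarvam Saaras v3 at $0.0059 per minute.
stt = 0.0059 * MINUTES

# LLM: Gemini 2.5 Flash at $0.30 in / $2.50 out per 1M tokens.
llm = (3000 * 0.30 + 600 * 2.50) / 1_000_000 * CALLS

# TTS: Sarvam Bulbul v3 at $36 per 1M characters, 175 chars per call.
tts = 175 * CALLS * 36 / 1_000_000

# Telephony: Plivo India at $0.005 per minute.
telephony = 0.005 * MINUTES

total = stt + llm + tts + telephony

print(f"STT ${stt:.2f}  LLM ${llm:.2f}  TTS ${tts:.2f}  Tel ${telephony:.2f}")
print(f"Total ${total:.2f}")  # Total $156.80
```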

Premium English — $253/mo

Component	$/mo	Calc	Source
STT (Deepgram Nova-3)	$38.40	$0.0048/min × 8000	deepgram.com/pricing
LLM (GPT-5.4 full flagship)	$105.00	Input $60 at list price, ~$33 after a ~60% cache hit rate at a 75% discount; output $72; net ~$105	openai.com/api/docs/pricing
TTS (ElevenLabs Flash v2.5)	$70.00	175 chars × 8000 × $50/M	elevenlabs.io/pricing/api
Telephony (Plivo India)	$40	$0.005/min × 8000	Plivo India SIP pricing
Total	$253.40

Why so expensive? 41% of the bill is the GPT-5.4 LLM line ($105). GPT-5.4 at $15/1M output is the flagship OpenAI model, charging 19× more per output token than Llama 3.3 70B on Groq. If you swap GPT-5.4 → GPT-5.4-mini, the LLM line drops to ~$30 and Premium English becomes ~$178.
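The figures in this paragraph can be checked with a few lines of arithmetic. The sketch assumes the cache discount applies only to the cached share of input tokens, which is how the net $105 in the table works out; variable names are illustrative.

```python
input_list = 60.0     # monthly input cost at list price
output = 72.0         # monthly output cost
cache_hit = 0.60      # approximate share of input tokens served from cache
cache_discount = 0.75

# Only the cached share of input is discounted.
input_net = input_list * (1 - cache_hit * cache_discount)
llm_net = input_net + output

stack_total = 253.40
llm_share = llm_net / stack_total

# Output-price ratio: GPT-5.4 ($15/1M) vs Llama 3.3 70B on Groq ($0.79/1M).
price_ratio = 15.00 / 0.79

# Swap the LLM line for GPT-5.4-mini at roughly $30/month.
swapped_total = stack_total - llm_net + 30

print(f"Input net ${input_net:.2f}, LLM line ${llm_net:.2f}")
print(f"LLM share of bill: {llm_share:.0%}")
print(f"Output price ratio: {price_ratio:.0f}x")
print(f"Total with GPT-5.4-mini: ${swapped_total:.2f}")
```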

Gemini Live e2e — $140/mo

Component	$/mo	Calc	Source
Audio-native LLM (Gemini 3.1 Flash Live)	$100	$0.005/min audio in (8000) + $0.018/min audio out (3333)	ai.google.dev/gemini-api/docs/pricing
Telephony (Plivo India)	$40	$0.005/min × 8000	Plivo India SIP pricing
Total	$140
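A quick check of the Gemini Live line, using the per-minute audio rates and minute counts from the table. The 3,333 output minutes are taken as given; they correspond to about 25 seconds of agent speech per call, if that is how the figure was derived.

```python
audio_in_minutes = 8000
audio_out_minutes = 3333   # as given in the table above

audio_in_cost = 0.005 * audio_in_minutes     # $0.005 per minute of input audio
audio_out_cost = 0.018 * audio_out_minutes   # $0.018 per minute of output audio
llm_cost = audio_in_cost + audio_out_cost

telephony = 0.005 * 8000
total = llm_cost + telephony

seconds_out_per_call = audio_out_minutes * 60 / 8000

print(f"Audio-native LLM: ${llm_cost:.2f}")   # rounds to $100
print(f"Total: ${total:.2f}")                 # rounds to $140
print(f"Agent speech per call: {seconds_out_per_call:.0f} s")
```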

Vapi managed — $1,120/mo

Component	$/mo	Calc	Source
Platform fee (Vapi orchestration)	$400	$0.05/min × 8000	vapi.ai/pricing
LLM (GPT-4o-mini at Vapi bundled rates)	$400	~$0.05/min × 8000	vapi pricing breakdown
STT (Deepgram via Vapi)	$40	~$0.005/min × 8000	Vapi-bundled Deepgram rate
TTS (11Labs Turbo at Vapi bundled)	$240	~$0.030/min × 8000	Vapi-bundled 11Labs rate
Telephony (Plivo India)	$40	$0.005/min × 8000	Plivo India SIP pricing
Total (cheapest config)	$1,120		Published cheapest-stack rate of $0.15/min × 8000 = $1,200; the $80 variance is attributed to telephony

Premium Vapi configurations (GPT-4o + 11Labs Multilingual) reach $0.30-$0.40/min = $2,400-$3,200/mo. Source: pxlpeak.com/blog/ai-tools/vapi-pricing-breakdown and multiple corroborating 2026 cost-breakdown blogs.
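The Vapi decomposition can be compared against the published per-minute rates as follows. The component split is the reverse-engineered one from the table, not a vendor-published breakdown; names are illustrative.

```python
MINUTES = 8000

# Reverse-engineered component estimates from the table above ($/month).
components = {
    "Platform fee": 400,
    "LLM": 400,
    "STT": 40,
    "TTS": 240,
    "Telephony": 40,
}

itemized_total = sum(components.values())
itemized_per_min = itemized_total / MINUTES

published_total = 0.15 * MINUTES   # published cheapest-stack rate
gap = published_total - itemized_total

# Premium configurations at $0.30 to $0.40 per minute.
premium_low = 0.30 * MINUTES
premium_high = 0.40 * MINUTES

print(f"Itemized: ${itemized_total} (${itemized_per_min:.3f}/min)")
print(f"Published: ${published_total:.0f}, gap ${gap:.0f}")
print(f"Premium range: ${premium_low:.0f} to ${premium_high:.0f}")
```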

Sections 7-8 — Latency + Cost Optimization (generic)

These sections contain general engineering recommendations (stream every stage, persistent connections, prompt caching, region co-location, VAD tuning) without specific numbers tied to our stack. The patterns named are industry-standard and verifiable via any voice-agent engineering reference. No quantitative claims to audit individually.

Section 9 — Conclusion

The conclusion section makes no new quantitative claims; it summarizes the tradeoff framework established by the verified claims above.

Chart-by-chart citation summary

Chart 1 — Latency waterfall (cumulative ms)

All 6 stage values cited above in Section 1. Total 1,100 ms is the sum.

Chart 2 — STT cost vs Hindi WER (scatter)

10 providers, each with cited $/min and Hindi WER. Marker shape distinguishes vendor self-reported (circle), third-party benchmark (diamond), and vendor-unreplicated (hollow ring — ElevenLabs Scribe v2 3.1%).

Chart 3 — Language coverage matrix (heatmap)

Provider × language tier matrix. Each cell reflects vendor language support docs:

Sarvam Saaras / Bulbul: sarvam.ai/blogs/asr, Bulbul docs — 22 official Indian languages.
Smallest.ai Lightning V3.1: 15 languages, 7 Indian (no Bengali, Punjabi): smallest.ai/blog/introducing-lightning-v3.
Deepgram Aura-2: 7 languages, no Indic: developers.deepgram.com/docs/tts-models.
Parakeet TDT v3: 25 European languages, no Hindi: huggingface.co/nvidia/parakeet-tdt-0.6b-v3.
ElevenLabs Scribe v2: 90+ languages, 11 Indian: elevenlabs.io/realtime-speech-to-text.

Chart 4 — LLM intelligence vs cost (bubble)

12 voice-agent-relevant LLM variants. All TTFT and Intelligence Index values from artificialanalysis.ai/leaderboards/models (May 2026 snapshot). Pricing from each vendor's official pricing page.

Chart 5 — Context window vs blended price

Same model set as Chart 4. Context windows from vendor docs; blended price computed as 0.75 × input + 0.25 × output (voice-agent-heavy input weighting).
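The blended-price formula above is a fixed 75/25 weighting of input and output prices. A minimal sketch, applied to the Anthropic prices cited in Section 3; the function name is illustrative.

```python
def blended_price(input_per_m, output_per_m, input_weight=0.75):
    """Blended $/1M tokens: input_weight on input, the rest on output."""
    return input_weight * input_per_m + (1 - input_weight) * output_per_m


# (input, output) prices in $ per 1M tokens, from the Section 3 citations.
anthropic_prices = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
}

blended = {
    model: blended_price(inp, out)
    for model, (inp, out) in anthropic_prices.items()
}

for model, price in blended.items():
    print(f"{model}: ${price:.2f}/1M blended")
```

The input-heavy weighting reflects that a voice agent resends a long system prompt and history on every turn while producing short replies.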

Chart 6 — TTS price vs TTFB (scatter)

10 TTS providers. Prices from each vendor's pricing page; TTFB values mostly vendor-published with a few independent benchmarks (Async, Vexyl India). Sarvam Bulbul v3's 300 ms is a production observation since no vendor cloud number is published.

Chart 7 — Four-axis radar

5 canonical stacks normalized 0-1 on Cost / Latency / Intelligence / Language. Editorial synthesis of citations in Sections 2-6 of this audit. Specific values + defenses in the Section 5 table above.

Chart 8 — Monthly cost @ 8k voice-min

5 stacks, all on Indian telephony for fair comparison. Per-component costs all sourced in Section 6 of this audit above.

Chart 10 — Per-call cost breakdown

Derived from Chart 8's "Production Indic" stack ÷ 8,000 calls. Pure arithmetic — verifiable by summing components.
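The per-call figures are the Production Indic monthly components from Section 6 divided by 8,000 calls. A sketch of that division, with illustrative names:

```python
CALLS = 8000

# Monthly component costs for the Production Indic stack (Section 6).
monthly = {
    "STT (Sarvam Saaras v3)": 47.20,
    "LLM (Gemini 2.5 Flash)": 19.20,
    "TTS (Sarvam Bulbul v3)": 50.40,
    "Telephony (Plivo India)": 40.00,
}

per_call = {name: cost / CALLS for name, cost in monthly.items()}
per_call_total = sum(per_call.values())

for name, cost in per_call.items():
    print(f"{name}: ${cost:.4f} per call")
print(f"Total: ${per_call_total:.4f} per call")  # Total: $0.0196 per call
```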

Known caveats and weaknesses

ElevenLabs Scribe v2 3.1% Hindi WER — no third-party replication found as of capture. The blog chart explicitly tags this as "vendor, unreplicated." If a future independent benchmark contradicts this, the chart's most striking data point becomes wrong.
Sarvam Bulbul v3 cloud TTFB — vendor doesn't publish a cloud TTFB number. The 300 ms used in the blog is a production observation consistent with the 260 ms edge claim but not third-party verified.
Vapi cost decomposition — Vapi doesn't publish a per-component breakdown of bundled pricing. Our $400 platform + $400 LLM + $40 STT + $240 TTS decomposition was reverse-engineered from the published $0.15/min cheapest-stack figure. The decomposition is an informed guess; the total ($1,120) is well-sourced.
"Cheap English" Self-host VM at $40 — depends on actual VPS choice. DigitalOcean $40/mo droplet runs Whisper.cpp + Piper for our scenario, but for sub-300ms STT latency a GPU instance ($100-200/mo) may be needed. The chart number reflects CPU-only deployment.
Premium English LLM line ($105) — defensible only if you intend "premium" = "flagship OpenAI." Many production English voice agents use GPT-5.4-mini ($0.75/$4.50), which would drop Premium English to ~$178. The blog's $253 represents the upper edge of "premium" stack cost.
Gemini Flash TTFT 590 ms — Artificial Analysis median for Google AI Studio. Google Vertex AI shows higher (780 ms). With aggressive implicit caching, observed production TTFT can be lower. We use the AA median to avoid optimistic claims.

Source index (all URLs cited)

This audit verifies every numerical or factual claim in the series against publicly accessible sources captured 2026-05-24. If you spot a citation that no longer loads or a number that has shifted since capture, let us know — the charts regenerate from source data.

