3. LLM: the inner triangle
The LLM is where cost, latency, and intelligence pull in three different directions hardest.
Throughput-per-minute and context window? Mostly irrelevant for voice agents. Two things actually matter.
- TTFT (time to first token). Sets the floor on perceived latency. Anything over ~700 ms feels sluggish.
- Cost per call. A 60-second voice call generates roughly 3,000 input tokens (system prompt + 7 turns of transcript) and 600 output tokens (7 short agent replies). Pick that envelope, and the model picks itself.
Here’s the voice-agent-relevant LLM frontier. Non-reasoning variants only, because anything with reasoning enabled has 5+ second TTFT and is unusable for live conversation.
Three things jump out.
The Indian/OSS cluster on the left is competitive. DeepSeek V4 Flash at $0.14/$0.28 per 1M tokens with AA Intelligence Index 36 (non-reasoning) is, for voice-agent purposes, smarter than Claude Haiku 4.5 and 7× cheaper. (DeepSeek pricing, Artificial Analysis leaderboard, 2026-05-24.) Grok 4.1 Fast at $0.20/$0.50 with AA Index 39 is similar. The Western premium is real but it isn’t 10×.
Big bubbles aren’t smart. Llama 3.3 70B on Groq is the giant teal bubble at 280 tokens/sec. Beautiful throughput, AA Index 14. Fast doesn’t mean smart enough for slot-filling on a Hinglish work order. The voice-agent sweet spot is a small-medium bubble at $0.25-$1.00/1M output: Gemini 3 Flash, DeepSeek V4 Flash, Grok 4.1 Fast, GPT-OSS 120B on Groq.
The reasoning models aren’t on this chart. Claude Sonnet 4.6 in non-reasoning mode is here (AA Index 44 at 1.34s TTFT). With max reasoning it would be at AA 52 and 81 seconds of TTFT. You can’t have a voice agent that takes 81 seconds to start speaking. Reasoning is for offline batch.
Prompt caching changes everything
The single biggest cost lever on a system-prompt-heavy voice agent is prompt caching. (Every voice agent is system-prompt-heavy because the agent needs ~1500 tokens of instructions, asset catalog, and Hinglish examples to work.)
Numbers from the research:
| Provider | Cached input discount | TTL | Source |
|---|---|---|---|
| Anthropic Claude (5-min cache) | 90% off cache reads | 5 minutes | claude.com pricing |
| Anthropic Claude (1-hour cache) | 80% off cache reads, +25% on writes | 1 hour | same |
| Google Gemini (implicit) | ~90% off cached tokens | implicit | ai.google.dev pricing |
| OpenAI cached input | 75–90% off | implicit (10-min retention) | developers.openai.com pricing |
| DeepSeek context caching | 98% off | implicit | api-docs.deepseek.com |
| xAI Grok cached input | 75% off | implicit | docs.x.ai |
A worked example: 1500-token system prompt on Claude Sonnet 4.6, 1000 calls/day.
- Without cache: 1000 × 1500 × $3.00/1M = $4.50/day input alone.
- With 5-min cache (assume 90% hit rate): 1000 × 1500 × ($3.00 × 0.1 + $3.00 × 0.9 × 0.10) = $0.66/day.
That’s 7× the cost wiped out by one config flag. If your voice agent doesn’t use prompt caching, you’re paying for nothing.
Context window pricing
A million-token context is now table stakes. But it is not free, and it is not flat across vendors:
DeepSeek V4 Flash gives you a 1M-token context at $0.21 blended per 1M tokens. Claude Sonnet 4.6 gives you the same 1M context at $6 blended. That’s a 29× price spread for the same capability ceiling. Obviously you’re paying for AA Index 44 on Sonnet vs 36 on DeepSeek, but still.
For voice agents, the practical context window need is much smaller than this suggests. A 7-turn call uses maybe 4,000 tokens of history. The 200K window on Haiku 4.5 is already overkill. Optimize for cost-per-call, not context.
4. TTS, where naturalness costs you
TTS is the dirtiest fight in voice AI in 2026. Lots of marketing, less honesty.
The metric that matters is TTFB (time to first audio byte). Total synthesis time doesn’t matter as much as you’d think. Audio streams, the client buffers, the operator hears something. Latency is felt at the first byte, not the last.
The story this chart tells:
Cartesia Sonic-3 is the published TTFB winner at ~40 ms. Deepgram Aura-2 at 90 ms. ElevenLabs Flash v2.5 at 75 ms inference-only. These are vendor-published numbers.
Independent benchmarks tell a different story. Async.com measured ElevenLabs Flash v2.5 at 251 ms median TTFB from us-central1. Vexyl measured the same model at 478 ms from India. That’s 6× the vendor claim. The vendor number excludes network round-trip, which is exactly the thing voice agents can’t exclude. (Async TTS benchmark, Vexyl India test, both captured 2026-05-24.)
Sarvam Bulbul v3 has no published TTFB. The 300 ms point on this chart is measured on our production cascade. Observably correct on the apps/voice stack but not reproducible from a vendor page. That’s the chart’s honest gap.
Inworld TTS-1.5 Max is now #1 on Artificial Analysis ELO (1,236), better than ElevenLabs v3 (1,179), and 5-20× cheaper. This rewrites the “ElevenLabs is the quality benchmark” assumption that dominated 2024-25 voice agent design. (Artificial Analysis TTS, 2026-05-24.)
Sarvam Bulbul v2 at $18/M chars is dramatically cheaper than every Western Hindi-capable provider. 5.5× cheaper than ElevenLabs Multilingual v2 at $100/M. For Indian manufacturing voice agents this is the dominant economic argument.
Three TTS providers we’d never use again for English voice agents
- Deepgram Aura-2. Excellent English quality, 90 ms TTFB, but no Hindi. Only 7 languages. If you ever need to expand beyond English you have to swap. (Deepgram TTS models.)
- Rime AI Mist. English-only. Strong streaming but again, language-limited.
- Speechmatics TTS. New entrant Q2 2026, English-only at launch.
TTS providers worth knowing about
The chart above shows the providers most voice-agent teams actually pick from, but a few more are worth noting:
- MiniMax Speech 2.6 HD. $100/M chars, Artificial Analysis ELO ~1,156 (second tier behind Inworld and ElevenLabs v3). Strong multilingual; emerging quality contender.
- Hume AI Octave 2. $7.60/M chars at the high tier with ~100 ms latency. The cheapest “natural-enough” TTS that doesn’t sacrifice prosody. Worth piloting for English-leaning agents.
- Kokoro 82M (open-weight). ~$0.70/M chars effective cost when self-hosted (Apache-licensed), Artificial Analysis ELO ~1,059, runs real-time on a CPU. The cheapest viable OSS English TTS in 2026.
The 2024 instinct was “use ElevenLabs for everything.” The 2026 reality is: pick by language coverage first, latency second, quality third. Inworld and Sarvam will likely take ElevenLabs’s voice-agent market share over the next 12 months. And the open-weight options (Kokoro, Piper) have closed the gap enough that English self-hosting is genuinely viable.