Provider Benchmarks

Real-world agentic inference benchmarks. Recorded coding sessions are replayed against provider endpoints to measure cost, decode throughput, and time to first token as context grows beyond 100K tokens.

Providers

Cost vs. Speed

Total session cost against decode speed or wall clock time

Throughput

Decode speed measured in output tokens per second per user
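As a rough illustration of the metric, the sketch below computes decode speed for a single request. It assumes decode time is the total response time minus the time to first token, and that the first token is excluded from the count because it arrives at the end of prefill. The page does not state the exact formula, so the function name, its parameters, and this convention are all hypothetical.

```python
def decode_tokens_per_second(output_tokens: int, ttft_s: float, total_s: float) -> float:
    """Output tokens per second for one request, excluding time to first token.

    Assumed convention (not confirmed by the source): the first token arrives
    at ttft_s, so the remaining output_tokens - 1 tokens are produced during
    the decode window total_s - ttft_s.
    """
    decode_time = total_s - ttft_s
    if output_tokens < 2 or decode_time <= 0:
        raise ValueError("need at least two output tokens and a positive decode window")
    return (output_tokens - 1) / decode_time


# Synthetic example: 501 tokens, first token after 2 s, response done after 12 s.
speed = decode_tokens_per_second(501, ttft_s=2.0, total_s=12.0)
```

Because each request is measured on its own stream, this is a per-user figure rather than an aggregate across concurrent requests.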

Throughput Distribution

Decode speed ranked by median (p50) — higher is better

Model               Provider    p50 decode speed (output tokens/sec per user)
DeepSeek v4-flash   DeepSeek    101.5
GLM 5               Z.AI        74.7
GLM 5.1             Z.AI        66.4
Kimi K2.5           Moonshot    60.8
Kimi K2.6           Moonshot    47.5
GLM 4.6             Z.AI        37.9
DeepSeek v4-pro     DeepSeek    31.4

[Chart: horizontal axis is output tokens/sec per user, 0 to 200. Legend: p50 (median), p95, min to max range.]
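The summary statistics named in the chart legend (p50, p95, min, max) can be computed from per-request decode speeds along the following lines. This is a minimal sketch, assuming a nearest-rank p95; the page does not say which percentile method it uses, and the function names and sample values here are hypothetical.

```python
import statistics


def summarize(samples: list[float]) -> dict[str, float]:
    """Return p50, p95, min, and max for a list of decode-speed samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    n = len(ordered)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or
    # below it. Integer arithmetic computes ceil(0.95 * n) without float error.
    rank = (95 * n + 99) // 100
    return {
        "p50": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "min": ordered[0],
        "max": ordered[-1],
    }


def rank_by_median(per_model: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Sort models by median decode speed, fastest first."""
    medians = {name: summarize(vals)["p50"] for name, vals in per_model.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)


# Synthetic samples for two made-up models, not taken from the benchmark.
ranking = rank_by_median({"model-a": [40.0, 50.0, 60.0], "model-b": [70.0, 90.0]})
```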

Context Scaling

How decode speed changes as context length grows

Context Length vs Throughput

Decode speed as context grows

Latency

Time to First Token (TTFT) across context sizes

TTFT Pareto Frontier

Best-case time to first token at each context size
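One way to read "best-case time to first token at each context size" is as the lower envelope across all providers: for every context size, keep the lowest TTFT observed. The sketch below follows that reading. The record layout, function name, and sample values are assumptions for illustration, not the benchmark's actual data model.

```python
def ttft_frontier(
    measurements: list[tuple[str, int, float]],
) -> dict[int, tuple[str, float]]:
    """Map each context size to the (provider, ttft_seconds) with the lowest TTFT.

    Each measurement is (provider, context_tokens, ttft_seconds).
    """
    best: dict[int, tuple[str, float]] = {}
    for provider, context_tokens, ttft_s in measurements:
        current = best.get(context_tokens)
        if current is None or ttft_s < current[1]:
            best[context_tokens] = (provider, ttft_s)
    # Sort by context size so the frontier reads left to right on a chart.
    return dict(sorted(best.items()))


# Synthetic measurements for two made-up providers.
frontier = ttft_frontier([
    ("provider-a", 100_000, 4.0),
    ("provider-b", 100_000, 2.5),
    ("provider-a", 50_000, 1.0),
    ("provider-b", 50_000, 1.5),
])
```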

Provider Details

Expand each provider to see their models and full benchmark breakdowns

Methodology

All benchmarks use asb replay mode: a recorded agentic coding session is replayed against each provider endpoint. Cost is derived by applying each provider's pricing to the tokens actually consumed during the replay.
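The cost derivation can be sketched as below, assuming pricing is quoted per million input and output tokens and that every request in the session is billed at the same rates. Real provider pricing may also distinguish cached input tokens, which this sketch ignores. The `Pricing` class, `session_cost` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Assumed price schedule in currency units per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def session_cost(requests: list[tuple[int, int]], pricing: Pricing) -> float:
    """Total cost of a replayed session.

    Each request is (input_tokens, output_tokens) as reported by the provider.
    """
    total_in = sum(inp for inp, _ in requests)
    total_out = sum(out for _, out in requests)
    return (
        total_in * pricing.input_per_mtok + total_out * pricing.output_per_mtok
    ) / 1_000_000


# Synthetic two-request session with made-up prices.
cost = session_cost(
    [(100_000, 1_000), (150_000, 2_000)],
    Pricing(input_per_mtok=1.0, output_per_mtok=2.0),
)
```

In an agentic session each request resends the accumulated context, so input tokens tend to dominate the total as the session grows.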
