$ Leaderboard
Real benchmark results from the AgenticCodingBench suite. Every row is a reproducible configuration tested under agentic coding workloads, with TTFT, throughput, and latency measured across 6K to 400K token contexts.
10 entries
| # | Configuration | Engine | Model | Hardware | TTFT | Latency | Throughput (tok/s) | Verdict | Date |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 | vLLM on 1×A100 80GB - Llama 3.1 8B | vLLM | Llama 3.1 8B | 1×A100 80GB | 95ms | 620ms | 95 | 🟢 GOOD | 2026-03-15 |
| 🥈 | vLLM on 1×A100 80GB - Llama 3.1 8B (8 users) | vLLM | Llama 3.1 8B | 1×A100 80GB | 320ms | 2.8s | 28 | 🟡 MARGINAL | 2026-03-15 |
| 🥉 | SGLang on 1×H100 - Llama 3.1 70B | SGLang | Llama 3.1 70B | 1×H100 80GB | 180ms | 1.1s | 48 | 🟢 GOOD | 2026-03-20 |
| 4 | SGLang on 1×H100 - Llama 3.1 70B (8 users) | SGLang | Llama 3.1 70B | 1×H100 80GB | 650ms | 4.8s | 14 | 🟡 MARGINAL | 2026-03-20 |
| 5 | TGI on 1×A100 80GB - Llama 3.1 8B | TGI | Llama 3.1 8B | 1×A100 80GB | 110ms | 750ms | 85 | 🟢 GOOD | 2026-03-18 |
| 6 | Together AI API - Llama 3.1 70B | API | Llama 3.1 70B | Managed | 250ms | 1.4s | 42 | 🟢 GOOD | 2026-03-22 |
| 7 | Fireworks API - Llama 3.1 70B | API | Llama 3.1 70B | Managed | 210ms | 1.2s | 48 | 🟢 GOOD | 2026-03-25 |
| 8 | vLLM on 1×A100 80GB - Llama 3.1 8B (32 users) | vLLM | Llama 3.1 8B | 1×A100 80GB | 1.2s | 12.0s | 6 | 🔴 POOR | 2026-03-28 |
| 9 | vLLM on 2×H100 - DeepSeek R1 (Reasoning) | vLLM | DeepSeek R1 | 2×H100 80GB | 450ms | 2.8s | 32 | 🟡 MARGINAL | 2026-04-01 |
| 10 | SGLang on 1×H100 - Llama 3.1 70B (Cache test) | SGLang | Llama 3.1 70B | 1×H100 80GB | 185ms | 1.1s | 50 | 🟢 GOOD | 2026-04-05 |
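The GOOD/MARGINAL/POOR verdicts track sustained throughput: every row at or above roughly 40 tok/s is green, rows between roughly 10 and 40 tok/s are yellow, and the 6 tok/s row is red. A minimal sketch of that classification, with thresholds inferred from the table above rather than taken from the official suite:

```python
# Sketch: deriving the leaderboard verdict from decode throughput.
# The 40 and 10 tok/s thresholds are ASSUMPTIONS inferred from the
# table rows, not documented AgenticCodingBench behavior.

def verdict(tokens_per_sec: float) -> str:
    """Classify a run by sustained throughput (assumed thresholds)."""
    if tokens_per_sec >= 40:
        return "GOOD"
    if tokens_per_sec >= 10:
        return "MARGINAL"
    return "POOR"

# Spot-check against three leaderboard rows:
print(verdict(95))  # vLLM / Llama 3.1 8B, single user  -> GOOD
print(verdict(28))  # vLLM / Llama 3.1 8B, 8 users      -> MARGINAL
print(verdict(6))   # vLLM / Llama 3.1 8B, 32 users     -> POOR
```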
$ Submit Your Results
Got a serving stack, model, or hardware combo that's not listed? Run the benchmark and submit your results via pull request. Every entry requires a reproducible configuration and raw metrics.
1. Install: `pip install agentic-coding-bench`
2. Run: `acb speed --endpoint YOUR_ENDPOINT --model YOUR_MODEL --suite full`
3. Submit the JSON report as a PR to `data/leaderboard/`
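For orientation, a report submitted to `data/leaderboard/` might look something like the fragment below. The field names and layout here are an illustrative assumption (the actual schema is whatever `acb speed` emits); the values mirror the top leaderboard row.

```json
{
  "config": {
    "engine": "vllm",
    "model": "Llama 3.1 8B",
    "hardware": "1xA100 80GB",
    "concurrent_users": 1
  },
  "metrics": {
    "ttft_ms": 95,
    "latency_ms": 620,
    "tokens_per_sec": 95
  },
  "date": "2026-03-15"
}
```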