AgenticCodingBench
The open-source benchmark for LLM inference under agentic coding workloads. Measure what actually matters: speed from 6K to 400K tokens.
Why Another Benchmark?
Existing benchmarks don't test what agentic coding tools actually do - growing multi-turn contexts with tool calls, code files, and error traces.
| Benchmark | Measures | Context size | Request pattern | Content | Cache impact |
|---|---|---|---|---|---|
| SWE-bench | Model quality | Varies | Single-turn | GitHub issues | N/A |
| LMSys Arena | Chatbot speed | ~2K | Single-turn | Chat messages | N/A |
| Generic benchmarks | Uniform throughput | Uniform | Uniform | Generic text | N/A |
| ACB | Agentic inference speed | 6K → 400K (growing) | Multi-turn with tools | Tool schemas, code, errors | Cold vs warm |
What It Measures
Seven key metrics that determine whether your LLM serving stack is ready for agentic coding.
TTFT
Time to First Token
How long until the first token arrives. Critical for perceived responsiveness in editors.
Tok/s per user
Decode Tokens/Second
Streaming speed per concurrent user. Determines how fast code appears.
Prefill tok/s
Prefill Tokens/Second
Speed of processing input context. Bottleneck for large contexts.
ITL
Inter-Token Latency
Time between consecutive tokens at p50/p95/p99. Drives streaming smoothness.
Throughput
Aggregate Throughput
Total tokens/second across all concurrent users. Measures serving capacity.
Reasoning Overhead
Reasoning Token Overhead
Extra latency from chain-of-thought or thinking tokens before visible output.
Cache Speedup
Prefix Cache Speedup
Cold vs warm TTFT ratio. Shows prefix caching effectiveness for repeated contexts.
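All of these metrics can be derived from per-token arrival timestamps on a streamed response. The sketch below is illustrative only: the function and field names are not the ACB API, and prefill tok/s is approximated by treating TTFT as pure prefill time.

```python
def summarize_stream(request_start, token_times, prompt_tokens, cold_ttft=None):
    """Summarize one streamed response from per-token arrival timestamps.

    token_times: monotonically increasing arrival times (seconds) of each
    output token; request_start is when the request was sent.
    Illustrative sketch, not the ACB implementation.
    """
    ttft = token_times[0] - request_start             # time to first token
    decode_window = token_times[-1] - token_times[0]  # time spent streaming
    # Inter-token latencies: gaps between consecutive tokens.
    itls = sorted(b - a for a, b in zip(token_times, token_times[1:]))
    pct = lambda p: itls[min(len(itls) - 1, int(p * len(itls)))]
    return {
        "ttft_s": ttft,
        "decode_tok_s": (len(token_times) - 1) / decode_window if decode_window else 0.0,
        "prefill_tok_s": prompt_tokens / ttft,        # approximation: TTFT ~ prefill time
        "itl_p50_ms": pct(0.50) * 1000,
        "itl_p95_ms": pct(0.95) * 1000,
        "itl_p99_ms": pct(0.99) * 1000,
        # Prefix cache speedup: cold vs warm TTFT ratio, when a cold run is supplied.
        "cache_speedup": (cold_ttft / ttft) if cold_ttft else None,
    }

# Example: 5 tokens arriving at a steady 20ms cadence after a 200ms TTFT.
m = summarize_stream(0.0, [0.2, 0.22, 0.24, 0.26, 0.28], prompt_tokens=6000)
```

Reasoning overhead and aggregate throughput fall out of the same data: subtract visible-output time from total generation time, and sum decode tok/s across concurrent users.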
7 Context Profiles
Real coding sessions grow from 6K to 400K tokens. ACB tests every stage of that journey.
Each profile includes system prompts, tool schemas, code files, conversation history, and error traces
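A profile of a given size can be sketched as a token-budgeted assembly of those ingredients: the fixed preamble always goes in, and code, history, and errors fill the remaining budget. Everything below is illustrative (including the rough 4-characters-per-token estimate); a real harness would use the model's tokenizer.

```python
def build_profile(target_tokens, system_prompt, tool_schemas,
                  code_files, history, error_traces):
    """Assemble a synthetic context of roughly target_tokens tokens.

    Token counts are estimated at ~4 characters per token. Illustrative
    sketch only, not how ACB constructs its profiles.
    """
    est = lambda text: max(1, len(text) // 4)
    parts = [system_prompt, *tool_schemas]  # fixed preamble, always included
    budget = target_tokens - sum(est(p) for p in parts)
    # Fill the remaining budget with code, history, and errors, in order.
    for chunk in [*code_files, *history, *error_traces]:
        cost = est(chunk)
        if cost > budget:
            break
        parts.append(chunk)
        budget -= cost
    return "\n\n".join(parts)
```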
5 Modes, One Tool
From quick speed tests to full agentic session recording and replay.
acb speed
Inference speed under agentic coding load
Usage
$ acb speed \
    --endpoint http://localhost:8000 \
    --model my-model \
    --suite quick

Quick Start
Three steps to benchmark your serving stack.
$ pip install agentic-coding-bench

$ acb speed \
    --endpoint http://localhost:8000 \
    --model my-model \
    --suite quick

$ docker run --rm \
    -e ENDPOINT=http://host.docker.internal:8000 \
    -e MODEL=my-model \
    ghcr.io/swarmone/acb:latest speed --suite quick

Sample Report Output
Every run produces a verdict with key findings and a detailed breakdown.
Key Findings
- TTFT stays under 3s through 40K context - responsive for active coding
- Decode rate holds above 30 tok/s per user up to 100K context
- Prefix caching delivers 3.5× TTFT speedup at medium context
- p95 ITL spikes above 50ms at 100K context - may cause visible streaming stutter
- 8-user concurrency degrades TTFT by 4-5× versus single-user baseline
Summary Table
| Context | Users | TTFT | Tok/s | Verdict |
|---|---|---|---|---|
| fresh (6K) | 1 | 180ms | 65 tok/s | 🟢 GOOD |
| short (20K) | 1 | 520ms | 58 tok/s | 🟢 GOOD |
| medium (40K) | 1 | 1.1s | 48 tok/s | 🟢 GOOD |
| medium (40K) | 8 | 4.8s | 14 tok/s | 🟡 MARGINAL |
| long (70K) | 1 | 2.2s | 38 tok/s | 🟢 GOOD |
| full (100K) | 1 | 4.2s | 30 tok/s | 🟢 GOOD |
| full (100K) | 8 | 15.0s | 8 tok/s | 🔴 POOR |
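The verdict column suggests simple threshold grading on TTFT and per-user decode speed. The cutoffs below are hypothetical, chosen only to be consistent with the sample rows above; they are not ACB's actual rules.

```python
def verdict(ttft_s, tok_s):
    """Grade a (TTFT seconds, decode tok/s per user) pair.

    Thresholds are illustrative guesses matching the sample report,
    not ACB's real cutoffs.
    """
    if ttft_s <= 4.5 and tok_s >= 25:
        return "GOOD"      # responsive start and fluent streaming
    if tok_s >= 10:
        return "MARGINAL"  # usable, but noticeably degraded
    return "POOR"          # too slow for interactive coding
```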
What Good Looks Like
Reference ranges from real hardware. Use these as baselines when evaluating your own results.
| Setup | Context | Users | TTFT | Tok/s per user | Verdict |
|---|---|---|---|---|---|
| vLLM 1×A100, 8B | 6K | 1 | ~100ms | ~80-120 | 🟢 GOOD |
| vLLM 1×A100, 8B | 40K | 1 | ~600ms | ~60-95 | 🟢 GOOD |
| vLLM 1×A100, 8B | 40K | 8 | ~2-4s | ~20-40 | 🟡 MARGINAL |
| vLLM 1×A100, 8B | 100K | 1 | ~2s | ~50-70 | 🟢 GOOD |
| vLLM 1×A100, 8B | 100K | 8 | ~8-12s | ~10-18 | 🔴 POOR |
| SGLang 1×H100, 70B | 6K | 1 | ~180ms | ~55-65 | 🟢 GOOD |
| SGLang 1×H100, 70B | 40K | 1 | ~1.1s | ~40-50 | 🟢 GOOD |
| SGLang 1×H100, 70B | 100K | 1 | ~4.2s | ~25-35 | 🟢 GOOD |