by SwarmOne

AgenticCodingBench

The open-source benchmark for LLM inference under agentic coding workloads. Measure what actually matters: speed from 6K to 400K tokens.

$ pip install agentic-coding-bench
View on GitHub

Why Another Benchmark?

Existing benchmarks don't test what agentic coding tools actually do: growing multi-turn contexts packed with tool calls, code files, and error traces.

Benchmark           Measures                 Context size         Request pattern        Content                     Cache impact
SWE-bench           Model quality            Varies               Single-turn            GitHub issues               N/A
LMSys Arena         Chatbot speed            ~2K                  Single-turn            Chat messages               N/A
Generic benchmarks  Uniform throughput       Uniform              Uniform                Generic text                N/A
ACB                 Agentic inference speed  6K → 400K (growing)  Multi-turn with tools  Tool schemas, code, errors  Cold vs warm

What It Measures

Seven key metrics that determine whether your LLM serving stack is ready for agentic coding.

TTFT: Time to First Token (ms)
How long until the first token arrives. Critical for perceived responsiveness in editors.

Tok/s per user: Decode Tokens/Second (tok/s)
Streaming speed per concurrent user. Determines how fast code appears.

Prefill tok/s: Prefill Tokens/Second (tok/s)
Speed of processing the input context. The bottleneck for large contexts.

ITL: Inter-Token Latency (ms)
Time between consecutive tokens at p50/p95/p99. Drives streaming smoothness.

Throughput: Aggregate Throughput (tok/s)
Total tokens per second across all concurrent users. Measures serving capacity.

Reasoning Overhead: Reasoning Token Overhead (ms)
Extra latency from chain-of-thought or thinking tokens before visible output.

Cache Speedup: Prefix Cache Speedup (×)
Cold-vs-warm TTFT ratio. Shows prefix caching effectiveness for repeated contexts.
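To make these definitions concrete, here is a minimal Python sketch of how such metrics can be derived from per-token arrival timestamps of one streamed response. This is illustrative only, not ACB's internals; the function and variable names are our own, and prefill speed is approximated from prompt tokens divided by TTFT.

```python
"""Illustrative metric math, assuming we recorded when each streamed
output token arrived (seconds). Not ACB's actual implementation."""
from statistics import quantiles

def metrics_from_timestamps(t_start, token_times, prompt_tokens):
    """t_start: when the request was sent; token_times: arrival time
    of each streamed output token, in order."""
    ttft = token_times[0] - t_start                  # Time to First Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    decode_window = token_times[-1] - token_times[0]
    decode_tok_s = len(gaps) / decode_window if decode_window else 0.0
    # Approximation: all prompt tokens are processed before token one.
    prefill_tok_s = prompt_tokens / ttft
    cuts = quantiles(gaps, n=100)                    # 99 percentile cut points
    itl = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return {"ttft_s": ttft, "decode_tok_s": decode_tok_s,
            "prefill_tok_s": prefill_tok_s, "itl_s": itl}

def cache_speedup(ttft_cold, ttft_warm):
    """Prefix-cache speedup is simply the cold/warm TTFT ratio."""
    return ttft_cold / ttft_warm
```

Aggregate throughput then falls out by summing decode tok/s across all concurrent users in a run.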

7 Context Profiles

Real coding sessions grow from 6K to 400K tokens. ACB tests every stage of that journey.

Profile  Context
fresh    6K
short    20K
medium   40K
long     70K
full     100K
xl       200K
xxl      400K

Each profile includes system prompts, tool schemas, code files, conversation history, and error traces.
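As a rough illustration of the seven token budgets (not ACB's actual payload generator, which bundles real tool schemas, code files, and error traces), a synthetic transcript can be padded to each profile's size, approximating one token per word. `PROFILES` and `build_payload` are hypothetical names for this sketch.

```python
# Hypothetical sketch: the seven context profiles and a crude payload
# builder that pads seed text to the target budget (1 token ≈ 1 word).
PROFILES = {"fresh": 6_000, "short": 20_000, "medium": 40_000,
            "long": 70_000, "full": 100_000, "xl": 200_000, "xxl": 400_000}

def build_payload(profile: str, seed_text: str = "tool call trace") -> str:
    budget = PROFILES[profile]
    seed_words = seed_text.split()
    # Repeat the seed until it covers the budget, then trim exactly.
    words = (seed_words * (budget // len(seed_words) + 1))[:budget]
    return " ".join(words)
```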

5 Modes, One Tool

From quick speed tests to full agentic session recording and replay.

acb speed

Inference speed under agentic coding load

Key Metrics

TTFT · Tok/s · ITL · Prefill · Throughput

Usage

$ acb speed \
  --endpoint http://localhost:8000 \
  --model my-model \
  --suite quick

Quick Start

Three steps to benchmark your serving stack.

1
Install
$ pip install agentic-coding-bench
2
Run
$ acb speed \
  --endpoint http://localhost:8000 \
  --model my-model \
  --suite quick
3
Docker
$ docker run --rm \
  -e ENDPOINT=http://host.docker.internal:8000 \
  -e MODEL=my-model \
  ghcr.io/swarmone/acb:latest speed --suite quick

Sample Report Output

Every run produces a verdict with key findings and a detailed breakdown.

meta-llama/Meta-Llama-3.1-70B
http://localhost:8000 • suite: standard
🟢 GOOD

Key Findings

  • TTFT stays under 3s through 40K context, responsive for active coding
  • Decode rate holds above 30 tok/s per user up to 100K context
  • Prefix caching delivers a 3.5× TTFT speedup at medium context
  • p95 ITL spikes above 50ms at 100K context and may cause visible streaming stutter
  • 8-user concurrency degrades TTFT by 4-5× versus the single-user baseline

Summary Table

Context       Users  TTFT   Tok/s     Verdict
fresh (6K)    1      180ms  65 tok/s  🟢 GOOD
short (20K)   1      520ms  58 tok/s  🟢 GOOD
medium (40K)  1      1.1s   48 tok/s  🟢 GOOD
medium (40K)  8      4.8s   14 tok/s  🟡 MARGINAL
long (70K)    1      2.2s   38 tok/s  🟢 GOOD
full (100K)   1      4.2s   30 tok/s  🟢 GOOD
full (100K)   8      15.0s  8 tok/s   🔴 POOR

What Good Looks Like

Reference ranges from real hardware. Use these as baselines when evaluating your own results.

Setup               Context  Users  TTFT    Tok/s per user  Verdict
vLLM 1×A100, 8B     6K       1      ~100ms  ~80-120         🟢 GOOD
vLLM 1×A100, 8B     40K      1      ~600ms  ~60-95          🟢 GOOD
vLLM 1×A100, 8B     40K      8      ~2-4s   ~20-40          🟡 MARGINAL
vLLM 1×A100, 8B     100K     1      ~2s     ~50-70          🟢 GOOD
vLLM 1×A100, 8B     100K     8      ~8-12s  ~10-18          🔴 POOR
SGLang 1×H100, 70B  6K       1      ~180ms  ~55-65          🟢 GOOD
SGLang 1×H100, 70B  40K      1      ~1.1s   ~40-50          🟢 GOOD
SGLang 1×H100, 70B  100K     1      ~4.2s   ~25-35          🟢 GOOD
TTFT < 3s at 40K: responsive editing experience
Tok/s > 30/user: smooth code streaming
TTFT < 10s at 100K: acceptable for deep sessions
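The three thresholds above can be read as a simple classification rule. The sketch below is our illustrative interpretation, not ACB's actual scoring, which may weigh more signals (for example, judging multi-user runs against per-concurrency baselines).

```python
def verdict(context_tokens: int, ttft_s: float, tok_s: float) -> str:
    """Illustrative verdict rule derived from the published thresholds:
    TTFT < 3s up to 40K context, TTFT < 10s beyond, and > 30 tok/s
    per user for smooth streaming."""
    good_ttft = ttft_s < (3.0 if context_tokens <= 40_000 else 10.0)
    good_stream = tok_s > 30.0
    if good_ttft and good_stream:
        return "GOOD"
    if good_ttft or good_stream:
        return "MARGINAL"
    return "POOR"
```

For example, the 6K single-user row in the sample report (180ms, 65 tok/s) classifies as GOOD, while the 100K eight-user row (15.0s, 8 tok/s) classifies as POOR.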