CLI Modes
AgenticCodingBench provides five modes, ranging from quick speed tests to full agentic session recording and replay.
Overview
Each mode targets a different dimension of LLM serving performance under agentic coding workloads. Use speed for inference benchmarking, eval for correctness, agent for real session measurement, and record / replay for capturing and replaying your own workloads.
acb speed
Inference speed under agentic load
Sends streaming requests with realistic agentic coding context (system prompts, tool schemas, file contents, conversation history) directly to any OpenAI-compatible endpoint.
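The request shape can be pictured as a standard OpenAI-style streaming chat completion with agentic context attached. A minimal sketch in Python — the system prompt, tool schema, and conversation history here are illustrative stand-ins, not the benchmark's actual generated context:

```python
# Sketch of the kind of streaming request body `acb speed` sends.
# All content strings and the tool schema below are stand-ins for the
# realistic agentic context the benchmark generates.
import json

def build_agentic_request(model: str) -> dict:
    return {
        "model": model,
        "stream": True,  # streaming is needed to measure time-to-first-token
        "messages": [
            {"role": "system", "content": "You are a coding agent with file tools."},
            {"role": "user", "content": "Fix the failing test in src/utils.py."},
            # prior tool call and its result form the conversation history
            {"role": "assistant", "content": None,
             "tool_calls": [{"id": "call_1", "type": "function",
                             "function": {"name": "read_file",
                                          "arguments": json.dumps({"path": "src/utils.py"})}}]},
            {"role": "tool", "tool_call_id": "call_1",
             "content": "def add(a, b):\n    return a - b\n"},
        ],
        "tools": [{"type": "function",
                   "function": {"name": "read_file",
                                "description": "Read a file from the workspace",
                                "parameters": {"type": "object",
                                               "properties": {"path": {"type": "string"}},
                                               "required": ["path"]}}}],
    }
```

The point is that each request carries a full agentic payload (system prompt, tool schemas, tool results, history), not a synthetic fixed-length prompt, so prefill cost resembles real agent traffic.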
Key Metrics
Examples
Default realistic sweep
acb speed -e http://localhost:8000 -m my-model

Specific concurrency and context

acb speed -e http://localhost:8000 -m my-model -u 32 -p long

Fixed token count stress test

acb speed -e http://localhost:8000 -m my-model -c 50000 -u 16

Measure prefix cache impact

acb speed -e http://localhost:8000 -m my-model --cache-mode both

JSON output for CI/CD

acb speed -e http://localhost:8000 -m my-model --format json -o results.json

acb eval
Code correctness validation
Sends agentic coding tasks and validates the generated code at three levels: syntax (does it parse?), execution (does it run?), and functional (does it produce correct output?).
Key Metrics
Examples
Syntax validation

acb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax

Execution validation

acb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution

Functional validation

acb eval -e http://localhost:8000 -m my-model -t p1-p25 -v functional

acb agent
Full agentic session benchmark via recording proxy
Runs a recording proxy between a real coding agent (like Claude Code) and your endpoint, measuring actual multi-turn agentic sessions. The proxy translates Anthropic Messages API → OpenAI Chat Completions API and records per-request timing.
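The core of that translation can be sketched as reshaping the request body; a real proxy also has to translate tool use, streaming events, and stop reasons, which this hedged sketch omits (field handling shown is an assumption about the minimal case, not the tool's actual code):

```python
# Sketch of translating an Anthropic Messages request body into an
# OpenAI Chat Completions request body. Simplified: ignores tool use,
# content blocks, and streaming-event translation on the response path.
def anthropic_to_openai(body: dict) -> dict:
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body["messages"])
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
        "stream": body.get("stream", False),
    }
```

Because the proxy sits on the request path, it can timestamp each request and response pair, which is how per-request timing for a multi-turn session is captured.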
Key Metrics
Examples
Run agent benchmark
acb agent \
-e http://localhost:8000 \
-m my-model \
-t p1-p10

acb record
Capture real coding sessions as JSONL workloads
Starts a recording proxy between your coding agent and your LLM endpoint. Every request/response pair is saved as a line of JSONL, which you can later replay against any endpoint.
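A recorded line might look like the following. Note this is an illustrative sketch only; the field names and structure are assumptions, not the tool's actual schema:

```json
{"timestamp": "2025-01-15T10:32:07Z",
 "request": {"model": "your-model",
             "messages": [{"role": "user", "content": "Run the tests"}],
             "stream": true},
 "response_text": "...",
 "timing": {"ttft_ms": 182, "total_ms": 2410, "output_tokens": 96}}
```

One JSON object per line keeps the workload appendable during a live session and trivially streamable at replay time.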
Key Metrics
Examples
Record with OpenAI-compatible upstream
acb record \
-e http://your-gpu-server:8000 \
-m your-model

Record with Anthropic
acb record \
-e https://api.anthropic.com \
-m claude-sonnet-4-20250514 \
-k $ANTHROPIC_API_KEY \
--api-key-header x-api-key \
-o my-session.jsonl

acb replay
Replay captured workloads against any endpoint
Takes a recorded workload and replays it against a different endpoint, hardware, or configuration. Requests are grouped by context size and produce the same metrics as speed mode.
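The grouping step can be sketched as bucketing each recorded request by its context size before replay; the bucket names and boundaries below are illustrative assumptions, not the tool's actual ones:

```python
# Sketch of grouping recorded requests by context size for replay.
# Bucket boundaries are illustrative, not the tool's actual thresholds.
from collections import defaultdict

BUCKETS = [(0, 4_000, "short"), (4_000, 32_000, "medium"), (32_000, 10**9, "long")]

def bucket_for(context_tokens: int) -> str:
    for lo, hi, name in BUCKETS:
        if lo <= context_tokens < hi:
            return name
    return "long"

def group_requests(requests: list[dict]) -> dict[str, list[dict]]:
    groups = defaultdict(list)
    for req in requests:
        groups[bucket_for(req["context_tokens"])].append(req)
    return dict(groups)

workload = [{"context_tokens": 1_200}, {"context_tokens": 18_000},
            {"context_tokens": 45_000}]
print(sorted(group_requests(workload)))  # ['long', 'medium', 'short']
```

Grouping by context size keeps the reported metrics comparable across runs even when the recorded session mixes short tool-result turns with very long file-heavy prompts.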
Key Metrics
Examples
Replay against a new endpoint
acb replay \
-e http://new-server:8000 \
-m my-model \
-w my-session.jsonl

Replay with report
acb replay \
-e http://new-server:8000 \
-m my-model \
-w my-session.jsonl \
-o report.md

Helper Commands
acb list-tasks - Browse Available Tasks
acb list-tasks                         # Show all 110 tasks
acb list-tasks -t trivial              # Filter by tier
acb list-tasks --tags typescript,rust  # Filter by language

acb list-workloads - Browse Built-in Workloads
acb list-workloads --format json

acb compare - Compare Two Runs
acb compare --baseline a.json --candidate b.json -o comparison.md