CLI Modes

AgenticCodingBench has five modes, from quick speed tests to full agentic session recording and replay: speed, eval, agent, record, and replay.

Overview

Each mode targets a different dimension of LLM serving performance under agentic coding workloads. Use speed for inference benchmarking, eval for correctness, agent for real session measurement, and record / replay for capturing and replaying your own workloads.

acb speed

Inference speed under agentic load

Sends streaming requests with realistic agentic coding context (system prompts, tool schemas, file contents, conversation history) directly to any OpenAI-compatible endpoint.

Key Metrics

TTFT · Tok/s per user · ITL (p50/p95/p99) · Prefill tok/s · Aggregate throughput · Reasoning overhead
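The latency metrics above can all be derived from per-token arrival timestamps on a streamed response. A minimal sketch of that derivation (a conceptual illustration with made-up timestamps, not acb's implementation):

```python
# Conceptual sketch of TTFT/ITL computation from token arrival times.
# The timestamps below are made-up sample data, not acb output.

def latency_stats(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT and inter-token latency (ITL) percentiles for one stream."""
    ttft = token_times[0] - request_start
    # ITL: gap between consecutive token arrivals.
    itls = sorted(b - a for a, b in zip(token_times, token_times[1:]))

    def pct(p: float) -> float:
        # Nearest-rank percentile; fine for a sketch.
        idx = min(len(itls) - 1, int(p / 100 * len(itls)))
        return itls[idx]

    return {"ttft": ttft, "itl_p50": pct(50), "itl_p95": pct(95), "itl_p99": pct(99)}

# Example: request sent at t=0.0, tokens arrive at these times (seconds).
stats = latency_stats(0.0, [0.12, 0.15, 0.18, 0.22, 0.25])
print(round(stats["ttft"], 2))  # 0.12
```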

Examples

Default realistic sweep

acb speed -e http://localhost:8000 -m my-model

Specific concurrency and context

acb speed -e http://localhost:8000 -m my-model -u 32 -p long

Fixed token count stress test

acb speed -e http://localhost:8000 -m my-model -c 50000 -u 16

Measure prefix cache impact

acb speed -e http://localhost:8000 -m my-model --cache-mode both

JSON output for CI/CD

acb speed -e http://localhost:8000 -m my-model --format json -o results.json

acb eval

Code correctness validation

Sends agentic coding tasks and validates the generated code at three levels: syntax (does it parse?), execution (does it run?), and functional (does it produce correct output?).
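As a conceptual illustration of the three tiers (not acb's actual harness), here is what each level roughly checks on a generated Python snippet; the snippet and expected output are made-up examples:

```python
# Sketch of the three validation tiers on a generated snippet.
# The snippet and expected result are made-up examples, not an acb task.
import ast

generated = "def add(a, b):\n    return a + b\n"

# Tier 1 -- syntax: does it parse?
ast.parse(generated)  # raises SyntaxError on failure

# Tier 2 -- execution: does it run?
namespace: dict = {}
exec(generated, namespace)  # raises on runtime errors

# Tier 3 -- functional: does it produce correct output?
assert namespace["add"](2, 3) == 5
print("all three tiers passed")
```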

Key Metrics

Syntax pass rate · Execution pass rate · Functional correctness · Tier breakdown

Examples

Syntax validation

acb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax

Execution validation

acb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution

Functional validation

acb eval -e http://localhost:8000 -m my-model -t p1-p25 -v functional

acb agent

Full agentic session benchmark via recording proxy

Runs a recording proxy between a real coding agent (such as Claude Code) and your endpoint, measuring actual multi-turn agentic sessions. The proxy translates the Anthropic Messages API to the OpenAI Chat Completions API and records per-request timing.
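The translation step can be pictured roughly as follows. This is a simplified sketch of the API-shape difference only; it ignores tool calls, streaming chunks, and multi-part content, and is not the proxy's actual code:

```python
# Simplified sketch of Anthropic Messages -> OpenAI Chat Completions
# translation (ignores tool use, streaming, and multi-part content).

def anthropic_to_openai(body: dict) -> dict:
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message in the list.
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body["messages"])
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
        "stream": True,
    }

req = anthropic_to_openai({
    "model": "my-model",
    "system": "You are a coding assistant.",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Fix the bug in utils.py"}],
})
print(req["messages"][0]["role"])  # system
```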

Key Metrics

Session TTFT · Multi-turn latency growth · Tool call overhead · Context window scaling

Examples

Run agent benchmark

acb agent \
  -e http://localhost:8000 \
  -m my-model \
  -t p1-p10

acb record

Capture real coding sessions as JSONL workloads

Starts a recording proxy between your coding agent and your LLM endpoint. Every request/response pair is saved as one JSONL line, and the resulting workload can later be replayed against any endpoint with acb replay.
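Conceptually, a JSONL workload is just one JSON object per line, so the request count equals the line count. A sketch of that layout (the field names here are illustrative assumptions, not acb's documented schema):

```python
import json

# Illustrative request/response pairs (field names are assumptions,
# not acb's documented schema).
pairs = [
    {"request": {"messages": [{"role": "user", "content": "hi"}]}, "response": "hello"},
    {"request": {"messages": [{"role": "user", "content": "fix bug"}]}, "response": "done"},
]

with open("session-example.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")  # one pair per line

with open("session-example.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 2
```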

Key Metrics

Request count · Context sizes · Tool calls

Examples

Record with OpenAI-compatible upstream

acb record \
  -e http://your-gpu-server:8000 \
  -m your-model

Record with Anthropic

acb record \
  -e https://api.anthropic.com \
  -m claude-sonnet-4-20250514 \
  -k $ANTHROPIC_API_KEY \
  --api-key-header x-api-key \
  -o my-session.jsonl

acb replay

Replay captured workloads against any endpoint

Takes a recorded workload and replays it against a different endpoint, hardware setup, or configuration. Requests are grouped by context size, and the run produces the same metrics as speed mode.
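Grouping by context size can be pictured as simple bucketing of prompt-token counts. A sketch with made-up sizes and illustrative bucket boundaries (acb's actual grouping may differ):

```python
from collections import defaultdict

# Made-up context sizes (prompt tokens) for five recorded requests.
context_sizes = [850, 12_400, 980, 45_000, 13_100]

def bucket(tokens: int) -> str:
    # Illustrative boundaries; acb's actual grouping may differ.
    if tokens < 4_000:
        return "short"
    if tokens < 32_000:
        return "medium"
    return "long"

groups: dict[str, list[int]] = defaultdict(list)
for n in context_sizes:
    groups[bucket(n)].append(n)

print({k: len(v) for k, v in sorted(groups.items())})  # {'long': 1, 'medium': 2, 'short': 2}
```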

Key Metrics

TTFT · Tok/s · ITL · Comparison delta

Examples

Replay against a new endpoint

acb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl

Replay with report

acb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl \
  -o report.md

Helper Commands

acb list-tasks - Browse Available Tasks

acb list-tasks                        # Show all 110 tasks
acb list-tasks -t trivial             # Filter by tier
acb list-tasks --tags typescript,rust  # Filter by language

acb list-workloads - Browse Built-in Workloads

acb list-workloads --format json

acb compare - Compare Two Runs

acb compare --baseline a.json --candidate b.json -o comparison.md
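At its core, the comparison boils down to per-metric deltas between a baseline and a candidate run. A sketch with made-up numbers and an assumed flat metric layout (not acb's documented output format):

```python
# Sketch of a per-metric comparison delta (metric names and layout
# are assumptions, not acb's documented output format).
baseline = {"ttft_ms": 120.0, "tok_per_s": 42.0}
candidate = {"ttft_ms": 96.0, "tok_per_s": 55.0}

# Percent change relative to the baseline for each shared metric.
deltas = {
    name: round((candidate[name] - baseline[name]) / baseline[name] * 100, 1)
    for name in baseline
}
print(deltas)  # {'ttft_ms': -20.0, 'tok_per_s': 31.0}
```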