Documentation

Build with Benchlist.

A CLI, three SDKs, one canonical run format. Ship an attested score in about three minutes.

Benchlist CLI — a terminal running a benchmark submission

One API call.

Posting a test is a single HTTP request. No CLI required, no uploads, no dashboard to open. You call POST /v1/run with a (service, model, benchmark) tuple; we run it on a staked attestor, commit the Merkle root, submit the proof to Aligned Layer, and settle on Ethereum L1. The response includes a verify_url that goes live once the proof verifies, typically within about three minutes.

curl -X POST https://api.benchlist.ai/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service":   "anthropic-claude",
    "model":     "claude-opus-4-7",
    "benchmark": "mbpp",
    "runs":      3
  }'

# → 202 Accepted
# {
#   "run_id":    "run-8f3a...",
#   "status":    "queued",
#   "est_seconds": 180,
#   "charge":    { "credits": 1, "usd": 5.00 },
#   "verify_url": "https://benchlist.ai/verify/run-8f3a..."
# }

That’s the whole flow. The run’s status advances through queued → running → committed → proving → verified. Subscribe to a webhook on run.verified to ship badges or trigger downstream jobs. Each run costs $5, deducted from your credit balance; that price includes the Ethereum mainnet gas your proof settles under.
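
The webhook flow above can be sketched as a small handler. The payload shape here (the event, run_id, and verify_url fields) is an assumption for illustration; check the webhook reference for the real schema.

```python
import json

def handle_benchlist_event(raw_body):
    """Handle a Benchlist webhook delivery.

    Returns the verify URL on run.verified, or None for intermediate
    status events. The field names are assumed, not documented here.
    """
    event = json.loads(raw_body)
    if event.get("event") != "run.verified":
        return None  # ignore queued/running/committed/proving updates
    return event["data"]["verify_url"]

body = json.dumps({
    "event": "run.verified",
    "data": {
        "run_id": "run-8f3a...",
        "verify_url": "https://benchlist.ai/verify/run-8f3a...",
    },
})
print(handle_benchlist_event(body))
```

Because every intermediate status also arrives as an event, filtering on run.verified keeps badge updates idempotent.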

Prefer a typed SDK? We ship pip install benchlist, npm i @benchlist/sdk, and a Go client — see /sdk. Prefer a CLI for CI wiring? Keep reading.

Get an API key

Free to sign up. Email verification only — no card on signup, no activation fee, no subscription. Drop your email at /submit and we’ll email you a bl_live_… Bearer key. Your first attested test is on us.

After the free test — pay as you go

$5 per attested test. Top up whenever with a credit pack (up to 33% off volume). Two payment paths, same outcome:

  • Card: /pricing → pick a pack → Stripe Checkout.
  • Crypto: /pricing → “Pay with ETH”. Send on Base (recommended, ~$0.01 gas), Ethereum L1, or Arbitrum; paste the tx hash; /v1/crypto reads the receipt on-chain and credits your account.

# Export the key and re-use anywhere
export BENCHLIST_KEY=bl_live_...

Rotate keys with POST /v1/keys/rotate. Issue scoped sub-keys per environment. Full auth reference: /api#auth.

Install the CLI

The reference runner is a single pipx-installable Python package. It wraps the benchmark runner, the committer (Merkle/hash), and the Aligned submitter.

pipx install benchlist-runner

# OR npm global
npm i -g @benchlist/cli

Verify:

benchlist --version
# benchlist-runner 1.0.2 (sp1 v4.2.3, aligned-sdk v2.1.0)

Run your first benchmark

Say you want to benchmark your LLM provider on MBPP.

export ANTHROPIC_API_KEY=sk-ant-...

benchlist run mbpp \
  --service anthropic-claude \
  --model claude-opus-4-7 \
  --runs 3 \
  --out claude-mbpp.json

The runner will:

  1. Download the pinned dataset (caches locally in ~/.benchlist/datasets/)
  2. Hash the dataset and verify it matches the methodology
  3. Query your service N times over the full problem set
  4. Score with the pinned scoring function
  5. Hash every transcript into a Merkle tree
  6. Emit a canonical run.json
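
Step 5 can be sketched in a few lines. This is a minimal Merkle construction, assuming SHA-256 leaves, canonical JSON encoding of each transcript entry, and duplicate-last-node padding on odd levels; the runner’s actual leaf encoding and tree rules are defined by the pinned methodology.

```python
import hashlib
import json

def sha256(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold a list of leaf byte strings into a single 32-byte root."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

transcript = [
    {"prompt": "p1", "response": "r1", "judge": "pass"},
    {"prompt": "p2", "response": "r2", "judge": "fail"},
    {"prompt": "p3", "response": "r3", "judge": "pass"},
]
# Canonical encoding matters: sorted keys make the leaf bytes deterministic.
leaves = [json.dumps(t, sort_keys=True).encode() for t in transcript]
print(merkle_root(leaves).hex())
```

Any change to any transcript entry changes a leaf, which changes the root — that is what makes the commitment tamper-evident.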

Publish a listing

Two options: the CLI publishes directly; the web form lets you paste JSON.

CLI

benchlist commit claude-mbpp.json
benchlist prove claude-mbpp.json --system sp1
benchlist submit claude-mbpp.json --network ethereum
# → batch_id: 0x3c5d...9a1b (waiting for verification...)
# → verified at block 22184921
benchlist publish claude-mbpp.json
# → https://benchlist.ai/verify/run-claude-mbpp-001

Web

Paste the output of benchlist prove into /submit. We verify the proof against Aligned's batch explorer and publish within 2 minutes.

Services

A service is an AI-adjacent product: an LLM API, a memory substrate, a code agent, a vector DB, etc. Each service has a stable ID (slug), a category, metadata, and a JSON schema.

Services don't host benchmark runs directly — runs reference the service by ID. This lets you update the service description or URL without invalidating historical scores.
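
The indirection above is easy to see in miniature. The field names in this record are hypothetical — the real shape comes from each service’s JSON schema — but the point is that runs hold only the stable slug, so mutable metadata can change freely.

```python
# Hypothetical service record plus a run that references it by slug.
service = {
    "id": "anthropic-claude",   # stable slug: never changes
    "category": "llm-api",
    "metadata": {
        "url": "https://anthropic.com",
        "description": "LLM API",
    },
}
run = {
    "run_id": "run-8f3a...",
    "service": service["id"],   # runs store the ID, not the record
    "benchmark": "mbpp",
}

# Editing mutable metadata leaves historical runs untouched:
service["metadata"]["description"] = "Claude family of LLM APIs"
print(run["service"])  # still the same stable ID
```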

Benchmarks

A benchmark suite is defined by two hashes:

  • datasetHash: SHA-256 of the canonical evaluation set
  • methodologyHash: SHA-256 of the runner repo at a specific commit

Change either, and you've created a new version of the benchmark. Old runs don't transfer. This prevents silent benchmark drift.
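
The versioning rule can be made concrete. In this sketch, hashing the runner repo at a commit is simplified to hashing the commit id — an assumption for illustration; the real methodologyHash covers the repo contents at that commit.

```python
import hashlib

def benchmark_version(dataset_bytes, methodology_commit):
    """A benchmark version is pinned by its (datasetHash, methodologyHash) pair."""
    dataset_hash = hashlib.sha256(dataset_bytes).hexdigest()
    methodology_hash = hashlib.sha256(methodology_commit.encode()).hexdigest()
    return (dataset_hash, methodology_hash)

v1 = benchmark_version(b"canonical eval set", "a1b2c3d")
v2 = benchmark_version(b"canonical eval set (edited)", "a1b2c3d")
print(v1 != v2)  # any dataset edit yields a new benchmark version
```

Because old runs are keyed to the old pair, a new dataset or runner commit cannot silently inherit their scores.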

Runs + commitments

A run is a specific (service, model, config) executed against a specific benchmark suite. Every run produces:

  • a score (the primary metric)
  • a breakdown (per-category subscores, if applicable)
  • a transcript (the full list of (prompt, response, judge) tuples)
  • a Merkle root over the transcript
  • a commitment = hash(datasetHash || methodologyHash || merkleRoot || score)

The commitment is what actually gets signed and submitted to Aligned.
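
The commitment formula above can be sketched directly. The byte-level encoding here — plain concatenation and a fixed-point string for the score — is an assumption for illustration; the integration spec defines the canonical encoding.

```python
import hashlib

def commitment(dataset_hash, methodology_hash, merkle_root, score):
    """hash(datasetHash || methodologyHash || merkleRoot || score).

    Score encoding is assumed to be a fixed-point decimal string; the
    canonical wire format is defined by the integration spec.
    """
    preimage = (dataset_hash + methodology_hash + merkle_root
                + f"{score:.6f}".encode())
    return hashlib.sha256(preimage).digest()

d = hashlib.sha256(b"dataset").digest()
m = hashlib.sha256(b"methodology").digest()
r = hashlib.sha256(b"merkle").digest()
c = commitment(d, m, r, 0.741)
print(c.hex())
```

Binding the score to the dataset, methodology, and transcript root means none of the four can be swapped after the fact without producing a different commitment.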

Attestors

An attestor is a runner that executes benchmarks and signs results. The reference attestor (benchlist-runner-0) is operated by Benchlist itself, but anyone can join the registry by:

  1. Running benchlist attestor init — generates an Ed25519 keypair
  2. Posting ≥ 1 ETH stake to the StakeVault contract
  3. Publishing a DNS TXT record proving ownership of their domain
  4. Submitting a PUT /attestors request with their pubkey + metadata

Misconduct (upheld disputes) slashes the stake.
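
Step 4’s registration body might look like the following. Every field name here (pubkey, domain, stake_tx, metadata) is a hypothetical sketch of the PUT /attestors payload, not the documented schema.

```python
import json

# Hypothetical PUT /attestors request body; see the attestor
# reference for the real field names.
registration = {
    "pubkey": "ed25519:3f9c...",        # from `benchlist attestor init`
    "domain": "attestor.example.com",   # must match the DNS TXT record
    "stake_tx": "0xabc1...",            # >= 1 ETH posted to StakeVault
    "metadata": {"operator": "Example Labs"},
}
body = json.dumps(registration)
print(sorted(json.loads(body).keys()))
```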

Aligned Layer

Aligned is a proof aggregation network that settles on Ethereum L1. Every commitment produced by a runner is packaged as a proof, submitted to Aligned's operator set, and verified on-chain. Once verified, the batch ID becomes the listing's credential.

See the integration spec for the wire format.