Documentation

Build with Benchlist.

A CLI, three SDKs, one canonical run format. Ship an attested score in about three minutes.

Benchlist CLI — a terminal running a benchmark submission

One API call.

Posting a test is a single HTTP request. No CLI required, no uploads, no dashboard to open. You call POST /v1/run with a (service, model, benchmark) tuple; we run it on a staked attestor, commit the Merkle root, submit the proof to Aligned Layer, and settle on Ethereum L1. The response includes a verify_url that goes live once the proof verifies, typically within about three minutes.

curl -X POST https://api.benchlist.ai/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service":   "anthropic-claude",
    "model":     "claude-opus-4-7",
    "benchmark": "mbpp",
    "runs":      3
  }'

# → 202 Accepted
# {
#   "run_id":    "run-8f3a...",
#   "status":    "queued",
#   "est_seconds": 180,
#   "charge":    { "credits": 1, "usd": 5.00 },
#   "verify_url": "https://benchlist.ai/verify/run-8f3a..."
# }

That’s the whole flow. The run’s status advances through queued → running → committed → proving → verified. Subscribe to a webhook on run.verified to ship badges or trigger downstream jobs. Each run costs $5, deducted from your credit balance; that price includes the Ethereum mainnet gas your proof settles under.
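
The webhook flow above can be sketched as a small handler. The payload shape here (the event, run_id, and verify_url fields) is an assumption for illustration; check the webhook reference for the real schema.

```python
import json

def handle_benchlist_event(raw_body):
    """Handle a Benchlist webhook delivery.

    Returns the verify URL on run.verified, or None for intermediate
    status events. The field names are assumed, not documented here.
    """
    event = json.loads(raw_body)
    if event.get("event") != "run.verified":
        return None  # ignore queued/running/committed/proving updates
    return event["data"]["verify_url"]

body = json.dumps({
    "event": "run.verified",
    "data": {
        "run_id": "run-8f3a...",
        "verify_url": "https://benchlist.ai/verify/run-8f3a...",
    },
})
print(handle_benchlist_event(body))
```

Because every intermediate status also arrives as an event, filtering on run.verified keeps badge updates idempotent.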

Prefer a typed SDK? We ship pip install benchlist, npm i @benchlist/sdk, and a Go client — see /sdk. Prefer a CLI for CI wiring? Keep reading.

Get an API key

Free to sign up. Email verification only — no card on signup, no activation fee, no subscription. Drop your email at /submit and we’ll email you a bl_live_… Bearer key. Your first attested test is on us.

After the free test — pay as you go

$5 per attested test. Top up whenever with a credit pack (up to 33% off volume). Two payment paths, same outcome:

  • Card: /pricing → pick a pack → Stripe Checkout.
  • Crypto: /pricing → “Pay with ETH”. Send on Base (recommended, ~$0.01 gas), Ethereum L1, or Arbitrum; paste the tx hash; /v1/crypto reads the receipt on-chain and credits your account.

# Export the key and re-use anywhere
export BENCHLIST_KEY=bl_live_...

Rotate keys with POST /v1/keys/rotate. Issue scoped sub-keys per environment. Full auth reference: /api#auth.

Install the CLI

The reference runner is a single pipx-installable Python package. It wraps the benchmark runner, the committer (Merkle/hash), and the Aligned submitter.

pipx install benchlist-runner

# OR npm global
npm i -g @benchlist/cli

Verify:

benchlist --version
# benchlist-runner 1.0.2 (sp1 v4.2.3, aligned-sdk v2.1.0)

Run your first benchmark

Say you want to benchmark your LLM provider on MBPP.

export ANTHROPIC_API_KEY=sk-ant-...

benchlist run mbpp \
  --service anthropic-claude \
  --model claude-opus-4-7 \
  --runs 3 \
  --out claude-mbpp.json

The runner will:

  1. Download the pinned dataset (caches locally in ~/.benchlist/datasets/)
  2. Hash the dataset and verify it matches the methodology
  3. Query your service N times over the full problem set
  4. Score with the pinned scoring function
  5. Hash every transcript into a Merkle tree
  6. Emit a canonical run.json
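
Step 5 can be sketched in a few lines. This is a minimal Merkle construction, assuming SHA-256 leaves, canonical JSON encoding of each transcript entry, and duplicate-last-node padding on odd levels; the runner’s actual leaf encoding and tree rules are defined by the pinned methodology.

```python
import hashlib
import json

def sha256(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold a list of leaf byte strings into a single 32-byte root."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

transcript = [
    {"prompt": "p1", "response": "r1", "judge": "pass"},
    {"prompt": "p2", "response": "r2", "judge": "fail"},
    {"prompt": "p3", "response": "r3", "judge": "pass"},
]
# Canonical encoding matters: sorted keys make the leaf bytes deterministic.
leaves = [json.dumps(t, sort_keys=True).encode() for t in transcript]
print(merkle_root(leaves).hex())
```

Any change to any transcript entry changes a leaf, which changes the root — that is what makes the commitment tamper-evident.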

Publish a listing

Two options: the CLI publishes directly; the web form lets you paste JSON.

CLI

benchlist commit claude-mbpp.json
benchlist prove claude-mbpp.json --system sp1
benchlist submit claude-mbpp.json --network ethereum
# → batch_id: 0x3c5d...9a1b (waiting for verification...)
# → verified at block 22184921
benchlist publish claude-mbpp.json
# → https://benchlist.ai/verify/run-claude-mbpp-001

Web

Paste the output of benchlist prove into /submit. We verify the proof against Aligned's batch explorer and publish within 2 minutes.

Services

A service is an AI-adjacent product: an LLM API, a memory substrate, a code agent, a vector DB, etc. Each service has a stable ID (slug), a category, metadata, and a JSON schema.

Services don't host benchmark runs directly — runs reference the service by ID. This lets you update the service description or URL without invalidating historical scores.
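
The indirection above is easy to see in miniature. The field names in this record are hypothetical — the real shape comes from each service’s JSON schema — but the point is that runs hold only the stable slug, so mutable metadata can change freely.

```python
# Hypothetical service record plus a run that references it by slug.
service = {
    "id": "anthropic-claude",   # stable slug: never changes
    "category": "llm-api",
    "metadata": {
        "url": "https://anthropic.com",
        "description": "LLM API",
    },
}
run = {
    "run_id": "run-8f3a...",
    "service": service["id"],   # runs store the ID, not the record
    "benchmark": "mbpp",
}

# Editing mutable metadata leaves historical runs untouched:
service["metadata"]["description"] = "Claude family of LLM APIs"
print(run["service"])  # still the same stable ID
```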

Benchmarks

A benchmark suite is defined by two hashes:

  • datasetHash: SHA-256 of the canonical evaluation set
  • methodologyHash: SHA-256 of the runner repo at a specific commit

Change either, and you've created a new version of the benchmark. Old runs don't transfer. This prevents silent benchmark drift.
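
The versioning rule can be made concrete. In this sketch, hashing the runner repo at a commit is simplified to hashing the commit id — an assumption for illustration; the real methodologyHash covers the repo contents at that commit.

```python
import hashlib

def benchmark_version(dataset_bytes, methodology_commit):
    """A benchmark version is pinned by its (datasetHash, methodologyHash) pair."""
    dataset_hash = hashlib.sha256(dataset_bytes).hexdigest()
    methodology_hash = hashlib.sha256(methodology_commit.encode()).hexdigest()
    return (dataset_hash, methodology_hash)

v1 = benchmark_version(b"canonical eval set", "a1b2c3d")
v2 = benchmark_version(b"canonical eval set (edited)", "a1b2c3d")
print(v1 != v2)  # any dataset edit yields a new benchmark version
```

Because old runs are keyed to the old pair, a new dataset or runner commit cannot silently inherit their scores.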

Runs + commitments

A run is a specific (service, model, config) executed against a specific benchmark suite. Every run produces:

  • a score (the primary metric)
  • a breakdown (per-category subscores, if applicable)
  • a transcript (the full list of (prompt, response, judge) tuples)
  • a Merkle root over the transcript
  • a commitment = hash(datasetHash || methodologyHash || merkleRoot || score)

The commitment is what actually gets signed and submitted to Aligned.
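
The commitment formula above can be sketched directly. The byte-level encoding here — plain concatenation and a fixed-point string for the score — is an assumption for illustration; the integration spec defines the canonical encoding.

```python
import hashlib

def commitment(dataset_hash, methodology_hash, merkle_root, score):
    """hash(datasetHash || methodologyHash || merkleRoot || score).

    Score encoding is assumed to be a fixed-point decimal string; the
    canonical wire format is defined by the integration spec.
    """
    preimage = (dataset_hash + methodology_hash + merkle_root
                + f"{score:.6f}".encode())
    return hashlib.sha256(preimage).digest()

d = hashlib.sha256(b"dataset").digest()
m = hashlib.sha256(b"methodology").digest()
r = hashlib.sha256(b"merkle").digest()
c = commitment(d, m, r, 0.741)
print(c.hex())
```

Binding the score to the dataset, methodology, and transcript root means none of the four can be swapped after the fact without producing a different commitment.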

Attestors

An attestor is a runner that executes benchmarks and signs results. The reference attestor (benchlist-runner-0) is operated by Benchlist itself, but anyone can join the registry by:

  1. Running benchlist attestor init — generates an Ed25519 keypair
  2. Posting ≥ 1 ETH stake to the StakeVault contract
  3. Publishing a DNS TXT record proving ownership of their domain
  4. Submitting a PUT /attestors request with their pubkey + metadata

Misconduct (upheld disputes) slashes the stake.
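
Step 4’s registration body might look like the following. Every field name here (pubkey, domain, stake_tx, metadata) is a hypothetical sketch of the PUT /attestors payload, not the documented schema.

```python
import json

# Hypothetical PUT /attestors request body; see the attestor
# reference for the real field names.
registration = {
    "pubkey": "ed25519:3f9c...",        # from `benchlist attestor init`
    "domain": "attestor.example.com",   # must match the DNS TXT record
    "stake_tx": "0xabc1...",            # >= 1 ETH posted to StakeVault
    "metadata": {"operator": "Example Labs"},
}
body = json.dumps(registration)
print(sorted(json.loads(body).keys()))
```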

Aligned Layer

Aligned is a proof aggregation network that settles on Ethereum L1. Every commitment produced by a runner is packaged as a proof, submitted to Aligned's operator set, and verified on-chain. Once verified, the batch ID becomes the listing's credential.

See the integration spec for the wire format.