Devtool Arena: the coding agent API leaderboard

Devtool Arena is a free, public leaderboard that benchmarks developer tools — APIs, SDKs, CLIs, and MCP servers — against real AI coding agents. Sapient runs live, end-to-end evaluations and publishes the results so API teams can see exactly how their tools rank, where they fall short, and how they compare to competitors. The leaderboard is updated multiple times per week, and each row shows a timestamp so you always know how fresh the data is.

Three leaderboard types

API leaderboard

Ranks APIs evaluated directly by Claude Code using live production API keys. This is the primary leaderboard and has the broadest coverage: 85+ tools across 17 categories.

CLI leaderboard

Ranks developer tools by how well agents discover and use their command-line interfaces end-to-end. Covers 75 tools.

MCP leaderboard

Ranks tools with Model Context Protocol servers. Tests whether agents can use the MCP server to complete real tasks. Covers 58 tools.

Who uses the leaderboard

API and developer tool teams use Devtool Arena to answer questions they previously had no data for: Does my API work when Claude Code tries to use it? How do I rank against my competitors? What does an agent actually do when it encounters my documentation? Teams monitor their score after shipping documentation changes, MCP servers, or llms.txt files to measure the impact directly.
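One way a team might quantify the impact of a change is to record its own score snapshots over time and compare the averages before and after a ship date. The sketch below assumes you keep those snapshots yourself as `(date, score)` pairs; the function name `score_change` and the data layout are hypothetical and not part of any Sapient API.

```python
from datetime import date
from statistics import mean


def score_change(snapshots, shipped_on):
    """Return mean score on/after shipped_on minus mean score before it.

    snapshots: list of (date, score) pairs recorded by the team (hypothetical layout).
    Returns None when there is no data on one side of the ship date.
    """
    before = [score for day, score in snapshots if day < shipped_on]
    after = [score for day, score in snapshots if day >= shipped_on]
    if not before or not after:
        return None
    return mean(after) - mean(before)


# Illustrative values only, not real leaderboard data.
history = [
    (date(2025, 1, 1), 60.0),
    (date(2025, 1, 8), 62.0),
    (date(2025, 1, 15), 70.0),
    (date(2025, 1, 22), 72.0),
]
delta = score_change(history, shipped_on=date(2025, 1, 10))
print(f"Average score change after shipping: {delta:+.1f}")
```

Averaging several snapshots on each side, rather than comparing two single readings, smooths out run-to-run variation between evaluations.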

Categories covered

The leaderboard spans 17 categories of developer tools:

Inference

LLM inference APIs — Groq, OpenRouter, Cerebras, Fireworks AI, SambaNova, and more.

Auth

Authentication and identity platforms — Clerk, Auth0, WorkOS, Descope, Stytch, Scalekit.

Payment

Payment processing and billing APIs — Stripe, PayPal, Razorpay, Mollie, Paddle, Chargebee, LemonSqueezy, Airwallex.

Search

Search and web scraping APIs — Firecrawl, Jina AI, Tavily, Exa, You.com, Brave Search.

Voice STT

Speech-to-text APIs — AssemblyAI, Deepgram, OpenAI Whisper, Rev AI, Speechmatics.

Voice TTS

Text-to-speech APIs — ElevenLabs, Cartesia, Rime, LMNT, Resemble AI.

Voice Telephony

Programmable voice and telephony platforms — Twilio, Plivo, Telnyx, Vonage, Sinch, Infobip.

Voice Infra

Real-time voice infrastructure — LiveKit, Daily, Agora, Pipecat, 100ms.

Sandboxes

Cloud execution environments — E2B, Daytona, Modal, Vercel, Cloudflare Workers, Sprites.

Vector Databases

Vector search and embedding stores — Chroma, Pinecone, Qdrant, LanceDB, Weaviate, Zilliz Cloud.

Stablecoin

Crypto payments and stablecoin APIs — Circle, Coinbase Payments, DFNS, Fireblocks, Triple-A.

Meeting Bot

Meeting recording and transcription APIs — Recall.ai, Meeting BaaS, MeetGeek, Meetstream, CueMeet.

Durable Workflow

Workflow orchestration platforms — Temporal, Prefect, Restate, Camunda, Netflix Conductor, AWS Step Functions, Akka, Cadence.

Cloud Hosting

Cloud deployment platforms — Vercel, Railway, Render, Netlify, DigitalOcean, Cloudflare.

Observability

Monitoring and metrics APIs — Datadog.

Email

Programmatic email APIs — Resend, Agentmail.

How to access the leaderboard

Visit usesapient.com/leaderboard to view the full table. Click or tap any row to explore the detailed eval report for that tool. Use the category filters at the top to narrow the view to a specific vertical, and toggle Skipped to show entries where evaluation was not possible (for example, because no public API key was available). The leaderboard displays a Last updated timestamp so you can see exactly when the data was refreshed.
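The category filter and Skipped toggle described above can be modeled as a simple predicate over rows. This is a minimal sketch, assuming each row is a dictionary with `tool`, `category`, and `score` keys, that a skipped entry has a `score` of `None`, and that skipped entries are hidden until the toggle is switched on. The field names, the tool names, and the function `visible_rows` are hypothetical and do not reflect the site's actual implementation.

```python
def visible_rows(rows, category=None, show_skipped=False):
    """Return the rows that would remain visible under the given filters.

    category: keep only rows in this category; None keeps every category.
    show_skipped: when False, rows without a score (skipped) are hidden.
    """
    result = []
    for row in rows:
        if category is not None and row["category"] != category:
            continue
        if row["score"] is None and not show_skipped:
            continue
        result.append(row)
    return result


# Illustrative rows only; the tool names and scores are made up.
rows = [
    {"tool": "tool-a", "category": "Auth", "score": 80.0},
    {"tool": "tool-b", "category": "Auth", "score": None},  # skipped entry
    {"tool": "tool-c", "category": "Email", "score": 60.0},
]
for row in visible_rows(rows, category="Auth", show_skipped=True):
    print(row["tool"], row["score"])
```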

How to read the leaderboard table

Each row in the leaderboard represents one tool evaluated by Claude Code. Here is what each column means:

Column	Description
Rank	Overall position on the leaderboard, sorted by Score descending.
Tool	The name and logo of the evaluated tool.
Score	Overall score from 0 to 100, combining Eval Score and Discovery Score.
Grade	Letter grade (A, B, C, D) derived from the overall score.
Eval	Eval Score — how successfully the agent completed the assigned task end-to-end.
Discovery	Discovery Score — how easily the agent found and understood the API without hand-holding.
Cost	The dollar cost in Claude API spend to run one evaluation. Lower is better.
Calls	The number of tool calls the agent made to complete the task. Fewer calls generally indicate a simpler, better-documented API.
Errors	The number of errors encountered during the evaluation run.
Time	Wall-clock time to complete the evaluation task.
C7	Whether the tool is indexed on Context7.
llms	Whether the tool has a published `llms.txt` file.
MCP	Whether the tool provides an MCP server.
SDK	Whether the tool offers an official SDK.
API	Whether the tool exposes a REST or similar API.
Skills	Whether the tool has published Agent Skills.
CLI	Whether the tool ships a CLI.

The checklist columns (C7 through CLI) are binary — they indicate presence or absence of each asset in the tool’s publicly available resources.

Entries showing a dash (—) in Score and every metric column were skipped because Sapient could not obtain a working API key. Common reasons: the vendor requires a paid plan before issuing credentials, the vendor requires company verification, or the tool is open-source or self-hosted with no hosted API available.
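The relationship between Score, Rank, and skipped entries can be sketched as a small data model. The example assumes a skipped entry is represented by a missing score and is listed after all scored rows without a rank. The `Row` class and `rank_rows` function are hypothetical, and how the real leaderboard breaks ties is not stated in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    tool: str
    score: Optional[float]        # None models a skipped entry
    calls: Optional[int] = None   # tool calls made during the run
    errors: Optional[int] = None  # errors encountered during the run


def rank_rows(rows):
    """Return (rank, row) pairs: scored rows by Score descending, then skipped rows.

    Skipped rows get a rank of None. Ties keep their input order because
    Python's sort is stable; the real tie-breaking rule is unknown.
    """
    scored = sorted(
        (row for row in rows if row.score is not None),
        key=lambda row: row.score,
        reverse=True,
    )
    skipped = [row for row in rows if row.score is None]
    ranked = [(position + 1, row) for position, row in enumerate(scored)]
    return ranked + [(None, row) for row in skipped]


# Illustrative rows only; the tool names and numbers are made up.
table = [
    Row("tool-a", 72.0, calls=14, errors=1),
    Row("tool-b", None),  # skipped: no working API key
    Row("tool-c", 91.5, calls=9, errors=0),
]
for rank, row in rank_rows(table):
    print(rank if rank is not None else "skipped", row.tool, row.score)
```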

Get your API evaluated and listed

Any API or developer tool company can request an evaluation. Submit your documentation URL or GitHub repo URL at usesapient.com/leaderboard/add and Sapient will schedule a live evaluation run.

How scoring works

Understand how Eval Score and Discovery Score combine into your overall score, and what the grade thresholds mean.

Submit your API

Learn what to prepare before submitting and how to maximize your score from day one.

Changelog

See every update to the leaderboard: new companies, new categories, and ranking changes.
