Devtool Arena is a free, public leaderboard that benchmarks developer tools (APIs, SDKs, CLIs, and MCP servers) against real AI coding agents. Sapient runs live, end-to-end evaluations and publishes the results so API teams can see exactly how their tools rank, where they fall short, and how they compare to competitors. The leaderboard is updated multiple times per week, and each row shows a timestamp so you always know how fresh the data is.
Three leaderboard types
API leaderboard
Ranks APIs evaluated directly by Claude Code using live production API keys. This is the primary leaderboard and has the broadest coverage: 85+ tools across 17 categories.
CLI leaderboard
Ranks developer tools by how well agents discover and use their command-line interfaces end-to-end. Covers 75 tools.
MCP leaderboard
Ranks tools with Model Context Protocol servers. Tests whether agents can use the MCP server to complete real tasks. Covers 58 tools.
Who uses the leaderboard
API and developer tool teams use Devtool Arena to answer questions they previously had no data for: Does my API work when Claude Code tries to use it? How do I rank against my competitors? What does an agent actually do when it encounters my documentation? Teams monitor their score after shipping documentation changes, MCP servers, or llms.txt files to measure the impact directly.
Categories covered
The leaderboard spans 17 categories of developer tools:
Inference
LLM inference APIs — Groq, OpenRouter, Cerebras, Fireworks AI, SambaNova, and more.
Auth
Authentication and identity platforms — Clerk, Auth0, WorkOS, Descope, Stytch, Scalekit.
Payment
Payment processing and billing APIs — Stripe, PayPal, Razorpay, Mollie, Paddle, Chargebee, LemonSqueezy, Airwallex.
Search
Search and web scraping APIs — Firecrawl, Jina AI, Tavily, Exa, You.com, Brave Search.
Voice STT
Speech-to-text APIs — AssemblyAI, Deepgram, OpenAI Whisper, Rev AI, Speechmatics.
Voice TTS
Text-to-speech APIs — ElevenLabs, Cartesia, Rime, LMNT, Resemble AI.
Voice Telephony
Programmable voice and telephony platforms — Twilio, Plivo, Telnyx, Vonage, Sinch, Infobip.
Voice Infra
Real-time voice infrastructure — LiveKit, Daily, Agora, Pipecat, 100ms.
Sandboxes
Cloud execution environments — E2B, Daytona, Modal, Vercel, Cloudflare Workers, Sprites.
Vector Databases
Vector search and embedding stores — Chroma, Pinecone, Qdrant, LanceDB, Weaviate, Zilliz Cloud.
Stablecoin
Crypto payments and stablecoin APIs — Circle, Coinbase Payments, DFNS, Fireblocks, Triple-A.
Meeting Bot
Meeting recording and transcription APIs — Recall.ai, Meeting BaaS, MeetGeek, Meetstream, CueMeet.
Durable Workflow
Workflow orchestration platforms — Temporal, Prefect, Restate, Camunda, Netflix Conductor, AWS Step Functions, Akka, Cadence.
Cloud Hosting
Cloud deployment platforms — Vercel, Railway, Render, Netlify, DigitalOcean, Cloudflare.
Observability
Monitoring and metrics APIs — Datadog.
Email
Programmatic email APIs — Resend, Agentmail.
How to access the leaderboard
Visit usesapient.com/leaderboard to view the full table. Click or tap any row to explore the detailed eval report for that tool. Use the category filters at the top to narrow the view to a specific vertical, and toggle Skipped to show entries where evaluation was not possible (for example, because no public API key was available). The leaderboard displays a Last updated timestamp so you can see exactly when the data was refreshed.
How to read the leaderboard table
Each row in the leaderboard represents one tool evaluated by Claude Code. Here is what each column means:

| Column | Description |
|---|---|
| Rank | Overall position on the leaderboard, sorted by Score descending. |
| Tool | The name and logo of the evaluated tool. |
| Score | Overall score from 0 to 100, combining Eval Score and Discovery Score. |
| Grade | Letter grade (A, B, C, D) derived from the overall score. |
| Eval | Eval Score — how successfully the agent completed the assigned task end-to-end. |
| Discovery | Discovery Score — how easily the agent found and understood the API without hand-holding. |
| Cost | The dollar cost in Claude API spend to run one evaluation. Lower is better. |
| Calls | The number of tool calls the agent made to complete the task. Fewer calls generally indicate a simpler, better-documented API. |
| Errors | The number of errors encountered during the evaluation run. |
| Time | Wall-clock time to complete the evaluation task. |
| C7 | Whether the tool is indexed on Context7. |
| llms | Whether the tool has a published llms.txt file. |
| MCP | Whether the tool provides an MCP server. |
| SDK | Whether the tool offers an official SDK. |
| API | Whether the tool exposes a REST or similar API. |
| Skills | Whether the tool has published Agent Skills. |
| CLI | Whether the tool ships a CLI. |
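
The ranking rule in the table (Rank is the position when rows are sorted by Score, descending) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Row` class, its field names, and the sample values are hypothetical and do not reflect any schema or data published by Sapient. A row with no score is modeled as `None` and kept out of the ranking. How ties are broken is not stated in the source; this sketch simply preserves input order for equal scores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """One leaderboard row. Field names are hypothetical and mirror the table columns."""
    tool: str
    score: Optional[float]        # overall score, 0 to 100; None models an unscored row
    calls: Optional[int] = None   # number of tool calls the agent made
    errors: Optional[int] = None  # number of errors during the run


def rank(rows: list[Row]) -> tuple[list[tuple[int, Row]], list[Row]]:
    """Return (ranked, unscored); ranked pairs a 1-based rank with each scored row."""
    scored = [r for r in rows if r.score is not None]
    unscored = [r for r in rows if r.score is None]
    # Highest score first. Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda r: r.score, reverse=True)
    ranked = list(enumerate(scored, start=1))
    return ranked, unscored


# Made-up placeholder rows, not real leaderboard data.
rows = [
    Row("tool-a", 72.0, calls=14, errors=1),
    Row("tool-b", 91.5, calls=9, errors=0),
    Row("tool-c", None),
]
ranked, unscored = rank(rows)
for position, row in ranked:
    print(position, row.tool, row.score)
print("unscored:", [r.tool for r in unscored])
```

Keeping unscored rows in a separate list, rather than assigning them a score of zero, avoids mixing tools that were never evaluated with tools that were evaluated and scored poorly.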

Entries showing "—" across Score and all metrics were skipped because Sapient could not obtain a working API key. Common reasons include requiring a paid plan before issuing credentials, requiring company verification, or being open-source or self-hosted with no hosted API available.
Get your API evaluated and listed
Any API or developer tool company can request an evaluation. Submit your documentation URL or GitHub repo URL at usesapient.com/leaderboard/add and Sapient will schedule a live evaluation run.
How scoring works
Understand how Eval Score and Discovery Score combine into your overall score, and what the grade thresholds mean.
Submit your API
Learn what to prepare before submitting and how to maximize your score from day one.
Changelog
See every update to the leaderboard: new companies, new categories, and ranking changes.