Devtool Arena is a free, public leaderboard that benchmarks developer tools — APIs, SDKs, CLIs, and MCP servers — against real AI coding agents. Sapient runs live, end-to-end evaluations and publishes the results so API teams can see exactly how their tools rank, where they fall short, and how they compare to competitors. The leaderboard is updated multiple times per week, and each row shows a timestamp so you always know how fresh the data is.

Three leaderboard types

API leaderboard

Ranks APIs evaluated directly by Claude Code using live production API keys. This is the primary leaderboard and has the broadest coverage: 85+ tools across 17 categories.

CLI leaderboard

Ranks developer tools by how well agents discover and use their command-line interfaces end-to-end. Covers 75 tools.

MCP leaderboard

Ranks tools that provide Model Context Protocol (MCP) servers by testing whether agents can use the MCP server to complete real tasks. Covers 58 tools.

Who uses the leaderboard

API and developer tool teams use Devtool Arena to answer questions they previously had no data for: Does my API work when Claude Code tries to use it? How do I rank against my competitors? What does an agent actually do when it encounters my documentation? Teams monitor their score after shipping documentation changes, MCP servers, or llms.txt files to measure the impact directly.

Categories covered

The leaderboard spans 17 categories of developer tools:

Inference

LLM inference APIs — Groq, OpenRouter, Cerebras, Fireworks AI, SambaNova, and more.

Auth

Authentication and identity platforms — Clerk, Auth0, WorkOS, Descope, Stytch, Scalekit.

Payment

Payment processing and billing APIs — Stripe, PayPal, Razorpay, Mollie, Paddle, Chargebee, LemonSqueezy, Airwallex.

Search

Search and web scraping APIs — Firecrawl, Jina AI, Tavily, Exa, You.com, Brave Search.

Voice STT

Speech-to-text APIs — AssemblyAI, Deepgram, OpenAI Whisper, Rev AI, Speechmatics.

Voice TTS

Text-to-speech APIs — ElevenLabs, Cartesia, Rime, LMNT, Resemble AI.

Voice Telephony

Programmable voice and telephony platforms — Twilio, Plivo, Telnyx, Vonage, Sinch, Infobip.

Voice Infra

Real-time voice infrastructure — LiveKit, Daily, Agora, Pipecat, 100ms.

Sandboxes

Cloud execution environments — E2B, Daytona, Modal, Vercel, Cloudflare Workers, Sprites.

Vector Databases

Vector search and embedding stores — Chroma, Pinecone, Qdrant, LanceDB, Weaviate, Zilliz Cloud.

Stablecoin

Crypto payments and stablecoin APIs — Circle, Coinbase Payments, DFNS, Fireblocks, Triple-A.

Meeting Bot

Meeting recording and transcription APIs — Recall.ai, Meeting BaaS, MeetGeek, Meetstream, CueMeet.

Durable Workflow

Workflow orchestration platforms — Temporal, Prefect, Restate, Camunda, Netflix Conductor, AWS Step Functions, Akka, Cadence.

Cloud Hosting

Cloud deployment platforms — Vercel, Railway, Render, Netlify, DigitalOcean, Cloudflare.

Observability

Monitoring and metrics APIs — Datadog.

Email

Programmatic email APIs — Resend, Agentmail.

How to access the leaderboard

Visit usesapient.com/leaderboard to view the full table. Click or tap any row to explore the detailed eval report for that tool. Use the category filters at the top to narrow the view to a specific vertical, and toggle Skipped to show entries where evaluation was not possible (for example, because no public API key was available). The leaderboard displays a Last updated timestamp so you can see exactly when the data was refreshed.
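The category filter and the Skipped toggle can be pictured with a small sketch. This is only an illustration of the behavior described above, not Sapient's implementation: the `Entry` fields, the `visible_entries` function, the placeholder tool names, and the sample scores are all hypothetical. It also assumes that a skipped entry carries no score and that turning the toggle on adds skipped entries to the view rather than showing them exclusively.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    tool: str                # placeholder name, not a real leaderboard entry
    category: str            # one of the leaderboard categories, e.g. "Auth"
    score: Optional[float]   # None models a skipped entry (assumption)


def visible_entries(
    entries: List[Entry],
    category: Optional[str] = None,
    show_skipped: bool = False,
) -> List[Entry]:
    """Return the entries that stay visible under the given filters."""
    result = []
    for entry in entries:
        # Category filter: keep only the selected vertical, if one is chosen.
        if category is not None and entry.category != category:
            continue
        # Skipped toggle: hide entries without a score unless it is switched on.
        if entry.score is None and not show_skipped:
            continue
        result.append(entry)
    return result


# Made-up sample data for illustration only.
entries = [
    Entry("tool-a", "Auth", 82.0),
    Entry("tool-b", "Auth", None),
    Entry("tool-c", "Payment", 67.5),
]

auth_only = visible_entries(entries, category="Auth")
auth_with_skipped = visible_entries(entries, category="Auth", show_skipped=True)
```

With the sample data, `auth_only` contains just `tool-a`, while `auth_with_skipped` also includes the skipped `tool-b`.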

How to read the leaderboard table

Each row in the leaderboard represents one tool evaluated by Claude Code. Here is what each column means:
Column      Description
Rank        Overall position on the leaderboard, sorted by Score descending.
Tool        The name and logo of the evaluated tool.
Score       Overall score from 0 to 100, combining Eval Score and Discovery Score.
Grade       Letter grade (A, B, C, D) derived from the overall score.
Eval        Eval Score: how successfully the agent completed the assigned task end-to-end.
Discovery   Discovery Score: how easily the agent found and understood the API without hand-holding.
Cost        The dollar cost in Claude API spend to run one evaluation. Lower is better.
Calls       The number of tool calls the agent made to complete the task. Fewer calls generally indicate a simpler, better-documented API.
Errors      The number of errors encountered during the evaluation run.
Time        Wall-clock time to complete the evaluation task.
C7          Whether the tool is indexed on Context7.
llms        Whether the tool has a published llms.txt file.
MCP         Whether the tool provides an MCP server.
SDK         Whether the tool offers an official SDK.
API         Whether the tool exposes a REST or similar API.
Skills      Whether the tool has published Agent Skills.
CLI         Whether the tool ships a CLI.
The checklist columns (C7 through CLI) are binary — they indicate presence or absence of each asset in the tool’s publicly available resources.
Entries that show no values for Score and the other metrics were skipped because Sapient could not obtain a working API key. Common reasons include requiring a paid plan before issuing credentials, requiring company verification, or being open-source or self-hosted with no hosted API available.
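To make the Rank column concrete, here is a minimal sketch of ranking rows by Score in descending order while leaving skipped entries unranked. It is an illustration under stated assumptions, not Sapient's code: the dictionary keys, the `rank_rows` function, the placeholder tool names, and the sample scores are hypothetical, and the source does not say how ties are broken or where skipped entries are placed (this sketch keeps input order for ties and lists skipped rows last).

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, Optional[float], Optional[int]]]


def rank_rows(rows: List[Row]) -> List[Row]:
    """Assign 1-based ranks by score, highest first; skipped rows get no rank."""
    scored = [r for r in rows if r["score"] is not None]
    skipped = [r for r in rows if r["score"] is None]

    # sorted() is stable, so rows with equal scores keep their input order
    # (an assumption; the real tie-breaking rule is not documented here).
    scored = sorted(scored, key=lambda r: r["score"], reverse=True)

    ranked = [{**r, "rank": i} for i, r in enumerate(scored, start=1)]
    ranked += [{**r, "rank": None} for r in skipped]
    return ranked


# Made-up sample data for illustration only.
rows = [
    {"tool": "tool-a", "score": 71.0},
    {"tool": "tool-b", "score": None},   # skipped: no working API key
    {"tool": "tool-c", "score": 93.5},
]

ranked = rank_rows(rows)
order = [(r["tool"], r["rank"]) for r in ranked]
```

With the sample data, `tool-c` takes rank 1, `tool-a` takes rank 2, and the skipped `tool-b` is listed last without a rank.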

Get your API evaluated and listed

Any API or developer tool company can request an evaluation. Submit your documentation URL or GitHub repo URL at usesapient.com/leaderboard/add and Sapient will schedule a live evaluation run.

How scoring works

Understand how Eval Score and Discovery Score combine into your overall score, and what the grade thresholds mean.

Submit your API

Learn what to prepare before submitting and how to maximize your score from day one.

Changelog

See every update to the leaderboard: new companies, new categories, and ranking changes.