Coding LLM Benchmark

Benchmarks

Each column on the dashboard, where it comes from, and what it measures:

  • LiveCodeBench
    Source: livecodebench.github.io (via Vellum, PricePerToken)
    Measures: Competitive programming problems from Codeforces, LeetCode, and AtCoder, using fresh problems to prevent contamination. Score = pass rate %.
  • Aider Polyglot
    Source: aider.chat/docs/leaderboards
    Measures: 225 Exercism coding exercises across C++, Go, Java, JavaScript, Python, and Rust; real-world code editing with two attempts per exercise. Score = % correct.
  • SWE-bench Verified
    Source: swebench.com (via Vellum)
    Measures: Whether the model can resolve real GitHub issues; agentic reasoning and multi-file editing. Score = % resolved.
  • BFCL
    Source: gorilla.cs.berkeley.edu/leaderboard (BFCL V4, Dec 2025)
    Measures: Berkeley Function Calling Leaderboard: tool/function-calling accuracy (AST, execution, relevance). Important for agent/MCP workflows. Score = overall accuracy %.
  • Code Arena Elo
    Source: arena.ai/leaderboard/code
    Measures: Human preference voting (159,247 votes as of Feb 16, 2026, across 42 models); users compare two models blind on WebDev coding tasks. Score = Elo rating.

Composite sorting (Best coding assistant / Best agentic)

On the dashboard, two sort buttons combine several benchmarks into a single score so you can quickly compare models by use case:

  • Best coding assistant — Sorts by a composite of LiveCodeBench and Aider Polyglot. When both scores exist we use their average; when only one exists we use that value. Best for picking a model for day-to-day code completion and editing (competitive programming + real-world exercises).
  • Best agentic — Sorts by a composite of SWE-bench Verified and BFCL. Again we use the average when both exist, otherwise the single available score. Best for agent-style workflows: multi-step reasoning, resolving real issues, and tool/function calling.

Models with no data for a given composite always appear at the bottom of the list when that sort is active.
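For concreteness, here is a minimal TypeScript sketch of how this composite-and-sort logic could be implemented. The ModelRow interface, its field names, and the function names are illustrative assumptions, not the dashboard's actual code.

```ts
// Hypothetical shape of one row in the dashboard table (field names are assumptions).
interface ModelRow {
  name: string;
  liveCodeBench?: number;    // pass rate %
  aiderPolyglot?: number;    // % correct
  sweBenchVerified?: number; // % resolved
  bfcl?: number;             // overall accuracy %
}

// Average the scores that exist; return null when neither is available.
function composite(a?: number, b?: number): number | null {
  const present = [a, b].filter((v): v is number => v !== undefined);
  if (present.length === 0) return null;
  return present.reduce((sum, v) => sum + v, 0) / present.length;
}

const codingScore = (m: ModelRow) => composite(m.liveCodeBench, m.aiderPolyglot);
const agenticScore = (m: ModelRow) => composite(m.sweBenchVerified, m.bfcl);

// Sort descending by the chosen composite; rows with no data go to the bottom.
function sortByComposite(rows: ModelRow[], score: (m: ModelRow) => number | null): ModelRow[] {
  return [...rows].sort((x, y) => {
    const sx = score(x), sy = score(y);
    if (sx === null && sy === null) return 0;
    if (sx === null) return 1;   // x has no data, so it sorts after y
    if (sy === null) return -1;  // y has no data, so it sorts after x
    return sy - sx;
  });
}
```

Under these assumptions, the "Best coding assistant" button would call sortByComposite(rows, codingScore) and "Best agentic" would call sortByComposite(rows, agenticScore).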

Pricing

Prices are in USD per 1 million tokens (input / output), sourced from pricepertoken.com and provider pricing pages. Prices shown are standard API rates (not batch/cached). Checked February 2026.

Model selection

We selected 20 models that appear in the top tiers across multiple benchmarks and are actively used for coding. The selection favors:

  • Models from major providers (Anthropic, OpenAI, Google, xAI, DeepSeek, Moonshot, MiniMax, Mistral)
  • Both flagship and cost-effective options per provider
  • Models available on Cursor and/or OpenRouter
  • Geographic diversity (USA, China, France)

Data compilation

Data was compiled on February 18, 2026. Benchmark snapshots: Vellum aggregations of LiveCodeBench, SWE-bench, Aider, and BFCL (Nov 2025); the Aider Polyglot leaderboard (2025); the Code Arena leaderboard as of Feb 16, 2026 (159,247 votes); BFCL V4 (Dec 2025); pricing from pricepertoken.com (Feb 2026). Newer models such as Claude Opus 4.6 may have incomplete benchmark data.

Value column

The Value column is our recommendation balancing score and price. We use the coding composite (the LiveCodeBench + Aider Polyglot average) as the score and the average of input and output $ per 1M tokens as the price, then compute score ÷ price. Models are tiered by quartile: the top 25% are marked Best value, the next 25% Good value, then Mid and Low. Models with no coding score or no price show —. You can sort the table by Value to see the best bang for the buck first.
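As a sketch only, here is how that Value computation could look in TypeScript, assuming each row already carries its coding composite and per-1M-token prices. ValueInput, its field names, and valueTiers are hypothetical names, not the dashboard's actual code.

```ts
// Minimal sketch of the Value tiering under the assumptions stated above.
interface ValueInput {
  name: string;
  codingScore: number | null; // LiveCodeBench + Aider composite
  inputPrice: number | null;  // USD per 1M input tokens
  outputPrice: number | null; // USD per 1M output tokens
}

type Tier = "Best value" | "Good value" | "Mid" | "Low" | "—";

function valueTiers(rows: ValueInput[]): Map<string, Tier> {
  // score ÷ price, where price is the average of input and output $/1M.
  const ratios = new Map<string, number>();
  for (const r of rows) {
    if (r.codingScore === null || r.inputPrice === null || r.outputPrice === null) continue;
    const price = (r.inputPrice + r.outputPrice) / 2;
    if (price > 0) ratios.set(r.name, r.codingScore / price);
  }

  // Rank by ratio (best first) and assign quartile-based tiers.
  const ranked = [...ratios.entries()].sort((a, b) => b[1] - a[1]).map(([name]) => name);
  const tiers = new Map<string, Tier>();
  for (const r of rows) tiers.set(r.name, "—"); // default: no coding score or no price
  ranked.forEach((name, i) => {
    const q = (i + 1) / ranked.length; // rank position as a fraction of scored models
    tiers.set(name, q <= 0.25 ? "Best value" : q <= 0.5 ? "Good value" : q <= 0.75 ? "Mid" : "Low");
  });
  return tiers;
}
```

As a hypothetical worked example: a model with a coding composite of 70 and prices of $3 / $15 per 1M tokens would have a price of $9 and a ratio of 70 ÷ 9 ≈ 7.8; its tier then depends on where that ratio ranks among the other scored models.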
