AI Coding Agents Leaderboard

A benchmark of coding LLMs across five task categories.

What each category measures:

  • Issue Resolution — Fixing bugs in real GitHub issues (SWE-Bench)
  • Frontend — UI development with visual context (SWE-Bench Multimodal)
  • Greenfield — Building new applications from scratch (Commit0)
  • Testing — Test generation and quality (SWT-Bench)
  • Information Gathering — Research and information retrieval (GAIA)

Leaderboard

Category cells show success rate (avg cost per run). Rows are sorted by Avg Score, descending. Context is the model's context window in tokens; "—" means not reported.

| Model | Avg Score | Avg Cost / run | Avg Runtime | Context | Issue Resolution | Frontend | Greenfield | Testing | Info Gathering |
|---|---|---|---|---|---|---|---|---|---|
| Claude-Opus-4-6 | 66.7 | $1.14 | 410s | 200K | 76.8% ($0.77) | 41.8% ($2.37) | 56.2% ($1.70) | 78.8% ($0.43) | 80.0% ($0.44) |
| GPT-5.4 | 63.8 | $1.82 | 372s | — | 75.6% ($1.36) | 36.8% ($4.24) | 56.2% ($2.19) | 71.4% ($0.55) | 78.8% ($0.74) |
| Claude-Opus-4-5 | 60.6 | $1.77 | 378s | 200K | 76.6% ($1.82) | 41.2% ($2.54) | 37.5% ($2.54) | 78.5% ($1.38) | 69.1% ($0.55) |
| GPT-5.2-Codex | 59.5 | $1.43 | 819s | 400K | 73.8% ($0.94) | 35.9% ($2.97) | 50.0% ($2.02) | 67.0% ($0.66) | 70.9% ($0.55) |
| Qwen3.6-Plus | 57.9 | $2.12 | 819s | — | 74.2% ($1.52) | 30.9% ($2.27) | 50.0% ($4.40) | 62.1% ($2.04) | 72.1% ($0.34) |
| GPT-5.2 | 56.3 | $1.20 | 596s | 400K | 74.6% ($0.86) | 30.9% ($2.77) | 37.5% ($1.34) | 73.2% ($0.56) | 65.5% ($0.48) |
| Gemini-3.1-Pro | 55.7 | $0.80 | 883s | 10M | 75.4% ($0.63) | 44.1% ($1.24) | 18.8% ($1.52) | 64.0% ($0.50) | 76.4% ($0.12) |
| Claude-Sonnet-4-5 | 53.0 | $1.57 | 583s | 200K | 74.2% ($1.19) | 36.8% ($1.89) | 12.5% ($2.90) | 68.8% ($0.98) | 72.7% ($0.87) |
| GLM-5 | 49.4 | $0.97 | 1323s | — | 73.4% ($1.06) | 35.3% ($0.58) | 31.2% ($1.96) | 47.3% ($0.91) | 60.0% ($0.36) |
| Kimi-K2.5 | 49.2 | $1.09 | 854s | 256K | 68.8% ($0.48) | 32.8% ($1.58) | 18.8% ($2.86) | 61.9% ($0.42) | 63.6% ($0.13) |
| Gemini-3-Pro | 49.0 | $1.42 | 1091s | 10M | 70.6% ($0.95) | 36.8% ($1.46) | 25.0% ($3.18) | 68.6% ($1.01) | 44.2% ($0.50) |
| Gemini-3-Flash | 49.0 | $0.64 | 726s | 1M | 74.6% ($0.42) | 22.1% ($0.80) | 18.8% ($1.28) | 70.7% ($0.30) | 58.8% ($0.38) |
| MiniMax-M2.5 | 46.5 | $0.13 | 715s | 1M | 72.6% ($0.10) | 25.0% ($0.15) | 18.8% ($0.29) | 68.1% ($0.07) | 47.9% ($0.02) |
| MiniMax-2.7 | 44.6 | $0.36 | 1032s | — | 75.6% ($0.17) | 27.9% ($0.33) | 25.0% ($0.94) | 69.1% ($0.13) | 25.5% ($0.21) |
| DeepSeek-V3.2 | 44.4 | $0.13 | 1124s | 128K | 71.6% ($0.16) | 27.9% ($0.19) | 18.8% ($0.12) | 53.6% ($0.12) | 50.3% ($0.06) |
| Claude-Sonnet-4-6 | 43.3 | $1.29 | 501s | 1M | 74.4% ($1.03) | 30.9% ($2.24) | 43.8% ($1.88) | 54.0% ($0.87) | 13.3% ($0.41) |
| GLM-4.7 | 41.0 | $0.44 | 968s | 203K | 73.4% ($0.56) | 22.1% ($0.66) | 6.2% ($0.47) | 49.4% ($0.37) | 53.9% ($0.15) |
| MiniMax-M2.1 | 39.9 | $0.22 | 1211s | 197K | 68.8% ($0.14) | 16.2% ($0.21) | 12.5% ($0.61) | 61.4% ($0.11) | 40.6% ($0.02) |
| Kimi-K2-Thinking | 39.7 | $1.76 | 1484s | 256K | 69.2% ($2.00) | 32.4% ($2.31) | 6.2% ($2.47) | 47.3% ($1.39) | 43.6% ($0.65) |
| Qwen3-Coder-480B | 30.9 | $0.92 | 502s | 262K | 62.4% ($1.26) | 23.5% ($2.09) | 0.0% ($0.01) | 34.9% ($0.97) | 33.9% ($0.28) |
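The Avg Score and Avg Cost columns are consistent with unweighted means of the five per-category values (e.g., for Claude-Opus-4-6: (76.8 + 41.8 + 56.2 + 78.8 + 80.0) / 5 = 66.7). A minimal sketch of that aggregation, assuming equal category weighting (the category names and values are taken from the table; the weighting itself is an inference, not stated by the source):

```python
# Recompute the aggregate columns from the five per-category cells.
# Values come from the Claude-Opus-4-6 row; each category maps to
# (success_rate_percent, avg_cost_usd_per_run).
categories = {
    "Issue Resolution": (76.8, 0.77),
    "Frontend":         (41.8, 2.37),
    "Greenfield":       (56.2, 1.70),
    "Testing":          (78.8, 0.43),
    "Info Gathering":   (80.0, 0.44),
}

# Unweighted mean over categories (assumed aggregation scheme).
avg_score = sum(score for score, _ in categories.values()) / len(categories)
avg_cost = sum(cost for _, cost in categories.values()) / len(categories)

print(f"Avg Score: {avg_score:.1f}")  # 66.7, matching the table
print(f"Avg Cost: ${avg_cost:.2f}")   # $1.14, matching the table
```

The same check holds for other rows (e.g., GPT-5.4: 318.8 / 5 = 63.8), which supports the equal-weighting reading.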

OpenHands Team (2025). OpenHands Index: A Comprehensive Leaderboard for AI Coding Agents. index.openhands.dev